Google Releases DiffusionGemma, A 26B Open Model That Writes Text Like an Image Generator

Google DeepMind released DiffusionGemma, a 26-billion-parameter open model that generates text using diffusion rather than sequential token prediction, reaching 1,000+ tokens per second on a single NVIDIA H100 GPU.

Fawad Malik, June 11, 2026 (Last Updated: June 11, 2026)

2 minutes read


Key Takeaways

DiffusionGemma is a 26 billion parameter Mixture of Experts system released under Apache 2.0, activating only 3.8 billion parameters during inference and running within 18GB of VRAM when quantised.
The model generates 256 tokens per forward pass using bidirectional attention, reaching over 1,000 tokens per second on a single H100 GPU and over 700 tokens per second on an NVIDIA GeForce RTX 5090.
Google frames DiffusionGemma as an experimental multimodal open model aimed at speed-critical local workflows; its standard Gemma 4 models remain the better option for applications requiring maximum output quality.
DiffusionGemma is available via Hugging Face with support for vLLM, Transformers, MLX, Unsloth, and NVIDIA NeMo, with official llama.cpp support coming soon.

On June 10, 2026, Google released DiffusionGemma, an experimental open model exploring text diffusion, an emerging alternative to traditional autoregressive language models.

Google says the model can generate text up to four times faster on dedicated GPUs by refining blocks of text in parallel rather than predicting tokens sequentially.

Based on the company’s earlier Gemini Diffusion research, DiffusionGemma is Google’s first open diffusion text model for local deployment.

It is one of the more unusual additions to the Gemma family, positioned for speed-critical workloads rather than as a replacement for Gemma 4.

For low-latency applications, the performance gains are notable, though Google acknowledges quality tradeoffs compared with conventional Gemma models.

Why Diffusion for Text Is Different From Everything Else

Every major language model in mainstream use today, including Claude and Gemini, generates text autoregressively, predicting one token at a time based on previously generated tokens.

While highly effective, that sequential process can limit inference speed. DiffusionGemma takes a different approach, using a diffusion-based architecture that starts with random placeholder tokens and iteratively refines an entire text block in parallel.

The technique is conceptually similar to diffusion image generators, which transform noisy inputs into coherent outputs through repeated refinement steps.

By updating multiple parts of a sequence simultaneously, DiffusionGemma aims to make better use of modern GPU hardware and increase text-generation throughput.

Because the model refines a whole block at once rather than writing strictly left to right, tasks like in-line code editing, real-time transcription cleanup, and complex markdown formatting are significantly more natural than they are in token-by-token systems.
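The refinement loop described in this section can be illustrated with a toy sketch. It is not Google's implementation: the function names, the placeholder token, and the confidence-based commit rule are all assumptions made for illustration, and the "model" here is a stand-in that already knows the answer and assigns random confidences. The point is only the control flow, where every position in the block gets a proposal at each step and several positions are committed together.

```python
import math
import random

MASK = "_"  # hypothetical placeholder for a position that is not yet decided


def toy_predict(block, target, rng):
    """Hypothetical stand-in for the network.

    Returns (proposed_token, confidence) for every position. A real model
    would compute these from the whole block using bidirectional attention.
    """
    return [(target[i], rng.random()) for i in range(len(block))]


def refine_block(target, steps, seed=0):
    """Fill a fully masked block in at most `steps` parallel refinement steps."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = [list(block)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        proposals = toy_predict(block, target, rng)
        # Commit an even share of the still-masked positions on each step,
        # choosing the ones the stand-in model is most confident about.
        k = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history


target = "the quick brown fox jumps over the lazy dog".split()
final, history = refine_block(target, steps=3)
for snapshot in history:
    print(" ".join(snapshot))
```

With nine tokens and three steps, the sketch commits three positions per step, so the block is complete after three passes instead of nine sequential predictions. Which positions fill in first depends on the random confidences.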

The Speed, the Trade-Off, and Where It Actually Runs

DiffusionGemma delivers more than 1,000 tokens per second on a single NVIDIA H100 GPU and more than 700 tokens per second on an RTX 5090, according to Google’s published benchmarks.
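To put those throughput figures in terms of waiting time, the short calculation below converts tokens per second into seconds per 256-token block, the block size quoted in the Key Takeaways. The helper name is hypothetical, and because the published figures are lower bounds ("more than"), the resulting times should be read as rough upper bounds rather than measurements.

```python
def seconds_for(tokens, tokens_per_second):
    """Time to produce `tokens` at a steady throughput (simple division)."""
    return tokens / tokens_per_second


BLOCK_TOKENS = 256  # block size quoted in the article

# Throughput figures quoted from Google's published benchmarks.
for gpu, tps in (("H100", 1000), ("RTX 5090", 700)):
    print(f"{gpu}: about {seconds_for(BLOCK_TOKENS, tps):.2f} s per block")
```

On these numbers a full block arrives in roughly a quarter of a second on an H100 and a little over a third of a second on an RTX 5090.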

NVIDIA has also optimised the model across its AI ecosystem, including GeForce RTX, RTX PRO, and DGX platforms.

As a 26B Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma can fit within roughly 18GB of VRAM when quantised, making local deployment possible on high-end consumer GPUs.
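A back-of-the-envelope estimate shows why quantisation matters for that 18GB figure. The article does not say which bit-width Google used, so the widths below are assumptions for illustration, and the helper name is hypothetical. The calculation counts all 26 billion parameters, since a Mixture of Experts model normally keeps every expert in memory even though only 3.8 billion parameters are active for a given token.

```python
def weight_memory_gb(params, bits_per_weight):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count quoted in the article

# Bit-widths are illustrative assumptions, not figures from Google.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
# 16-bit weights: 52.0 GB
# 8-bit weights: 26.0 GB
# 4-bit weights: 13.0 GB
```

Under these assumptions only the 4-bit case fits below 18GB, leaving about 5GB of headroom for activations and runtime overhead. The actual quantisation scheme may differ.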

Google’s positioning focuses on latency-sensitive workloads such as interactive code infilling, inline editing, and chat applications, where responsiveness matters more than peak benchmark performance.

DiffusionGemma offers a different trade-off: faster, lighter, more interactive, and architecturally novel. For developers already running Gemma models locally for agentic workflows, it is the speed-first variant worth adding to the evaluation stack.

Source: DiffusionGemma model overview
