Google Releases DiffusionGemma, A 26B Open Model That Writes Text Like an Image Generator
Google DeepMind released DiffusionGemma, a 26-billion-parameter open model that generates text using diffusion rather than sequential token prediction, reaching 1,000+ tokens per second on a single NVIDIA H100 GPU.
Released on June 10, 2026, DiffusionGemma is an experimental open model exploring text diffusion, an emerging alternative to traditional autoregressive language models.
Google says it can generate text up to four times faster on dedicated GPUs by refining blocks of text in parallel rather than predicting tokens sequentially.
Based on the company’s earlier Gemini Diffusion research, DiffusionGemma is Google’s first open diffusion text model for local deployment.
It is one of the more unusual additions to the Gemma family, positioned for speed-critical workloads rather than as a replacement for Gemma 4.
For low-latency applications, the performance gains are notable, though Google acknowledges quality tradeoffs compared with conventional Gemma models.
Why Diffusion for Text Is Different From Everything Else
Every major language model in mainstream use today, including Claude and Gemini, generates text autoregressively, predicting one token at a time based on previously generated tokens.
While highly effective, that sequential process can limit inference speed. DiffusionGemma takes a different approach, using a diffusion-based architecture that starts with random placeholder tokens and iteratively refines an entire text block in parallel.
The technique is conceptually similar to diffusion image generators, which transform noisy inputs into coherent outputs through repeated refinement steps.
By updating multiple parts of a sequence simultaneously, DiffusionGemma aims to make better use of modern GPU hardware and increase text-generation throughput.
Refining a whole block at once makes tasks like inline code editing, real-time transcription cleanup, and complex markdown formatting significantly more natural than they are in token-by-token systems.
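To make the contrast concrete, here is a toy sketch of block-wise refinement next to token-by-token generation. It is an illustration only, not DiffusionGemma's actual algorithm: the confidence-based unmasking rule, the `toy_predict` stand-in for the model, the `<mask>` placeholder, and the `tokens_per_step` parameter are all assumptions made for this example.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token; the real noise format is not described here


def toy_predict(block, target, rng):
    """Hypothetical stand-in for a denoising model.

    Returns a (token, confidence) guess for every position at once.
    A real model would compute these from learned weights; here the
    "prediction" is simply the known target with a random confidence.
    """
    return [(target[i], rng.random()) for i in range(len(block))]


def diffusion_generate(target, tokens_per_step=4, seed=0):
    """Fill a fully masked block by committing several positions per step."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    steps = 0
    while MASK in block:
        guesses = toy_predict(block, target, rng)
        # Rank the still-masked positions by confidence, highest first.
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        # Commit the most confident guesses together; the rest stay masked.
        for i in masked[:tokens_per_step]:
            block[i] = guesses[i][0]
        steps += 1
    return block, steps


def autoregressive_generate(target):
    """Baseline: one token per step, strictly left to right."""
    block = []
    for tok in target:
        block.append(tok)
    return block, len(target)


target = "the quick brown fox jumps over the lazy dog today".split()
diff_block, diff_steps = diffusion_generate(target)
ar_block, ar_steps = autoregressive_generate(target)
print(f"diffusion-style steps: {diff_steps}, autoregressive steps: {ar_steps}")
```

With a 10-token block and four positions committed per step, the toy loop finishes in 3 steps against 10 for the sequential baseline. Each step in a real model is a full forward pass, so the practical speedup depends on how many refinement steps are needed and how well the hardware parallelises them.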
The Speed, the Trade-Off, and Where It Actually Runs
DiffusionGemma delivers more than 1,000 tokens per second on a single NVIDIA H100 GPU and more than 700 tokens per second on an RTX 5090, according to Google’s published benchmarks.
NVIDIA has also optimised the model across its AI ecosystem, including GeForce RTX, RTX PRO, and DGX platforms.
As a 26B Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma can fit within roughly 18GB of VRAM when quantised, making local deployment possible on high-end consumer GPUs.
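A rough back-of-envelope check shows how a figure of that size can arise. The sketch below estimates memory for the weights alone at a few bit widths; the article does not state which quantisation format is used, so the bit widths are assumptions, and `weight_memory_gb` is a helper written for this example.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # total parameter count, from the article
ACTIVE_PARAMS = 3.8e9  # parameters active per token, from the article

# Bit widths are illustrative assumptions, not published settings.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")

# Share of the model that does work on any single token.
print(f"active share: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

This gives about 52 GB at 16 bits, 26 GB at 8 bits, and 13 GB at 4 bits. A footprint of roughly 18GB is therefore consistent with weights stored at around 4 to 5 bits plus runtime overhead, though that reading is an inference rather than a published detail. In a Mixture of Experts model, all experts typically have to be resident in memory, so the 3.8B active parameters reduce compute per token rather than the VRAM needed to hold the model.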
Google’s positioning focuses on latency-sensitive workloads such as interactive code infilling, inline editing, and chat applications, where responsiveness matters more than peak benchmark performance.
DiffusionGemma offers a different trade-off: faster, lighter, more interactive, and architecturally novel. For developers already running Gemma models locally for agentic workflows, it is the speed-first variant worth adding to the evaluation stack.
Source: DiffusionGemma model overview