DiffusionGemma introduces several milestones improving developer workflows, including faster generation, self-correction, and smaller sizes.
This feature bypasses memory-bandwidth limitations by shifting the bottleneck to compute.
DiffusionGemma delivers up to 4x faster token generation on GPUs.
The model achieves up to 700+ tokens per second on NVIDIA GeForce RTX 5090.
The model achieves 1000+ tokens per second on a single NVIDIA H100.
DiffusionGemma uses bidirectional attention to evaluate the entire text block simultaneously during generation.
This simultaneous evaluation enables real-time error correction and parallel context propagation.
DiffusionGemma is designed as a Mixture of Experts (MoE) model for efficient deployment.
The model is a 26B Mixture of Experts (MoE) model.
Only 3.8B parameters are activated during inference.
This allows quantized deployment within 18 GB VRAM limits.
DiffusionGemma's architecture shifts the primary bottleneck from memory bandwidth to compute, improving performance for LLMs on GPUs.
For traditional LLMs on GPUs, the primary bottleneck is memory bandwidth due to repeated model weight loading.
DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute.
The model generates and refines a 256-token canvas in parallel.
Providing the GPU with a large parallel workload utilizes tensor cores that would otherwise be idle.
DiffusionGemma starts with a canvas of random placeholder tokens and iteratively refines them in parallel.
Over multiple denoising passes, highly confident tokens help resolve adjacent positions, causing the entire sequence to snap into focus.
This mechanism enables variable length generation for sequences longer than 256 tokens.
Once a 256-token block is fully denoised, the model processes and commits it to the KV cache.
The model then transitions to the next block, initializing a fresh 256-token canvas conditioned on previously committed history.
This approach combines parallel block speed with the sequential stability of autoregressive models.
The Sudoku Solver showcases DiffusionGemma's customization for strict, multivariable constrained problems using parallel denoising.
Traditional autoregressive models struggle with Sudoku because they cannot evaluate future placeholders or backtrack.
A fine-tuning recipe and results using Hackable Diffusion, a modular JAX research toolbox, are released to demonstrate customization.
Sudoku is interesting for DiffusionGemma due to its strict horizontal, vertical, and 9x9 grid constraints.
An 81-character Sudoku string representation marks empty cells with periods.
DiffusionGemma's denoising step allows every canvas query to attend to all positions in parallel, unlike autoregressive models.
Information flows symmetrically across the board, resolving global dependencies in each step.
Under Uniform State Diffusion, the model evaluates the entire board simultaneously, allowing continuous self-correction.
If confidence drops, the sampler replaces digits with random ones to correct errors.
Fine-tuning on Sudoku demonstrates that adapters enhance early stopping.
The SFT-tuned model stabilizes faster than the base model.
Faster stabilization allows the engine to halt sooner, reducing latency and compute costs.
Fine-tuning DiffusionGemma significantly improves its ability to solve Sudoku puzzles.
The base DiffusionGemma model has a ~0% success rate for solving Sudoku puzzles.
Applying the simple JAX SFT recipe on a Sudoku dataset raises correctness to 80% success.
Fine-tuning also decreases the overall inference step count.
The base model is unable to solve Sudoku after 48 steps, while the fine-tuned (SFT) model solves it after 12 steps.
DiffusionGemma alternates between incremental prefill and denoising during inference to enable block autoregressive denoising.
This step uses causal attention to ingest the prompt context and write to the KV cache.
It runs once to prefill the initial context.
It then runs once per block to append each finalized 256-token canvas to the KV cache before denoising the next canvas.
This step uses bidirectional attention to iteratively denoise the canvas.
Query tokens at any canvas position can attend to all other canvas tokens and the KV cache.
The Denoiser's bidirectional attention allows every token on the canvas to attend to every other token, unlike AR models.
This makes the model more effective at solving non-sequential problems like Sudoku, where early digits respect late constraints.
The model can fix earlier mistakes because it iteratively refines the whole canvas.
If a token's confidence drops during a pass, the sampler can re-noise and replace it.
AR models lack this capability as they are 'stuck' with a token once generated, especially during long output sequences.
The 'block-autoregressive' approach allows the model to handle long sequences efficiently.
It combines the parallel speed of diffusion for blocks with the proven sequential stability of AR models for long-form text.
Using the same architecture as the Gemma 4 26B A4B model simplifies deployment for developers.
Developers only need to implement a denoising step, making integration into existing serving frameworks easier.
DiffusionGemma is integrated into vLLM to efficiently serve this experimental architecture.
The integration with vLLM allows the engine to run iterative parallel denoising loops efficiently across batched request streams.
Developers can deploy DiffusionGemma out of the box using vLLM's standard OpenAI-compatible local server.
An example command `vllm serve google/diffusiongemma-26B-A4B-it --max-model-len 262144 ...` shows how to deploy DiffusionGemma.
Developers can access various resources to explore non-autoregressive text generation with DiffusionGemma.
Access the experimental model weights, released under the Apache 2.0 license, directly on Hugging Face.
Review the Visual Guide to DiffusionGemma to understand text-based diffusion model mechanics.
Read more about DiffusionGemma in the Gemma documentation.
Run the model efficiently using vLLM, Hugging Face Transformers, SGLang, and MLX.
Official training recipes using Hackable Diffusion are released for rapid experimentation.
Explore efficient fine-tuning using Unsloth or NVIDIA NeMo.
Instantly deploy on Google Cloud using Model Garden or via NVIDIA NIM.
The model is optimized natively across hardware from consumer RTX 4090/5090 cards to enterprise Hopper and Blackwell servers.
DiffusionGemma introduces several milestones improving developer workflows, including faster generation, self-correction, and smaller sizes.
This feature bypasses memory-bandwidth limitations by shifting the bottleneck to compute.
DiffusionGemma delivers up to 4x faster token generation on GPUs.
The model achieves up to 700+ tokens per second on NVIDIA GeForce RTX 5090.
The model achieves 1000+ tokens per second on a single NVIDIA H100.
DiffusionGemma uses bidirectional attention to evaluate the entire text block simultaneously during generation.
This simultaneous evaluation enables real-time error correction and parallel context propagation.
DiffusionGemma is designed as a Mixture of Experts (MoE) model for efficient deployment.
The model is a 26B Mixture of Experts (MoE) model.
Only 3.8B parameters are activated during inference.
This allows quantized deployment within 18 GB VRAM limits.
DiffusionGemma's architecture shifts the primary bottleneck from memory bandwidth to compute, improving performance for LLMs on GPUs.
For traditional LLMs on GPUs, the primary bottleneck is memory bandwidth due to repeated model weight loading.
DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute.
The model generates and refines a 256-token canvas in parallel.
Providing the GPU with a large parallel workload utilizes tensor cores that would otherwise be idle.
DiffusionGemma starts with a canvas of random placeholder tokens and iteratively refines them in parallel.
Over multiple denoising passes, highly confident tokens help resolve adjacent positions, causing the entire sequence to snap into focus.
This mechanism enables variable length generation for sequences longer than 256 tokens.
Once a 256-token block is fully denoised, the model processes and commits it to the KV cache.
The model then transitions to the next block, initializing a fresh 256-token canvas conditioned on previously committed history.
This approach combines parallel block speed with the sequential stability of autoregressive models.
The Sudoku Solver showcases DiffusionGemma's customization for strict, multivariable constrained problems using parallel denoising.
Traditional autoregressive models struggle with Sudoku because they cannot evaluate future placeholders or backtrack.
A fine-tuning recipe and results using Hackable Diffusion, a modular JAX research toolbox, are released to demonstrate customization.
Sudoku is interesting for DiffusionGemma due to its strict horizontal, vertical, and 9x9 grid constraints.
An 81-character Sudoku string representation marks empty cells with periods.
DiffusionGemma's denoising step allows every canvas query to attend to all positions in parallel, unlike autoregressive models.
Information flows symmetrically across the board, resolving global dependencies in each step.
Under Uniform State Diffusion, the model evaluates the entire board simultaneously, allowing continuous self-correction.
If confidence drops, the sampler replaces digits with random ones to correct errors.
Fine-tuning on Sudoku demonstrates that adapters enhance early stopping.
The SFT-tuned model stabilizes faster than the base model.
Faster stabilization allows the engine to halt sooner, reducing latency and compute costs.
Fine-tuning DiffusionGemma significantly improves its ability to solve Sudoku puzzles.
The base DiffusionGemma model has a ~0% success rate for solving Sudoku puzzles.
Applying the simple JAX SFT recipe on a Sudoku dataset raises correctness to 80% success.
Fine-tuning also decreases the overall inference step count.
The base model is unable to solve Sudoku after 48 steps, while the fine-tuned (SFT) model solves it after 12 steps.
DiffusionGemma alternates between incremental prefill and denoising during inference to enable block autoregressive denoising.
This step uses causal attention to ingest the prompt context and write to the KV cache.
It runs once to prefill the initial context.
It then runs once per block to append each finalized 256-token canvas to the KV cache before denoising the next canvas.
This step uses bidirectional attention to iteratively denoise the canvas.
Query tokens at any canvas position can attend to all other canvas tokens and the KV cache.
The Denoiser's bidirectional attention allows every token on the canvas to attend to every other token, unlike AR models.
This makes the model more effective at solving non-sequential problems like Sudoku, where early digits respect late constraints.
The model can fix earlier mistakes because it iteratively refines the whole canvas.
If a token's confidence drops during a pass, the sampler can re-noise and replace it.
AR models lack this capability as they are 'stuck' with a token once generated, especially during long output sequences.
The 'block-autoregressive' approach allows the model to handle long sequences efficiently.
It combines the parallel speed of diffusion for blocks with the proven sequential stability of AR models for long-form text.
Using the same architecture as the Gemma 4 26B A4B model simplifies deployment for developers.
Developers only need to implement a denoising step, making integration into existing serving frameworks easier.
DiffusionGemma is integrated into vLLM to efficiently serve this experimental architecture.
The integration with vLLM allows the engine to run iterative parallel denoising loops efficiently across batched request streams.
Developers can deploy DiffusionGemma out of the box using vLLM's standard OpenAI-compatible local server.
An example command `vllm serve google/diffusiongemma-26B-A4B-it --max-model-len 262144 ...` shows how to deploy DiffusionGemma.
Developers can access various resources to explore non-autoregressive text generation with DiffusionGemma.
Access the experimental model weights, released under the Apache 2.0 license, directly on Hugging Face.
Review the Visual Guide to DiffusionGemma to understand text-based diffusion model mechanics.
Read more about DiffusionGemma in the Gemma documentation.
Run the model efficiently using vLLM, Hugging Face Transformers, SGLang, and MLX.
Official training recipes using Hackable Diffusion are released for rapid experimentation.
Explore efficient fine-tuning using Unsloth or NVIDIA NeMo.
Instantly deploy on Google Cloud using Model Garden or via NVIDIA NIM.
The model is optimized natively across hardware from consumer RTX 4090/5090 cards to enterprise Hopper and Blackwell servers.