This developer guide explains how to understand, serve, and customize DiffusionGemma, an experimental model built on the Gemma 4 backbone.

Made with Rinto — analyse your own content free

DiffusionGemma Milestones

DiffusionGemma introduces several milestones improving developer workflows, including faster generation, self-correction, and smaller sizes.

DiffusionGemma Architecture

DiffusionGemma's architecture shifts the primary bottleneck from memory bandwidth to compute, improving performance for LLMs on GPUs.

Traditional LLM bottleneck

For traditional LLMs on GPUs, the primary bottleneck is memory bandwidth due to repeated model weight loading.

Bypassing memory bandwidth

DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute.

Parallel 256-token canvas generation

The model generates and refines a 256-token canvas in parallel.

GPU tensor core utilization

Providing the GPU with a large parallel workload utilizes tensor cores that would otherwise be idle.

Uniform State Diffusion

DiffusionGemma starts with a canvas of random placeholder tokens and iteratively refines them in parallel.

Block Autoregressive Diffusion

This mechanism enables variable length generation for sequences longer than 256 tokens.

Sudoku Solving Showcase

The Sudoku Solver showcases DiffusionGemma's customization for strict, multivariable constrained problems using parallel denoising.

Traditional AR model struggles

Traditional autoregressive models struggle with Sudoku because they cannot evaluate future placeholders or backtrack.

Customization with Hackable Diffusion

A fine-tuning recipe and results using Hackable Diffusion, a modular JAX research toolbox, are released to demonstrate customization.

Why Sudoku is Interesting

Sudoku is interesting for DiffusionGemma due to its strict horizontal, vertical, and 9x9 grid constraints.

Sudoku Performance Impact

Fine-tuning DiffusionGemma significantly improves its ability to solve Sudoku puzzles.

Block Autoregressive Denoising Steps

DiffusionGemma alternates between incremental prefill and denoising during inference to enable block autoregressive denoising.

Prefill / Incremental Prefill

This step uses causal attention to ingest the prompt context and write to the KV cache.

Denoising

This step uses bidirectional attention to iteratively denoise the canvas.

Query token attention

Query tokens at any canvas position can attend to all other canvas tokens and the KV cache.

Global Context Awareness

The Denoiser's bidirectional attention allows every token on the canvas to attend to every other token, unlike AR models.

Self-Correction Capability

The model can fix earlier mistakes because it iteratively refines the whole canvas.

Efficient Long-Context Scaling

The 'block-autoregressive' approach allows the model to handle long sequences efficiently.

Simplified Deployment

Using the same architecture as the Gemma 4 26B A4B model simplifies deployment for developers.

Serving DiffusionGemma

DiffusionGemma is integrated into vLLM to efficiently serve this experimental architecture.

Getting Started Resources

Developers can access various resources to explore non-autoregressive text generation with DiffusionGemma.

▸ 7 Expand

APEX

DiffusionGemma Developer Guide

This developer guide explains how to understand, serve, and customize DiffusionGemma, an experimental model built on the Gemma 4 backbone.

Made with Rinto — analyse your own content free

▸ 3 Expand

CONC

DiffusionGemma Milestones

DiffusionGemma introduces several milestones improving developer workflows, including faster generation, self-correction, and smaller sizes.

▸ 3 Expand

SUBC

Compute-bound parallel generation

This feature bypasses memory-bandwidth limitations by shifting the bottleneck to compute.

STAT

Faster token generation on GPUs

DiffusionGemma delivers up to 4x faster token generation on GPUs.

STAT

NVIDIA GeForce RTX 5090 performance

The model achieves up to 700+ tokens per second on NVIDIA GeForce RTX 5090.

STAT

NVIDIA H100 performance

The model achieves 1000+ tokens per second on a single NVIDIA H100.

▸ 1 Expand

SUBC

Bidirectional context & self-correction

DiffusionGemma uses bidirectional attention to evaluate the entire text block simultaneously during generation.

DETL

Real-time error correction enabled

This simultaneous evaluation enables real-time error correction and parallel context propagation.

▸ 3 Expand

SUBC

Developer-friendly sizes

DiffusionGemma is designed as a Mixture of Experts (MoE) model for efficient deployment.

STAT

MoE model size

The model is a 26B Mixture of Experts (MoE) model.

STAT

Active parameters during inference

Only 3.8B parameters are activated during inference.

DETL

VRAM deployment limit

This allows quantized deployment within 18 GB VRAM limits.

▸ 6 Expand

CONC

DiffusionGemma Architecture

DiffusionGemma's architecture shifts the primary bottleneck from memory bandwidth to compute, improving performance for LLMs on GPUs.

DETL

Traditional LLM bottleneck

For traditional LLMs on GPUs, the primary bottleneck is memory bandwidth due to repeated model weight loading.

DETL

Bypassing memory bandwidth

DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute.

DETL

Parallel 256-token canvas generation

The model generates and refines a 256-token canvas in parallel.

DETL

GPU tensor core utilization

Providing the GPU with a large parallel workload utilizes tensor cores that would otherwise be idle.

▸ 1 Expand

SUBC

Uniform State Diffusion

DiffusionGemma starts with a canvas of random placeholder tokens and iteratively refines them in parallel.

DETL

Token resolution process

Over multiple denoising passes, highly confident tokens help resolve adjacent positions, causing the entire sequence to snap into focus.

▸ 3 Expand

SUBC

Block Autoregressive Diffusion

This mechanism enables variable length generation for sequences longer than 256 tokens.

DETL

Processing 256-token blocks

Once a 256-token block is fully denoised, the model processes and commits it to the KV cache.

DETL

Next block initialization

The model then transitions to the next block, initializing a fresh 256-token canvas conditioned on previously committed history.

INSG

Combines parallel speed and stability

This approach combines parallel block speed with the sequential stability of autoregressive models.

▸ 3 Expand

EXMP

Sudoku Solving Showcase

The Sudoku Solver showcases DiffusionGemma's customization for strict, multivariable constrained problems using parallel denoising.

DETL

Traditional AR model struggles

Traditional autoregressive models struggle with Sudoku because they cannot evaluate future placeholders or backtrack.

DETL

Customization with Hackable Diffusion

A fine-tuning recipe and results using Hackable Diffusion, a modular JAX research toolbox, are released to demonstrate customization.

▸ 4 Expand

CONC

Why Sudoku is Interesting

Sudoku is interesting for DiffusionGemma due to its strict horizontal, vertical, and 9x9 grid constraints.

DETL

81-character string representation

An 81-character Sudoku string representation marks empty cells with periods.

▸ 1 Expand

SUBC

Bidirectional Context Propagation

DiffusionGemma's denoising step allows every canvas query to attend to all positions in parallel, unlike autoregressive models.

DETL

Symmetric information flow

Information flows symmetrically across the board, resolving global dependencies in each step.

▸ 1 Expand

SUBC

Error Correction via Re-Noising

Under Uniform State Diffusion, the model evaluates the entire board simultaneously, allowing continuous self-correction.

DETL

Replacing digits with random ones

If confidence drops, the sampler replaces digits with random ones to correct errors.

▸ 2 Expand

SUBC

Efficient Early Stopping

Fine-tuning on Sudoku demonstrates that adapters enhance early stopping.

DETL

SFT-tuned model stabilization

The SFT-tuned model stabilizes faster than the base model.

INSG

Reduced latency and compute costs

Faster stabilization allows the engine to halt sooner, reducing latency and compute costs.

▸ 4 Expand

CONC

Sudoku Performance Impact

Fine-tuning DiffusionGemma significantly improves its ability to solve Sudoku puzzles.

STAT

Base model success rate

The base DiffusionGemma model has a ~0% success rate for solving Sudoku puzzles.

STAT

Fine-tuned success rate

Applying the simple JAX SFT recipe on a Sudoku dataset raises correctness to 80% success.

DETL

Decreased inference step count

Fine-tuning also decreases the overall inference step count.

EXMP

Visual Comparison Base vs SFT

The base model is unable to solve Sudoku after 48 steps, while the fine-tuned (SFT) model solves it after 12 steps.

▸ 2 Expand

CONC

Block Autoregressive Denoising Steps

DiffusionGemma alternates between incremental prefill and denoising during inference to enable block autoregressive denoising.

▸ 2 Expand

SUBC

Prefill / Incremental Prefill

This step uses causal attention to ingest the prompt context and write to the KV cache.

DETL

Initial context prefill

It runs once to prefill the initial context.

DETL

Appending finalized canvases

It then runs once per block to append each finalized 256-token canvas to the KV cache before denoising the next canvas.

▸ 5 Expand

SUBC

Denoising

This step uses bidirectional attention to iteratively denoise the canvas.

DETL

Query token attention

Query tokens at any canvas position can attend to all other canvas tokens and the KV cache.

▸ 1 Expand

INSG

Global Context Awareness

The Denoiser's bidirectional attention allows every token on the canvas to attend to every other token, unlike AR models.

JUST

Effective for non-sequential problems

This makes the model more effective at solving non-sequential problems like Sudoku, where early digits respect late constraints.

▸ 2 Expand

INSG

Self-Correction Capability

The model can fix earlier mistakes because it iteratively refines the whole canvas.

DETL

Re-noising mechanism

If a token's confidence drops during a pass, the sampler can re-noise and replace it.

JUST

AR models lack self-correction

AR models lack this capability as they are 'stuck' with a token once generated, especially during long output sequences.

▸ 1 Expand

INSG

Efficient Long-Context Scaling

The 'block-autoregressive' approach allows the model to handle long sequences efficiently.

JUST

Combines diffusion speed with AR stability

It combines the parallel speed of diffusion for blocks with the proven sequential stability of AR models for long-form text.

▸ 1 Expand

INSG

Simplified Deployment

Using the same architecture as the Gemma 4 26B A4B model simplifies deployment for developers.

JUST

Only denoising step implementation needed

Developers only need to implement a denoising step, making integration into existing serving frameworks easier.

▸ 3 Expand

CONC

Serving DiffusionGemma

DiffusionGemma is integrated into vLLM to efficiently serve this experimental architecture.

DETL

vLLM integration

The integration with vLLM allows the engine to run iterative parallel denoising loops efficiently across batched request streams.

DETL

Out-of-the-box deployment

Developers can deploy DiffusionGemma out of the box using vLLM's standard OpenAI-compatible local server.

DETL

Deployment Command Example

An example command `vllm serve google/diffusiongemma-26B-A4B-it --max-model-len 262144 ...` shows how to deploy DiffusionGemma.

▸ 8 Expand

CONC

Getting Started Resources

Developers can access various resources to explore non-autoregressive text generation with DiffusionGemma.

DETL

Download Weights

Access the experimental model weights, released under the Apache 2.0 license, directly on Hugging Face.

DETL

Integrate & Learn

Review the Visual Guide to DiffusionGemma to understand text-based diffusion model mechanics.

DETL

Gemma documentation

Read more about DiffusionGemma in the Gemma documentation.

DETL

Inference Frameworks

Run the model efficiently using vLLM, Hugging Face Transformers, SGLang, and MLX.

DETL

Adapt & Fine-Tune

Official training recipes using Hackable Diffusion are released for rapid experimentation.

DETL

Efficient fine-tuning options

Explore efficient fine-tuning using Unsloth or NVIDIA NeMo.

DETL

Deployment Options

Instantly deploy on Google Cloud using Model Garden or via NVIDIA NIM.

DETL

Optimized Hardware Stack

The model is optimized natively across hardware from consumer RTX 4090/5090 cards to enterprise Hopper and Blackwell servers.