{"sourceUrl":"https://developers.googleblog.com/diffusiongemma-the-developer-guide/","sourceType":"url","contentType":"Explainer","apex":{"id":"n1","type":"APEX","label":"DiffusionGemma Developer Guide","text":"This developer guide explains how to understand, serve, and customize DiffusionGemma, an experimental model built on the Gemma 4 backbone.","children":[{"id":"n2","type":"CONC","label":"DiffusionGemma Milestones","text":"DiffusionGemma introduces several milestones improving developer workflows, including faster generation, self-correction, and smaller sizes.","parentId":"n1","children":[{"id":"n3","type":"SUBC","label":"Compute-bound parallel generation","text":"This feature bypasses memory-bandwidth limitations by shifting the bottleneck to compute.","parentId":"n2","children":[{"id":"n4","type":"STAT","label":"Faster token generation on GPUs","text":"DiffusionGemma delivers up to 4x faster token generation on GPUs.","parentId":"n3","children":[]},{"id":"n5","type":"STAT","label":"NVIDIA GeForce RTX 5090 performance","text":"The model achieves up to 700+ tokens per second on NVIDIA GeForce RTX 5090.","parentId":"n3","children":[]},{"id":"n6","type":"STAT","label":"NVIDIA H100 performance","text":"The model achieves 1000+ tokens per second on a single NVIDIA H100.","parentId":"n3","children":[]}]},{"id":"n7","type":"SUBC","label":"Bidirectional context & self-correction","text":"DiffusionGemma uses bidirectional attention to evaluate the entire text block simultaneously during generation.","parentId":"n2","children":[{"id":"n8","type":"DETL","label":"Real-time error correction enabled","text":"This simultaneous evaluation enables real-time error correction and parallel context propagation.","parentId":"n7","children":[]}]},{"id":"n9","type":"SUBC","label":"Developer-friendly sizes","text":"DiffusionGemma is designed as a Mixture of Experts (MoE) model for efficient deployment.","parentId":"n2","children":[{"id":"n10","type":"STAT","label":"MoE model 
size","text":"The model is a 26B Mixture of Experts (MoE) model.","parentId":"n9","children":[]},{"id":"n11","type":"STAT","label":"Active parameters during inference","text":"Only 3.8B parameters are activated during inference.","parentId":"n9","children":[]},{"id":"n12","type":"DETL","label":"VRAM deployment limit","text":"This allows quantized deployment within 18 GB VRAM limits.","parentId":"n9","children":[]}]}]},{"id":"n13","type":"CONC","label":"DiffusionGemma Architecture","text":"DiffusionGemma's architecture shifts the primary bottleneck from memory bandwidth to compute, improving performance for LLMs on GPUs.","parentId":"n1","children":[{"id":"n14","type":"DETL","label":"Traditional LLM bottleneck","text":"For traditional LLMs on GPUs, the primary bottleneck is memory bandwidth due to repeated model weight loading.","parentId":"n13","children":[]},{"id":"n15","type":"DETL","label":"Bypassing memory bandwidth","text":"DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute.","parentId":"n13","children":[]},{"id":"n16","type":"DETL","label":"Parallel 256-token canvas generation","text":"The model generates and refines a 256-token canvas in parallel.","parentId":"n13","children":[]},{"id":"n17","type":"DETL","label":"GPU tensor core utilization","text":"Providing the GPU with a large parallel workload utilizes tensor cores that would otherwise be idle.","parentId":"n13","children":[]},{"id":"n18","type":"SUBC","label":"Uniform State Diffusion","text":"DiffusionGemma starts with a canvas of random placeholder tokens and iteratively refines them in parallel.","parentId":"n13","children":[{"id":"n19","type":"DETL","label":"Token resolution process","text":"Over multiple denoising passes, highly confident tokens help resolve adjacent positions, causing the entire sequence to snap into focus.","parentId":"n18","children":[]}]},{"id":"n20","type":"SUBC","label":"Block Autoregressive Diffusion","text":"This mechanism 
enables variable-length generation for sequences longer than 256 tokens.","parentId":"n13","children":[{"id":"n21","type":"DETL","label":"Processing 256-token blocks","text":"Once a 256-token block is fully denoised, the model processes and commits it to the KV cache.","parentId":"n20","children":[]},{"id":"n22","type":"DETL","label":"Next block initialization","text":"The model then transitions to the next block, initializing a fresh 256-token canvas conditioned on previously committed history.","parentId":"n20","children":[]},{"id":"n23","type":"INSG","label":"Combines parallel speed and stability","text":"This approach combines parallel block speed with the sequential stability of autoregressive models.","parentId":"n20","children":[]}]}]},{"id":"n24","type":"EXMP","label":"Sudoku Solving Showcase","text":"The Sudoku Solver showcases DiffusionGemma's customization for strict, multivariable constrained problems using parallel denoising.","parentId":"n1","children":[{"id":"n25","type":"DETL","label":"Traditional AR model struggles","text":"Traditional autoregressive models struggle with Sudoku because they cannot evaluate future placeholders or backtrack.","parentId":"n24","children":[]},{"id":"n26","type":"DETL","label":"Customization with Hackable Diffusion","text":"A fine-tuning recipe and results using Hackable Diffusion, a modular JAX research toolbox, are released to demonstrate customization.","parentId":"n24","children":[]},{"id":"n27","type":"CONC","label":"Why Sudoku is Interesting","text":"Sudoku is interesting for DiffusionGemma because of its strict row, column, and 3x3 box constraints on the 9x9 grid.","parentId":"n24","children":[{"id":"n28","type":"DETL","label":"81-character string representation","text":"An 81-character Sudoku string representation marks empty cells with periods.","parentId":"n27","children":[]},{"id":"n29","type":"SUBC","label":"Bidirectional Context Propagation","text":"DiffusionGemma's denoising step allows every canvas query to attend 
to all positions in parallel, unlike autoregressive models.","parentId":"n27","children":[{"id":"n30","type":"DETL","label":"Symmetric information flow","text":"Information flows symmetrically across the board, resolving global dependencies in each step.","parentId":"n29","children":[]}]},{"id":"n31","type":"SUBC","label":"Error Correction via Re-Noising","text":"Under Uniform State Diffusion, the model evaluates the entire board simultaneously, allowing continuous self-correction.","parentId":"n27","children":[{"id":"n32","type":"DETL","label":"Replacing digits with random ones","text":"If confidence in a digit drops, the sampler replaces it with a random digit so that later passes can correct the error.","parentId":"n31","children":[]}]},{"id":"n33","type":"SUBC","label":"Efficient Early Stopping","text":"Fine-tuning on Sudoku demonstrates that adapters enhance early stopping.","parentId":"n27","children":[{"id":"n34","type":"DETL","label":"SFT-tuned model stabilization","text":"The SFT-tuned model stabilizes faster than the base model.","parentId":"n33","children":[]},{"id":"n35","type":"INSG","label":"Reduced latency and compute costs","text":"Faster stabilization allows the engine to halt sooner, reducing latency and compute costs.","parentId":"n33","children":[]}]}]}]},{"id":"n36","type":"CONC","label":"Sudoku Performance Impact","text":"Fine-tuning DiffusionGemma significantly improves its ability to solve Sudoku puzzles.","parentId":"n1","children":[{"id":"n37","type":"STAT","label":"Base model success rate","text":"The base DiffusionGemma model has a ~0% success rate for solving Sudoku puzzles.","parentId":"n36","children":[]},{"id":"n38","type":"STAT","label":"Fine-tuned success rate","text":"Applying the simple JAX SFT recipe on a Sudoku dataset raises the success rate to 80%.","parentId":"n36","children":[]},{"id":"n39","type":"DETL","label":"Decreased inference step count","text":"Fine-tuning also decreases the overall inference step 
count.","parentId":"n36","children":[]},{"id":"n40","type":"EXMP","label":"Visual Comparison Base vs SFT","text":"The base model is unable to solve Sudoku after 48 steps, while the fine-tuned (SFT) model solves it after 12 steps.","parentId":"n36","children":[]}]},{"id":"n41","type":"CONC","label":"Block Autoregressive Denoising Steps","text":"DiffusionGemma alternates between incremental prefill and denoising during inference to enable block autoregressive denoising.","parentId":"n1","children":[{"id":"n42","type":"SUBC","label":"Prefill / Incremental Prefill","text":"This step uses causal attention to ingest the prompt context and write to the KV cache.","parentId":"n41","children":[{"id":"n43","type":"DETL","label":"Initial context prefill","text":"It runs once to prefill the initial context.","parentId":"n42","children":[]},{"id":"n44","type":"DETL","label":"Appending finalized canvases","text":"It then runs once per block to append each finalized 256-token canvas to the KV cache before denoising the next canvas.","parentId":"n42","children":[]}]},{"id":"n45","type":"SUBC","label":"Denoising","text":"This step uses bidirectional attention to iteratively denoise the canvas.","parentId":"n41","children":[{"id":"n46","type":"DETL","label":"Query token attention","text":"Query tokens at any canvas position can attend to all other canvas tokens and the KV cache.","parentId":"n45","children":[]},{"id":"n47","type":"INSG","label":"Global Context Awareness","text":"The denoiser's bidirectional attention allows every token on the canvas to attend to every other token, unlike AR models.","parentId":"n45","children":[{"id":"n48","type":"JUST","label":"Effective for non-sequential problems","text":"This makes the model more effective at solving non-sequential problems like Sudoku, where early digits must respect constraints that appear later.","parentId":"n47","children":[]}]},{"id":"n49","type":"INSG","label":"Self-Correction Capability","text":"The model can fix earlier mistakes because 
it iteratively refines the whole canvas.","parentId":"n45","children":[{"id":"n50","type":"DETL","label":"Re-noising mechanism","text":"If a token's confidence drops during a pass, the sampler can re-noise and replace it.","parentId":"n49","children":[]},{"id":"n51","type":"JUST","label":"AR models lack self-correction","text":"AR models lack this capability as they are 'stuck' with a token once generated, especially during long output sequences.","parentId":"n49","children":[]}]},{"id":"n52","type":"INSG","label":"Efficient Long-Context Scaling","text":"The 'block-autoregressive' approach allows the model to handle long sequences efficiently.","parentId":"n45","children":[{"id":"n53","type":"JUST","label":"Combines diffusion speed with AR stability","text":"It combines the parallel speed of diffusion for blocks with the proven sequential stability of AR models for long-form text.","parentId":"n52","children":[]}]},{"id":"n54","type":"INSG","label":"Simplified Deployment","text":"Using the same architecture as the Gemma 4 26B A4B model simplifies deployment for developers.","parentId":"n45","children":[{"id":"n55","type":"JUST","label":"Only denoising step implementation needed","text":"Developers only need to implement a denoising step, making integration into existing serving frameworks easier.","parentId":"n54","children":[]}]}]}]},{"id":"n56","type":"CONC","label":"Serving DiffusionGemma","text":"DiffusionGemma is integrated into vLLM to efficiently serve this experimental architecture.","parentId":"n1","children":[{"id":"n57","type":"DETL","label":"vLLM integration","text":"The integration with vLLM allows the engine to run iterative parallel denoising loops efficiently across batched request streams.","parentId":"n56","children":[]},{"id":"n58","type":"DETL","label":"Out-of-the-box deployment","text":"Developers can deploy DiffusionGemma out of the box using vLLM's standard OpenAI-compatible local 
server.","parentId":"n56","children":[]},{"id":"n59","type":"DETL","label":"Deployment Command Example","text":"An example command `vllm serve google/diffusiongemma-26B-A4B-it --max-model-len 262144 ...` shows how to deploy DiffusionGemma.","parentId":"n56","children":[]}]},{"id":"n60","type":"CONC","label":"Getting Started Resources","text":"Developers can access various resources to explore non-autoregressive text generation with DiffusionGemma.","parentId":"n1","children":[{"id":"n61","type":"DETL","label":"Download Weights","text":"Access the experimental model weights, released under the Apache 2.0 license, directly on Hugging Face.","parentId":"n60","children":[]},{"id":"n62","type":"DETL","label":"Integrate & Learn","text":"Review the Visual Guide to DiffusionGemma to understand text-based diffusion model mechanics.","parentId":"n60","children":[]},{"id":"n63","type":"DETL","label":"Gemma documentation","text":"Read more about DiffusionGemma in the Gemma documentation.","parentId":"n60","children":[]},{"id":"n64","type":"DETL","label":"Inference Frameworks","text":"Run the model efficiently using vLLM, Hugging Face Transformers, SGLang, and MLX.","parentId":"n60","children":[]},{"id":"n65","type":"DETL","label":"Adapt & Fine-Tune","text":"Official training recipes using Hackable Diffusion are released for rapid experimentation.","parentId":"n60","children":[]},{"id":"n66","type":"DETL","label":"Efficient fine-tuning options","text":"Explore efficient fine-tuning using Unsloth or NVIDIA NeMo.","parentId":"n60","children":[]},{"id":"n67","type":"DETL","label":"Deployment Options","text":"Instantly deploy on Google Cloud using Model Garden or via NVIDIA NIM.","parentId":"n60","children":[]},{"id":"n68","type":"DETL","label":"Optimized Hardware Stack","text":"The model is optimized natively across hardware from consumer RTX 4090/5090 cards to enterprise Hopper and Blackwell 
servers.","parentId":"n60","children":[]}]}]},"slug":"analysis-36175a","sharedAt":{"_seconds":1781154395,"_nanoseconds":751000000},"title":"DiffusionGemma: The Developer Guide"}