Abstract
Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, showing that the intrinsically bidirectional attention of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process is highly sensitive to sequence length, prone to hallucination, and prohibitively expensive at inference time without specialized optimizations. To address these limitations, we propose \textbf{S}elf-adaptive \textbf{S}chema \textbf{S}caffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements over the baseline: a 65\% increase in structural adherence, a 48\% improvement in content fidelity, and a 17\% reduction in hallucination rate. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.
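To make the scaffolding idea concrete, the sketch below shows one plausible way to render a target JSON schema as a partially masked output sequence, so that a masked dLLM only has to denoise the value slots while the structural tokens stay fixed. Everything here (`MASK_TOKEN`, `render_scaffold`, the per-field `max_tokens` width, and the stub `denoise_values`) is an illustrative assumption, not the paper's released implementation.

```python
# Minimal sketch of schema scaffolding for a masked diffusion LLM.
# All names here are illustrative assumptions, not the paper's code.

MASK_TOKEN = "[MASK]"

def render_scaffold(schema: dict) -> str:
    """Render a JSON schema as a literal skeleton: structural tokens
    (braces, keys, quotes, commas) are fixed text, while each value is a
    run of mask tokens left for the dLLM to denoise in place."""
    slots = []
    for key, spec in schema["properties"].items():
        # Reserve a fixed-width run of masks per value; adapting this
        # width per field is presumably where the "self-adaptive" part
        # of S^3 would come in.
        slot = " ".join([MASK_TOKEN] * spec.get("max_tokens", 4))
        slots.append(f'"{key}": "{slot}"')
    return "{" + ", ".join(slots) + "}"

def denoise_values(scaffold: str, prompt: str) -> str:
    """Stub for the dLLM sampler: a real model would iteratively unmask
    only the MASK_TOKEN positions, attending bidirectionally to both the
    prompt and the already-fixed structural tokens."""
    raise NotImplementedError("plug in an actual masked diffusion sampler")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "max_tokens": 3},
        "city": {"type": "string", "max_tokens": 2},
    },
}

print(render_scaffold(schema))
# {"name": "[MASK] [MASK] [MASK]", "city": "[MASK] [MASK]"}
```

Because the braces, keys, and quotes are placed up front rather than generated, the model would spend denoising steps only on content positions, which is consistent with the abstract's claims of improved structural adherence and reduced inference cost.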
URL
https://arxiv.org/abs/2507.04504