Paper Reading AI Learner

Unveiling the Potential of Diffusion Large Language Model in Controllable Generation

2025-07-06 18:41:34
Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang

Abstract

Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose Self-adaptive Schema Scaffolding (S³), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that S³ achieves substantial improvements: a 65% increase in structural adherence, a 48% enhancement in content fidelity, and a 17% reduction in hallucination rates compared to the baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.
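The core idea the abstract describes, injecting the target schema into the output context so the dLLM only denoises the unknown value slots, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the mask token name, the `slot_len` parameter, and the `denoise_fn` stub are all hypothetical, and a real masked-diffusion LLM would predict all masked positions in parallel while attending bidirectionally to the fixed schema tokens.

```python
# Illustrative sketch of schema scaffolding for a masked diffusion LLM.
# The JSON skeleton is fixed in the output context up front; only the
# mask tokens inside the value slots are left for the model to denoise.
# `MASK`, `slot_len`, and `denoise_fn` are assumptions for illustration.

MASK = "[MASK]"

def build_scaffold(schema_keys, slot_len=3):
    """Render a JSON skeleton whose values are runs of mask tokens."""
    lines = ["{"]
    for i, key in enumerate(schema_keys):
        slots = " ".join([MASK] * slot_len)
        comma = "," if i < len(schema_keys) - 1 else ""
        lines.append(f'  "{key}": "{slots}"{comma}')
    lines.append("}")
    return "\n".join(lines)

def fill_masks(scaffold, denoise_fn):
    """Replace each mask token via a (stubbed) denoising call.

    A real dLLM would fill all masked positions over a few parallel
    denoising steps, conditioned on the surrounding schema tokens;
    here we substitute them one at a time for simplicity."""
    out = scaffold
    while MASK in out:
        out = out.replace(MASK, denoise_fn(out), 1)
    return out

if __name__ == "__main__":
    scaffold = build_scaffold(["name", "year"], slot_len=2)
    print(scaffold)  # braces and keys are fixed; only slots are masked
    filled = fill_masks(scaffold, lambda ctx: "x")
    print(filled)
```

Because the braces, keys, and punctuation are never generated, the model's compute is spent only on the value slots, which is one plausible reading of how the approach "reduces unnecessary computation while improving controllability".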


URL

https://arxiv.org/abs/2507.04504

PDF

https://arxiv.org/pdf/2507.04504.pdf

