Decoupled Diffusion Sparks Adaptive Scene Generation

2025-04-14 17:59:57
Yunsong Zhou, Naisheng Ye, William Ljungbergh, Tianyu Li, Jiazhi Yang, Zetong Yang, Hongzi Zhu, Christoffer Petersson, Hongyang Li

Abstract

Controllable scene generation could substantially reduce the cost of collecting diverse data for autonomous driving. Prior works formulate traffic layout generation as a predictive process, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full-sequence denoising hinders online reaction, while short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios because open datasets are dominated by safe, ordinary driving behaviors. To overcome these limitations, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinary and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-ins, sudden braking, and collisions. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.
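
The idea of tokens carrying independent noise states, with some tokens held clean, can be illustrated with a toy example. What follows is a minimal sketch under stated assumptions, not the Nexus implementation: the abstract does not give the corruption formula, the token layout, or the masking rule, so the variance-preserving noising step, the 1-D toy tokens, and the names sample_noise_levels, add_noise, and clean_mask are all hypothetical choices made for illustration.

```python
import math
import random


def sample_noise_levels(num_tokens, clean_mask, rng):
    """Draw one independent noise level in [0, 1) per token.

    Tokens flagged in clean_mask get level 0.0, i.e. they are left
    un-noised (assumed here to stand for known states such as the
    current frame or a goal state).
    """
    return [0.0 if clean_mask[i] else rng.random() for i in range(num_tokens)]


def add_noise(tokens, levels, rng):
    """Corrupt each token with its own noise level.

    Assumption: a variance-preserving mix
        x_noisy = sqrt(1 - k) * x + sqrt(k) * eps,  eps ~ N(0, 1),
    which is a common diffusion choice, not one stated in the abstract.
    """
    noisy = []
    for x, k in zip(tokens, levels):
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(1.0 - k) * x + math.sqrt(k) * eps)
    return noisy


rng = random.Random(0)

# Toy scene: one agent's 1-D position over five timesteps (hypothetical data).
tokens = [0.0, 0.5, 1.0, 1.5, 2.0]

# Keep the first token (current state) and the last token (goal state) clean.
clean_mask = [True, False, False, False, True]

levels = sample_noise_levels(len(tokens), clean_mask, rng)
noisy = add_noise(tokens, levels, rng)

for t, (x, k, y) in enumerate(zip(tokens, levels, noisy)):
    print(f"t={t} level={k:.2f} clean={x:.2f} noisy={y:.2f}")
```

In this sketch the masked tokens pass through unchanged while every other token is corrupted by a different amount, which is one way a denoiser could be trained to condition on known states while the remaining tokens sit at mixed noise levels.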

URL

https://arxiv.org/abs/2504.10485

PDF

https://arxiv.org/pdf/2504.10485.pdf

