X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability

2025-06-16 14:43:18
Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, Gim Hee Lee

Abstract

Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, the generation of large-scale 3D scenes that require spatial coherence remains underexplored. In this paper, we propose X-Scene, a novel framework for large-scale driving scene generation that achieves both geometric intricacy and appearance fidelity, while offering flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level conditions such as user-provided or text-driven layout for detailed scene composition and high-level semantic guidance such as user-intent and LLM-enriched text prompts for efficient customization. To enhance geometrical and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and the corresponding multiview images, while ensuring alignment between modalities. Additionally, we extend the generated local region into a large-scale scene through consistency-aware scene outpainting, which extrapolates new occupancy and images conditioned on the previously generated area, enhancing spatial continuity and preserving visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as scene exploration. Comprehensive experiments demonstrate that X-Scene significantly advances controllability and fidelity for large-scale driving scene generation, empowering data generation and simulation for autonomous driving.
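The outpainting step described in the abstract (new regions generated conditioned on the previously generated area) can be illustrated with a toy sketch. This is not the authors' implementation: the chunk size, the overlap width, the integer "semantic labels", and the `generate_chunk` stand-in (which fills unknown cells at random instead of running a diffusion model) are all assumptions made for illustration. It only shows the control flow of extending a scene chunk by chunk while keeping the overlapping cells fixed.

```python
import random

CHUNK = 8    # side length of one generated chunk, in cells (assumed value)
OVERLAP = 2  # columns shared with the already generated area (assumed value)


def generate_chunk(window, rng):
    """Hypothetical stand-in for a conditional generator.

    Cells that are already known (not None) are kept unchanged, which is
    what keeps the overlap consistent; unknown cells get a random label 0..3.
    """
    return [
        [cell if cell is not None else rng.randint(0, 3) for cell in row]
        for row in window
    ]


def outpaint_row(num_chunks, seed=0):
    """Extend a scene to the right, one chunk at a time."""
    rng = random.Random(seed)
    stride = CHUNK - OVERLAP
    width = CHUNK + stride * (num_chunks - 1)
    scene = [[None] * width for _ in range(CHUNK)]
    for i in range(num_chunks):
        x0 = i * stride
        # The window contains OVERLAP known columns from the previous chunk
        # (none for the first chunk) plus the new, still empty columns.
        window = [row[x0:x0 + CHUNK] for row in scene]
        chunk = generate_chunk(window, rng)
        for r in range(CHUNK):
            scene[r][x0:x0 + CHUNK] = chunk[r]
    return scene


scene = outpaint_row(num_chunks=3)
print(len(scene), len(scene[0]))
```

In a real system the stand-in would be replaced by a learned generator that takes the known region as conditioning; the loop structure is the only part this sketch is meant to convey.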

URL

https://arxiv.org/abs/2506.13558

PDF

https://arxiv.org/pdf/2506.13558.pdf

