Abstract
We present Genesis, a unified framework for the joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. The two modalities are coupled directly through a shared latent space, enabling coherent evolution across the visual and geometric domains. To guide generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and that its generated data benefit downstream tasks including segmentation and 3D detection, validating their semantic fidelity and practical utility.
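To make the shared-latent coupling concrete, below is a minimal sketch of the two-stage pipeline the abstract describes: a 3D-VAE compresses the multi-view clip into latents, a bridge projects them into a shared token space, and a BEV-aware LiDAR head is conditioned on those tokens. All class names, tensor shapes, and the cross-attention conditioning interface (`Vae3D`, `SharedLatentBridge`, `BevLidarHead`) are assumptions for illustration only; the paper does not publish this API, and the NeRF-based rendering and adaptive sampling are replaced here by a placeholder BEV decoder.

```python
# Hypothetical sketch of the Genesis two-stage, shared-latent design.
# Everything below (names, shapes, layers) is illustrative, not the
# authors' implementation.
import torch
import torch.nn as nn


class Vae3D(nn.Module):
    """Toy stand-in for the 3D-VAE: compresses a multi-view clip to a latent."""
    def __init__(self, in_ch: int = 3, latent_ch: int = 8):
        super().__init__()
        self.enc = nn.Conv3d(in_ch, latent_ch, kernel_size=4, stride=4)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, C, T, H, W) -> latent: (B, latent_ch, T/4, H/4, W/4)
        return self.enc(clip)


class SharedLatentBridge(nn.Module):
    """Flattens video latents into tokens that condition the LiDAR branch,
    so both modalities are generated from one shared representation."""
    def __init__(self, latent_ch: int = 8, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(latent_ch, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        tokens = z.flatten(2).transpose(1, 2)  # (B, N, latent_ch)
        return self.proj(tokens)               # (B, N, dim)


class BevLidarHead(nn.Module):
    """Toy BEV-aware LiDAR generator: learned BEV queries cross-attend to the
    shared tokens, then decode a per-cell value (placeholder for the paper's
    NeRF-based rendering with adaptive sampling)."""
    def __init__(self, dim: int = 256, bev_hw: int = 32):
        super().__init__()
        self.query = nn.Parameter(torch.randn(bev_hw * bev_hw, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.decode = nn.Linear(dim, 1)
        self.bev_hw = bev_hw

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        q = self.query.unsqueeze(0).expand(tokens.size(0), -1, -1)
        bev, _ = self.attn(q, tokens, tokens)  # condition BEV on video latents
        return self.decode(bev).view(-1, 1, self.bev_hw, self.bev_hw)


# Usage: one coupled forward pass over a short multi-view clip.
clip = torch.randn(1, 3, 8, 64, 64)            # (B, C, T, H, W)
tokens = SharedLatentBridge()(Vae3D()(clip))   # shared latent tokens
bev = BevLidarHead()(tokens)                   # (1, 1, 32, 32) BEV grid
```

The design point the sketch illustrates is that the LiDAR branch never sees raw pixels: it is conditioned only on the shared latent tokens, which is what lets the visual and geometric outputs evolve coherently.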
URL
https://arxiv.org/abs/2506.07497