Abstract
Accurate, high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily rely on 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and in integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantically rich 3D representation, alongside a numerical representation of the driving scene, to enable comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. We further design a foreground-aware masked (FGM) loss to improve the generation of small objects. DualDiff achieves state-of-the-art FID scores and consistently better results on downstream BEV segmentation and 3D object detection tasks.
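The abstract does not specify how the foreground-aware masked (FGM) loss is formulated; below is a minimal sketch of one plausible realization, assuming standard epsilon-prediction diffusion training in PyTorch with a binary foreground mask rendered from the 3D boxes. The function name, the `fg_weight` factor, and the masking scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: up-weight the denoising loss on foreground pixels so that
# small objects contribute more to the gradient than the large background.
import torch
import torch.nn.functional as F

def fgm_loss(pred_noise: torch.Tensor,
             true_noise: torch.Tensor,
             fg_mask: torch.Tensor,
             fg_weight: float = 2.0) -> torch.Tensor:
    """Foreground-weighted MSE between predicted and target noise.

    pred_noise, true_noise: (B, C, H, W) latent-noise prediction and target.
    fg_mask: (B, 1, H, W) binary mask, 1 where a foreground object projects.
    fg_weight: assumed scalar up-weighting factor for foreground regions.
    """
    per_pixel = F.mse_loss(pred_noise, true_noise, reduction="none")
    weights = 1.0 + (fg_weight - 1.0) * fg_mask  # background keeps weight 1
    return (weights * per_pixel).mean()
```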
URL
https://arxiv.org/abs/2505.01857