Abstract
3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by discretizing the physical space into a grid map. The widely adopted projection-first deformable attention, though efficient at transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose a learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, such as map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal its potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The code and benchmark will be released soon.
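The abstract contrasts projection-first deformable attention with the proposed learning-first view attention but gives no implementation details. Below is a minimal, hypothetical PyTorch sketch of what a learning-first aggregation could look like: each 3D query directly predicts per-view sampling locations and soft attention weights, rather than being hard-gated by where it projects into each camera. All names (ViewAttentionSketch, num_views, num_points) and the layer layout are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of "learning-first" multi-view attention.
# Not the ViewFormer code; shapes and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewAttentionSketch(nn.Module):
    def __init__(self, embed_dim=256, num_views=6, num_points=4):
        super().__init__()
        self.num_views = num_views
        self.num_points = num_points
        # Projection-first methods decide view visibility by projecting a
        # 3D reference point into each camera. Here, the query itself
        # learns per-view sampling locations and attention weights.
        self.sampling_offsets = nn.Linear(embed_dim, num_views * num_points * 2)
        self.view_weights = nn.Linear(embed_dim, num_views * num_points)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, queries, view_feats):
        # queries:    (B, Q, C) 3D-query embeddings
        # view_feats: (B, N, C, H, W) per-camera image features
        B, Q, C = queries.shape
        N, P = self.num_views, self.num_points
        # Predicted sampling locations in normalized [-1, 1] coordinates.
        locs = self.sampling_offsets(queries).view(B, Q, N, P, 2).tanh()
        # Soft weights over all (view, point) pairs: the softmax spans
        # views, so the model can blend features across cameras instead
        # of hard-selecting the views a point projects into.
        w = self.view_weights(queries).view(B, Q, N * P).softmax(-1)
        w = w.view(B, Q, N, P)
        sampled = []
        for n in range(N):
            feat = view_feats[:, n]                             # (B, C, H, W)
            grid = locs[:, :, n]                                # (B, Q, P, 2)
            s = F.grid_sample(feat, grid, align_corners=False)  # (B, C, Q, P)
            sampled.append(s.permute(0, 2, 3, 1))               # (B, Q, P, C)
        sampled = torch.stack(sampled, dim=2)                   # (B, Q, N, P, C)
        out = (sampled * w.unsqueeze(-1)).sum(dim=(2, 3))       # (B, Q, C)
        return self.out_proj(out)
```

Under this reading, "learning-first" means the attention pattern over cameras is itself a learned function of the query, which sidesteps the projection-validity constraints the abstract attributes to sensor deployment.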
URL
https://arxiv.org/abs/2405.04299