Paper Reading AI Learner

TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal Aggregation

2024-04-17 23:49:00
Thomas Monninger, Vandana Dokkadi, Md Zafar Anwar, Steffen Staab

Abstract

Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. The effectiveness of a BEV encoder thus depends crucially on the operators used to aggregate temporal information and on the latent representation spaces in which aggregation takes place. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in the image or in the BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We treat consecutive image frames as a stereo pair through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. An ablation study uncovers a strong synergy of joint temporal aggregation in the image and BEV latent spaces. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
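The two aggregation strategies the abstract contrasts can be sketched in a few lines of NumPy. Everything below is illustrative only and not the paper's actual implementation: the function names, the channel-concatenation fusion in image space, the integer-shift ego-motion alignment, and the simple averaging in BEV space are all assumptions chosen to make the idea concrete.

```python
import numpy as np

def aggregate_image_space(feat_t, feat_tm1):
    # Image-space temporal aggregation (hypothetical sketch):
    # treat two consecutive camera feature maps as a "stereo through
    # time" pair and fuse them along the channel axis, as a temporal
    # stereo encoder might do before projecting features into BEV.
    return np.concatenate([feat_t, feat_tm1], axis=0)  # (2C, H, W)

def aggregate_bev_space(bev_t, bev_tm1, ego_shift):
    # BEV-space temporal aggregation (hypothetical sketch):
    # align the previous BEV grid to the current ego pose with a
    # simple integer translation, then fuse by averaging. A real
    # encoder would use continuous warping and a learned operator.
    aligned = np.roll(bev_tm1, shift=ego_shift, axis=(1, 2))
    return 0.5 * (bev_t + aligned)

# Toy example: 4-channel features, 16x16 image grid, 8x8 BEV grid.
rng = np.random.default_rng(0)
img_t, img_tm1 = rng.normal(size=(2, 4, 16, 16))
bev_t, bev_tm1 = rng.normal(size=(2, 4, 8, 8))

img_feat = aggregate_image_space(img_t, img_tm1)       # shape (8, 16, 16)
bev_feat = aggregate_bev_space(bev_t, bev_tm1, (1, 0))  # shape (4, 8, 8)
```

A combined encoder in the spirit of TempBEV would feed both aggregated tensors into downstream detection and segmentation heads, rather than relying on either space alone.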

URL

https://arxiv.org/abs/2404.11803

PDF

https://arxiv.org/pdf/2404.11803.pdf

