Paper Reading AI Learner

TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal Aggregation

2024-04-17 23:49:00
Thomas Monninger, Vandana Dokkadi, Md Zafar Anwar, Steffen Staab

Abstract

Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. The effectiveness of a BEV encoder thus depends crucially on the operators used to aggregate temporal information and on the latent representation spaces in which aggregation takes place. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in the image or in the BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We treat consecutive image frames as a stereo pair through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. An ablation study uncovers a strong synergy of joint temporal aggregation in the image and BEV latent spaces. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
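The two aggregation strategies the abstract contrasts can be sketched in a few lines of NumPy. Everything below is illustrative only and not the paper's actual implementation: the function names, the channel-concatenation fusion in image space, the integer-shift ego-motion alignment, and the simple averaging in BEV space are all assumptions chosen to make the idea concrete.

```python
import numpy as np

def aggregate_image_space(feat_t, feat_tm1):
    # Image-space temporal aggregation (hypothetical sketch):
    # treat two consecutive camera feature maps as a "stereo through
    # time" pair and fuse them along the channel axis, as a temporal
    # stereo encoder might do before projecting features into BEV.
    return np.concatenate([feat_t, feat_tm1], axis=0)  # (2C, H, W)

def aggregate_bev_space(bev_t, bev_tm1, ego_shift):
    # BEV-space temporal aggregation (hypothetical sketch):
    # align the previous BEV grid to the current ego pose with a
    # simple integer translation, then fuse by averaging. A real
    # encoder would use continuous warping and a learned operator.
    aligned = np.roll(bev_tm1, shift=ego_shift, axis=(1, 2))
    return 0.5 * (bev_t + aligned)

# Toy example: 4-channel features, 16x16 image grid, 8x8 BEV grid.
rng = np.random.default_rng(0)
img_t, img_tm1 = rng.normal(size=(2, 4, 16, 16))
bev_t, bev_tm1 = rng.normal(size=(2, 4, 8, 8))

img_feat = aggregate_image_space(img_t, img_tm1)       # shape (8, 16, 16)
bev_feat = aggregate_bev_space(bev_t, bev_tm1, (1, 0))  # shape (4, 8, 8)
```

A combined encoder in the spirit of TempBEV would feed both aggregated tensors into downstream detection and segmentation heads, rather than relying on either space alone.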

URL

https://arxiv.org/abs/2404.11803

PDF

https://arxiv.org/pdf/2404.11803.pdf

