OnlineBEV: Recurrent Temporal Fusion in Bird's Eye View Representations for Multi-Camera 3D Perception

2025-07-11 14:48:59
Junho Koh, Youngwoo Lee, Jungho Kim, Dongyoung Lee, Jun Won Choi

Abstract

Multi-view camera-based 3D perception can be conducted using bird's eye view (BEV) features obtained through perspective view-to-BEV transformations. Several studies have shown that the performance of these 3D perception methods can be further enhanced by combining sequential BEV features obtained from multiple camera frames. However, even after compensating for the ego-motion of an autonomous agent, the performance gain from temporal aggregation is limited when combining a large number of image frames. This limitation arises due to dynamic changes in BEV features over time caused by object motion. In this paper, we introduce a novel temporal 3D perception method called OnlineBEV, which combines BEV features over time using a recurrent structure. This structure increases the effective number of combined features with minimal memory usage. However, it is critical to spatially align the features over time to maintain strong performance. OnlineBEV employs the Motion-guided BEV Fusion Network (MBFNet) to achieve temporal feature alignment. MBFNet extracts motion features from consecutive BEV frames and dynamically aligns historical BEV features with current ones using these motion features. To enforce temporal feature alignment explicitly, we use Temporal Consistency Learning Loss, which captures discrepancies between historical and target BEV features. Experiments conducted on the nuScenes benchmark demonstrate that OnlineBEV achieves significant performance gains over the current best method, SOLOFusion. OnlineBEV achieves 63.9% NDS on the nuScenes test set, recording state-of-the-art performance in the camera-only 3D object detection task.
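The recurrent fusion idea in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the function names (`align_history`, `fuse`, `consistency_loss`, `run_recurrent_fusion`), the whole-cell shift used in place of MBFNet's learned motion-guided alignment, the fixed blending weight `alpha`, and the mean-squared-error form of the consistency term are all assumptions made for illustration. The sketch only shows why a single recurrent state keeps memory constant as more frames are combined, and why misaligned history produces a nonzero discrepancy.

```python
import numpy as np

def align_history(history, shift):
    """Shift a (C, H, W) BEV map by whole cells (dy, dx).

    Hypothetical stand-in for learned alignment; np.roll wraps around at
    the borders, which a real model would not do.
    """
    dy, dx = shift
    return np.roll(history, shift=(dy, dx), axis=(1, 2))

def fuse(aligned_history, current, alpha=0.5):
    """Blend aligned history into the current BEV map (toy recurrent update)."""
    return alpha * aligned_history + (1.0 - alpha) * current

def consistency_loss(aligned_history, target):
    """Mean squared discrepancy between aligned history and target features."""
    return float(np.mean((aligned_history - target) ** 2))

def run_recurrent_fusion(frames, shifts, alpha=0.5):
    """Fold a sequence of BEV maps into one state.

    Only one history map is held at a time, so memory does not grow
    with the number of frames combined.
    """
    state = frames[0]
    for frame, shift in zip(frames[1:], shifts):
        state = fuse(align_history(state, shift), frame, alpha)
    return state

# Toy scene: one active cell moving one column to the right per frame.
base = np.zeros((1, 4, 4))
base[0, 1, 0] = 1.0
frames = [np.roll(base, t, axis=2) for t in range(3)]
shifts = [(0, 1), (0, 1)]  # assumed known per-step object motion

fused = run_recurrent_fusion(frames, shifts)
aligned_loss = consistency_loss(align_history(frames[0], (0, 1)), frames[1])
unaligned_loss = consistency_loss(frames[0], frames[1])
```

In this toy scene, aligning the history before fusing keeps the moving cell sharp and the discrepancy at zero, while comparing unaligned frames gives a positive discrepancy.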

URL

https://arxiv.org/abs/2507.08644

PDF

https://arxiv.org/pdf/2507.08644.pdf

