Paper Reading AI Learner

DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos

2025-06-11 23:49:56
Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli

Abstract

Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
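The abstract outlines two mechanisms: a state-space recurrence that summarizes sampled features across video frames, and merge/remove/split updates that keep the detection query set lean. The sketch below is a minimal, illustrative reading of those ideas based only on the abstract; the linear recurrence, the cosine-similarity merging, and the threshold-based remove/split heuristics are assumptions, and the names `ssm_summarize` and `update_queries` are hypothetical rather than DySS's actual API.

```python
import numpy as np

def ssm_summarize(features, A, B):
    """Fold per-frame sampled features into a single state via a linear
    state-space recurrence h_t = A @ h_{t-1} + B @ x_t.
    features: (T, d_in); A: (d_state, d_state); B: (d_state, d_in)."""
    h = np.zeros(A.shape[0])
    for x in features:                      # one recurrence step per video frame
        h = A @ h + B @ x
    return h                                # compact temporal summary of the scene

def update_queries(queries, scores,
                   remove_thresh=0.1, merge_thresh=0.95, split_thresh=0.9):
    """Illustrative merge / remove / split pass over detection queries.
    queries: (N, d) query embeddings; scores: (N,) confidence-like values."""
    # Remove: drop low-confidence queries.
    keep = scores > remove_thresh
    queries, scores = queries[keep], scores[keep]

    # Merge: average groups of near-duplicate queries (cosine similarity).
    normed = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    new_q, new_s, used = [], [], set()
    for i in range(len(queries)):
        if i in used:
            continue
        group = [i] + [j for j in range(i + 1, len(queries))
                       if j not in used and sim[i, j] > merge_thresh]
        used.update(group)
        new_q.append(queries[group].mean(axis=0))
        new_s.append(scores[group].max())
    queries, scores = np.vstack(new_q), np.array(new_s)

    # Split: duplicate highly confident queries with small perturbations,
    # giving crowded regions extra detection hypotheses.
    extra = queries[scores > split_thresh]
    extra = extra + 0.01 * np.random.randn(*extra.shape)
    return np.vstack([queries, extra]) if len(extra) else queries

# Toy usage with random tensors (all dimensions are placeholders):
rng = np.random.default_rng(0)
state = ssm_summarize(rng.standard_normal((6, 32)),        # 6 frames of 32-d features
                      A=0.9 * np.eye(16),
                      B=rng.standard_normal((16, 32)))
q = update_queries(rng.standard_normal((20, 16)), rng.random(20))
```

In DySS itself, these query updates are driven by the learned state-space features and trained end to end; the fixed thresholds above merely stand in for whatever criteria the network learns.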

Abstract (translated)

Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most critical tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent work explores sparse query-based detection; nevertheless, these methods still require a large number of queries and can become very expensive when more video frames are processed. In this paper, we propose DySS, a novel method that adopts state-space learning and dynamic queries. Specifically, DySS leverages a state-space model (SSM) to process the sampled features step by step over time. To encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction when training the SSM. In this way, the state of the SSM provides an informative yet efficient summary of the scene. Based on the features learned by the state-space model, we dynamically update the queries via merge, remove, and split operations, maintaining a useful and lean set of detection queries throughout the network. Our DySS achieves both excellent detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS (nuScenes Detection Score) and 57.4 mAP (mean Average Precision), surpassing the latest state of the art. On the validation split, DySS achieves 56.2 NDS and 46.2 mAP, and runs at a real-time inference speed of 33 FPS.

URL

https://arxiv.org/abs/2506.10242

PDF

https://arxiv.org/pdf/2506.10242.pdf

