Multimodal 3D Object Detection on Unseen Domains

2024-04-17 21:47:45
Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J. Jones, Vishal M. Patel

Abstract

LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.
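
The abstract's second prong, pulling together same-class object features from different source domains while pushing different classes apart, can be illustrated with a generic supervised contrastive loss. The sketch below is not the CLIX$^\text{3D}$ implementation: the function name `cross_domain_supcon_loss`, the `temperature` value, the choice to treat only same-class, different-domain pairs as positives, and the use of plain NumPy arrays in place of fused LiDAR-image object features are all assumptions made for illustration.

```python
import numpy as np


def cross_domain_supcon_loss(features, labels, domains, temperature=0.1):
    """Supervised contrastive loss with cross-domain positives (illustrative).

    features: (n, d) array of object feature vectors (assumed non-zero, n >= 2)
    labels:   (n,) array of class ids
    domains:  (n,) array of source-domain ids
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    domains = np.asarray(domains)
    n = features.shape[0]

    # L2-normalize so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # Positives: same class, different domain (hypothetical design choice).
    same_class = labels[:, None] == labels[None, :]
    diff_domain = domains[:, None] != domains[None, :]
    positives = same_class & diff_domain

    # Exclude each anchor from its own softmax denominator.
    not_self = ~np.eye(n, dtype=bool)
    logits = np.where(not_self, logits, -np.inf)

    # Numerically stable log-softmax over all other samples in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    # Average log-probability of the positives, per anchor that has any.
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos_count[has_pos]
    return float(per_anchor.mean())


if __name__ == "__main__":
    labels = [0, 0, 1, 1]    # two classes
    domains = [0, 1, 0, 1]   # two source domains
    aligned = [[1, 0], [1, 0], [0, 1], [0, 1]]
    misaligned = [[1, 0], [0, 1], [0, 1], [1, 0]]
    print(cross_domain_supcon_loss(aligned, labels, domains))
    print(cross_domain_supcon_loss(misaligned, labels, domains))
```

In this toy batch, the loss is small when same-class features from the two domains coincide and large when they do not, which is the invariance the abstract describes promoting across source domains.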

URL

https://arxiv.org/abs/2404.11764

PDF

https://arxiv.org/pdf/2404.11764.pdf

