Learning 3D Perception from Others' Predictions

2024-10-03 16:31:28
Jinsu Yoo, Zhenyang Feng, Tai-Yu Pan, Yihong Sun, Cheng Perng Phoo, Xiangyu Chen, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao

Abstract

Accurate 3D object detection in real-world environments requires a huge amount of annotated data with high quality. Acquiring such data is tedious and expensive, and often needs repeated effort when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario to construct 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share the predictions with the ego agent (e.g., car). Naively using the received predictions as ground-truths to train the detector for the ego car, however, leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo labels. We propose a distance-based curriculum, first learning from closer units with similar viewpoints and subsequently improving the quality of other units' predictions via self-training. We further demonstrate that an effective pseudo label refinement module can be trained with a handful of annotated data, largely reducing the data quantity necessary to train an object detector. We validate our approach on the recently released real-world collaborative driving dataset, using reference cars' predictions as pseudo labels for the ego car. Extensive experiments including several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach toward label-efficient learning of 3D perception from other units' predictions.
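The distance-based curriculum mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Frame` structure, the `curriculum_stages` helper, and the distance thresholds are all hypothetical, chosen only to show the idea of training first on frames where the reference car is close to the ego car (and so shares a similar viewpoint) and admitting farther frames in later stages. The self-training step that would refine the pseudo labels of the farther frames is not shown.

```python
import math
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    """One training frame (hypothetical structure, for illustration only)."""
    frame_id: int
    ego_xy: Tuple[float, float]   # ego car position in a shared world frame, metres
    ref_xy: Tuple[float, float]   # reference car position in the same frame
    ref_boxes: tuple = ()         # reference car's predicted boxes (pseudo labels)


def ego_ref_distance(frame: Frame) -> float:
    """Euclidean distance between the ego car and the reference car."""
    return math.dist(frame.ego_xy, frame.ref_xy)


def curriculum_stages(
    frames: Sequence[Frame], thresholds: Sequence[float]
) -> Iterator[Tuple[float, List[Frame]]]:
    """Yield (threshold, frames within that distance) for increasing thresholds.

    Early stages contain only frames with a nearby reference car; later
    stages progressively admit frames with a more distant reference car.
    """
    for limit in sorted(thresholds):
        yield limit, [f for f in frames if ego_ref_distance(f) <= limit]


# Toy data: three frames with the reference car at 5 m, 50 m, and 10 m.
frames = [
    Frame(0, (0.0, 0.0), (3.0, 4.0)),
    Frame(1, (0.0, 0.0), (30.0, 40.0)),
    Frame(2, (0.0, 0.0), (6.0, 8.0)),
]

# Assumed thresholds (metres); the paper's actual schedule is not given here.
stages = list(curriculum_stages(frames, thresholds=[10.0, 60.0]))
for limit, subset in stages:
    print(limit, [f.frame_id for f in subset])
```

In a full pipeline, each stage would train or fine-tune the ego detector on its subset, and the detector from the previous stage would be used to clean up the pseudo labels of the newly admitted, farther frames before they are used.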


URL

https://arxiv.org/abs/2410.02646

PDF

https://arxiv.org/pdf/2410.02646.pdf

