Paper Reading AI Learner

3D Semantic MapNet: Building Maps for Multi-Object Re-Identification in 3D

2024-03-19 23:01:14
Vincent Cartillier, Neha Jain, Irfan Essa

Abstract

We study the task of 3D multi-object re-identification from embodied tours. Specifically, an agent is given two tours of an environment (e.g. an apartment) under two different layouts (e.g. arrangements of furniture). Its task is to detect and re-identify objects in 3D - e.g. a "sofa" moved from location A to B, a new "chair" in the second layout at location C, or a "lamp" from location D in the first layout missing in the second. To support this task, we create an automated infrastructure to generate paired egocentric tours of initial/modified layouts in the Habitat simulator using Matterport3D scenes, YCB and Google-scanned objects. We present 3D Semantic MapNet (3D-SMNet) - a two-stage re-identification model consisting of (1) a 3D object detector that operates on RGB-D videos with known pose, and (2) a differentiable object matching module that solves correspondence estimation between two sets of 3D bounding boxes. Overall, 3D-SMNet builds object-based maps of each layout and then uses a differentiable matcher to re-identify objects across the tours. After training 3D-SMNet on our generated episodes, we demonstrate zero-shot transfer to real-world rearrangement scenarios by instantiating our task in Replica, Active Vision, and RIO environments depicting rearrangements. On all datasets, we find 3D-SMNet outperforms competitive baselines. Further, we show jointly training on real and generated episodes can lead to significant improvements over training on real data alone.

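The second stage described in the abstract, a differentiable matching module that estimates correspondences between two sets of 3D bounding boxes, can be illustrated with a small sketch. The snippet below is not the authors' code: it assumes a Sinkhorn-style soft assignment with an extra "dustbin" row/column, which is one common way to build a differentiable matcher that tolerates added or missing objects, and every function name, the descriptor dimension, and the affinity definition (descriptor similarity minus a penalty on 3D centre distance) are illustrative assumptions. NumPy is used only to keep the sketch dependency-free; a trained matcher would be written in an autodiff framework.

```python
# Minimal, illustrative sketch (not the authors' released code) of a
# differentiable matcher between two sets of 3D detections, one per tour.
# A "dustbin" row/column lets objects stay unmatched (added or missing).
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_with_dustbin(scores, dustbin_score=0.0, n_iters=100):
    """Log-domain Sinkhorn on a score matrix augmented with a dustbin
    row/column, so each real object matches one object or the dustbin."""
    na, nb = scores.shape
    aug = np.full((na + 1, nb + 1), dustbin_score)
    aug[:na, :nb] = scores
    # Target marginals: each object carries unit mass; dustbins absorb the rest.
    log_mu = np.log(np.concatenate([np.ones(na), [nb]]))
    log_nu = np.log(np.concatenate([np.ones(nb), [na]]))
    u, v = np.zeros(na + 1), np.zeros(nb + 1)
    for _ in range(n_iters):
        u = log_mu - logsumexp(aug + v[None, :], axis=1)
        v = log_nu - logsumexp(aug + u[:, None], axis=0)
    return np.exp(aug + u[:, None] + v[None, :])   # (na+1, nb+1) soft assignment

def match_objects(desc_a, desc_b, centers_a, centers_b, dist_weight=0.1):
    """Pairwise affinity = descriptor similarity minus a 3D centre-distance penalty."""
    sim = desc_a @ desc_b.T
    dist = np.linalg.norm(centers_a[:, None] - centers_b[None], axis=-1)
    return sinkhorn_with_dustbin(sim - dist_weight * dist)

# Toy usage: 3 objects in the first layout, 2 in the second, 16-D descriptors.
rng = np.random.default_rng(0)
P = match_objects(rng.normal(size=(3, 16)), rng.normal(size=(2, 16)),
                  rng.uniform(0, 5, size=(3, 3)), rng.uniform(0, 5, size=(2, 3)))
print(P.round(2))  # mass in the last row/column flags added or missing objects
```

In such a formulation, probability mass that ends up in the dustbin row or column corresponds to objects present in only one of the two tours, which is exactly the added/missing case the task statement calls out.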

URL

https://arxiv.org/abs/2403.13190

PDF

https://arxiv.org/pdf/2403.13190.pdf

