Abstract
We study the task of 3D multi-object re-identification from embodied tours. Specifically, an agent is given two tours of an environment (e.g., an apartment) under two different layouts (e.g., arrangements of furniture). Its task is to detect and re-identify objects in 3D: e.g., a "sofa" moved from location A to B, a new "chair" in the second layout at location C, or a "lamp" at location D in the first layout that is missing in the second. To support this task, we create an automated infrastructure that generates paired egocentric tours of initial/modified layouts in the Habitat simulator using Matterport3D scenes and YCB and Google Scanned Objects. We present 3D Semantic MapNet (3D-SMNet), a two-stage re-identification model consisting of (1) a 3D object detector that operates on RGB-D videos with known pose, and (2) a differentiable object matching module that estimates correspondences between two sets of 3D bounding boxes. Overall, 3D-SMNet builds an object-based map of each layout and then uses a differentiable matcher to re-identify objects across the two tours. After training 3D-SMNet on our generated episodes, we demonstrate zero-shot transfer to real-world rearrangement scenarios by instantiating our task in the Replica, Active Vision, and RIO environments depicting rearrangements. On all datasets, we find that 3D-SMNet outperforms competitive baselines. Further, we show that jointly training on real and generated episodes leads to significant improvements over training on real data alone.
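The abstract does not detail the matching module's internals, but differentiable correspondence estimation between two detection sets is commonly implemented with Sinkhorn normalization over a pairwise score matrix (as in SuperGlue-style matchers). The sketch below is an illustrative assumption, not the authors' implementation: box centers stand in for learned object features, and the function names (`sinkhorn`, `match_boxes`) are hypothetical.

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    # Log-domain Sinkhorn: alternately normalize rows and columns so the
    # assignment matrix becomes (approximately) doubly stochastic.
    log_p = scores.astype(float).copy()
    for _ in range(n_iters):
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))
    return np.exp(log_p)

def match_boxes(centers_a, centers_b, temperature=0.2):
    # Pairwise similarity = negative Euclidean distance between 3D box
    # centers (a real model would compare learned object embeddings).
    diff = centers_a[:, None, :] - centers_b[None, :, :]
    scores = -np.linalg.norm(diff, axis=-1) / temperature
    return sinkhorn(scores)

# Toy example: two boxes per layout; both objects moved slightly.
a = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])  # layout 1 box centers
b = np.array([[2.1, 0.0, 0.0], [0.0, 0.1, 0.0]])  # layout 2 box centers
P = match_boxes(a, b)
matches = P.argmax(axis=1)  # for each box in layout 1, its match in layout 2
```

This plain Sinkhorn handles equal-sized sets only; re-identifying added or removed objects (the "new chair" / "missing lamp" cases in the task) additionally requires unmatched-object handling, e.g. a dustbin row and column in the score matrix.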
URL
https://arxiv.org/abs/2403.13190