Abstract
Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a fundamental task for human-robot interaction. Recognition errors can significantly degrade the overall accuracy and, in turn, the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy when multiple adjacent objects have a similar appearance. To address this issue, this work introduces human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. A new dataset called ScanERU is then constructed to evaluate the effectiveness of this idea. Unlike existing datasets, ScanERU is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to motivate research on ERU. Experimental results demonstrate the superiority of the proposed method, especially in recognizing multiple identical objects. Our code and dataset will be made publicly available.
URL
https://arxiv.org/abs/2303.13186