ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding

2023-03-23 11:36:14
Ziyang Lu, Yunqiang Pei, Guoqing Wang, Yang Yang, Zheng Wang, Heng Tao Shen

Abstract

3D visual grounding, which links natural language descriptions to specific regions of a 3D scene represented as point clouds, is a fundamental task for human-robot interaction. Recognition errors can significantly impact overall accuracy and thereby degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy when multiple adjacent objects have similar appearances. To address this issue, this work introduces human-robot interaction as an intuitive cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. A new dataset called ScanERU is then constructed to evaluate the effectiveness of this idea. Unlike existing datasets, ScanERU is the first to cover semi-synthetic scenes that integrate textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to stimulate research on ERU. Experimental results demonstrate the superiority of the proposed method, especially in recognizing multiple identical objects. Our code and dataset will be made publicly available.
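The abstract describes a framework that fuses textual, point-cloud, and gestural cues through attention. As a purely illustrative aid (the abstract does not specify the authors' architecture), the following minimal PyTorch sketch shows one way such multimodal attention fusion could score object proposals; the class name `GestureAwareGrounding`, the feature dimensions, and the two-stage cross-attention design are all assumptions, not the ScanERU model.

```python
import torch
import torch.nn as nn

class GestureAwareGrounding(nn.Module):
    """Hypothetical sketch of attention-based fusion for ERU-style grounding.

    Fuses per-object point-cloud features with language token embeddings and
    a gesture (body-pose) embedding, then scores each object proposal.
    Illustrative only; this is not the architecture proposed in the paper.
    """

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gesture_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, obj_feats, text_feats, gesture_feat):
        # obj_feats:    (B, N, dim) per-object point-cloud features
        # text_feats:   (B, T, dim) token embeddings of the description
        # gesture_feat: (B, 1, dim) embedding of the speaker's pointing pose
        x, _ = self.text_attn(obj_feats, text_feats, text_feats)
        x, _ = self.gesture_attn(x, gesture_feat, gesture_feat)
        return self.score(x).squeeze(-1)  # (B, N) grounding logits

if __name__ == "__main__":
    model = GestureAwareGrounding()
    logits = model(torch.randn(2, 32, 256),  # 32 object proposals
                   torch.randn(2, 20, 256),  # 20 description tokens
                   torch.randn(2, 1, 256))   # one gesture embedding
    print(logits.shape)  # torch.Size([2, 32])
```

In this sketch the gesture embedding acts as an extra attention context after the language pass, which matches the paper's stated intuition that pointing gestures help disambiguate multiple identical objects; the real model may combine the modalities quite differently.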

URL

https://arxiv.org/abs/2303.13186

PDF

https://arxiv.org/pdf/2303.13186.pdf
