
Tell and show: Combining multiple modalities to communicate manipulation tasks to a robot

2024-04-02 07:10:16
Petr Vanc, Radoslav Skoviera, Karla Stepanova

Abstract

As human-robot collaboration becomes more widespread, there is a need for a more natural way of communicating with the robot. This includes combining data from several modalities together with the context of the situation and background knowledge. Current approaches to communication typically rely on only a single modality, or are often rigid and not robust to missing, misaligned, or noisy data. In this paper, we propose a novel method that takes inspiration from sensor fusion approaches to combine uncertain information from multiple modalities and enhance it with situational awareness (e.g., considering object properties or the scene setup). We first evaluate the proposed solution on simulated bimodal datasets (gestures and language) and show through several ablation experiments the importance of the various components of the system and its robustness to noisy, missing, or misaligned observations. We then implement and evaluate the model on a real setup. In human-robot interaction, we must also consider whether the selected action is probable enough to be executed or whether we should instead query the human for clarification. For this purpose, we enhance our model with adaptive entropy-based thresholding that detects the appropriate thresholds for different types of interaction, showing performance similar to fine-tuned fixed thresholds.
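The abstract does not give the fusion or thresholding equations, but the core idea — merging uncertain per-modality action distributions and querying the human when the fused distribution is too uncertain — can be sketched as follows. This is a minimal illustrative sketch, not the authors' method: the product-fusion rule (a naive conditional-independence assumption), the function names, and the fixed entropy threshold are all assumptions; the paper's adaptive thresholding selects the threshold per interaction type rather than fixing it.

```python
import math

def fuse(gesture_probs, language_probs):
    """Fuse two per-action probability distributions by element-wise
    product (naive independence assumption), then renormalize."""
    raw = [g * l for g, l in zip(gesture_probs, language_probs)]
    total = sum(raw)
    return [p / total for p in raw]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decide(probs, threshold):
    """Execute the most probable action if the fused distribution is
    confident enough (low entropy); otherwise ask the human."""
    if entropy(probs) <= threshold:
        return ("execute", probs.index(max(probs)))
    return ("clarify", None)

# Agreeing modalities -> low entropy -> execute the top action.
fused = fuse([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(decide(fused, 0.7))   # -> ('execute', 0)

# A near-uniform fused distribution is too uncertain -> ask the human.
print(decide([1/3, 1/3, 1/3], 0.7))  # -> ('clarify', None)
```

In a real pipeline the per-action distributions would come from the gesture and language recognizers, and the threshold would be adapted to the interaction type instead of the hard-coded 0.7 used here.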


URL

https://arxiv.org/abs/2404.01702

PDF

https://arxiv.org/pdf/2404.01702.pdf
