Abstract
As human-robot collaboration becomes more widespread, there is a need for more natural ways of communicating with robots. This includes combining data from several modalities with the context of the situation and background knowledge. Current approaches to communication typically rely on a single modality or are rigid and not robust to missing, misaligned, or noisy data. In this paper, we propose a novel method, inspired by sensor fusion approaches, that combines uncertain information from multiple modalities and enhances it with situational awareness (e.g., by considering object properties or the scene setup). We first evaluate the proposed solution on simulated bimodal datasets (gestures and language) and show through several ablation experiments the importance of the system's individual components and its robustness to noisy, missing, or misaligned observations. We then implement and evaluate the model on a real robotic setup. In human-robot interaction, we must also consider whether the selected action is probable enough to be executed or whether it is better to query the human for clarification. For this purpose, we extend our model with adaptive entropy-based thresholding, which detects appropriate thresholds for different types of interaction and performs comparably to fine-tuned fixed thresholds.
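Illustrative sketch
To make the abstract's two core ideas concrete, the following is a minimal Python sketch of likelihood-style fusion of per-candidate beliefs from two modalities, followed by an entropy gate for the execute-versus-clarify decision. All function names, the independence assumption, and the fixed entropy threshold are our own illustration; the paper's actual model is richer and adapts the threshold per interaction type.

import numpy as np

def fuse_modalities(*modalities):
    """Multiply per-candidate probabilities from the available modalities
    (treated as independent evidence) and renormalize. Modalities passed
    as None (missing observations) are simply skipped."""
    available = [np.asarray(p, dtype=float) for p in modalities if p is not None]
    fused = np.ones_like(available[0])
    for p in available:
        fused *= p
    return fused / fused.sum()

def normalized_entropy(p):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(K)."""
    k = len(p)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(k))

def decide(fused, entropy_threshold=0.5):
    """Execute the most probable candidate when the fused distribution is
    confident enough; otherwise query the human. The fixed threshold here
    stands in for the paper's adaptive, per-interaction-type thresholding."""
    if normalized_entropy(fused) < entropy_threshold:
        return "execute", int(np.argmax(fused))
    return "clarify", None

# A pointing gesture is ambiguous between two objects; language resolves it.
gesture = [0.45, 0.45, 0.10]
language = [0.80, 0.10, 0.10]
print(decide(fuse_modalities(gesture, language)))  # ('execute', 0)
print(decide(fuse_modalities(gesture, None)))      # ('clarify', None)

Skipping a missing modality (treating it as uniform evidence) is what lets the fusion degrade gracefully rather than fail when an observation is absent, which mirrors the robustness claims in the abstract.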
URL
https://arxiv.org/abs/2404.01702