Abstract
In visual affordance learning, prior methods mainly rely on abundant images or videos of human behavior to identify regions where actions can be performed on objects, with applications across a variety of robotic tasks. However, these methods face a central challenge of action ambiguity (e.g., whether to beat or carry a drum), as well as the difficulty of processing intricate scenes. Moreover, timely human intervention is important for correcting robot errors. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied caption. This innovation enables robots to articulate their intentions and bridges the gap between explainable vision-language captioning and visual affordance learning. Since no appropriate dataset exists, we present a pioneering dataset and metrics tailored to this task, integrating images, heatmaps, and embodied captions. Furthermore, we propose a novel model that combines affordance grounding with self-explanation in a simple but effective manner. Extensive quantitative and qualitative experiments demonstrate our method's effectiveness.
URL
https://arxiv.org/abs/2404.05603