Paper Reading AI Learner

Self-Explainable Affordance Learning with Embodied Caption

2024-04-08 15:22:38
Zhipeng Zhang, Zhimin Wei, Guolei Sun, Peng Wang, Luc Van Gool

Abstract

In the field of visual affordance learning, previous methods mainly use abundant images or videos that delineate human behavior patterns to identify action-possibility regions for object manipulation, with a variety of applications in robotic tasks. However, they face a central challenge of action ambiguity, illustrated by vagueness such as whether to beat or carry a drum, as well as the complexity of processing intricate scenes. Moreover, timely human intervention to rectify robot errors is important. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied caption. This innovation enables robots to articulate their intentions and bridges the gap between explainable vision-language captioning and visual affordance learning. Due to the lack of an appropriate dataset, we unveil a pioneering dataset and metrics tailored for this task, which integrate images, heatmaps, and embodied captions. Furthermore, we propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner. Extensive quantitative and qualitative experiments demonstrate our method's effectiveness.
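The abstract describes a dataset in which each image is paired with an affordance heatmap and an embodied caption, and a model that is trained for affordance grounding and self-explanation jointly. As a rough illustration only (a minimal sketch under assumptions; none of the field names, shapes, or losses below come from the paper), one such sample record and a toy joint objective could look like this:

```python
# Hypothetical sketch (not the paper's actual data format or loss): a sample that
# pairs an image with an affordance heatmap and an embodied caption, plus a toy
# joint objective combining a heatmap loss with a caption loss.
from dataclasses import dataclass
import numpy as np

@dataclass
class AffordanceSample:
    image: np.ndarray        # (H, W, 3) RGB image of the scene
    heatmap: np.ndarray      # (H, W) affordance heatmap with values in [0, 1]
    caption: str             # embodied caption, e.g. "I will beat the drum with a stick"
    caption_ids: np.ndarray  # (T,) tokenized caption for the language head

def joint_loss(pred_heatmap, gt_heatmap, caption_logits, caption_ids, alpha=1.0):
    """Toy objective: pixel-wise BCE for affordance grounding plus
    token-level cross-entropy for the self-explaining caption."""
    eps = 1e-7
    p = np.clip(pred_heatmap, eps, 1 - eps)
    bce = -np.mean(gt_heatmap * np.log(p) + (1 - gt_heatmap) * np.log(1 - p))
    # caption_logits: (T, V) unnormalized scores over a vocabulary of size V
    m = caption_logits.max(axis=1, keepdims=True)
    logz = m.squeeze(1) + np.log(np.exp(caption_logits - m).sum(axis=1))
    ce = np.mean(logz - caption_logits[np.arange(len(caption_ids)), caption_ids])
    return bce + alpha * ce

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.random((8, 8))
    print(joint_loss(h, (h > 0.5).astype(float), rng.normal(size=(5, 100)), np.arange(5)))
```

The weighting term alpha is likewise an assumed knob for balancing the grounding and captioning terms, not a parameter reported by the authors.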

URL

https://arxiv.org/abs/2404.05603

PDF

https://arxiv.org/pdf/2404.05603.pdf

