A Multimodal Handover Failure Detection Dataset and Baselines

2024-02-28 13:29:28
Santosh Thoduka, Nico Hochgeschwender, Juergen Gall, Paul G. Plöger

Abstract

An object handover between a robot and a human is a coordinated action which is prone to failure for reasons such as miscommunication, incorrect actions and unexpected object properties. Existing works on handover failure detection and prevention focus on preventing failures due to object slip or external disturbances. However, there is a lack of datasets and evaluation methods that consider unpreventable failures caused by the human participant. To address this deficit, we present the multimodal Handover Failure Detection dataset, which consists of failures induced by the human participant, such as ignoring the robot or not releasing the object. We also present two baseline methods for handover failure detection: (i) a video classification method using 3D CNNs and (ii) a temporal action segmentation approach which jointly classifies the human action, robot action and overall outcome of the action. The results show that video is an important modality, but that adding force-torque data and gripper position helps improve failure detection and action segmentation accuracy.

URL

https://arxiv.org/abs/2402.18319

PDF

https://arxiv.org/pdf/2402.18319.pdf

