Paper Reading AI Learner

HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

2024-04-15 16:59:00
Siddhant Bansal, Michael Wray, Dima Damen

Abstract

Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions relating to locating hands, objects, and critically their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third person images fail to recognise and refer hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring hands and objects, and by 26.7% for referring interactions.
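The exact record format of HOI-QA is not shown on this page; as a rough illustration only, a referral-style question-answer pair with a bounding-box answer might look like the sketch below. All field names, the coordinate convention, and the values are hypothetical assumptions, not taken from the released dataset.

# Hypothetical sketch of an HOI-QA style question-answer pair.
# Field names, coordinate format, and values are illustrative only;
# they are assumptions, not the released HOI-QA schema.

hoi_qa_example = {
    "image": "egocentric_frame_000123.jpg",  # assumed file name
    "question": "What is the object being manipulated by the right hand?",
    "answer": "knife",
    # Referral answers pair the noun with a bounding box (x1, y1, x2, y2);
    # normalised image coordinates are assumed here.
    "answer_box": (0.42, 0.55, 0.61, 0.78),
}

if __name__ == "__main__":
    q, a = hoi_qa_example["question"], hoi_qa_example["answer"]
    print(f"Q: {q}\nA: {a} at {hoi_qa_example['answer_box']}")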

URL

https://arxiv.org/abs/2404.09933

PDF

https://arxiv.org/pdf/2404.09933.pdf
