Abstract
Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks, including visual question answering, object recognition, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images, which aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset, consisting of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions about locating hands, objects, and, critically, their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third-person images fail to recognise and refer to hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring to hands and objects, and by 26.7% for referring to interactions.
URL
https://arxiv.org/abs/2404.09933