Paper Reading AI Learner

Learning Object States from Actions via Large Language Models

2024-05-02 08:43:16
Masatoshi Tateno, Takuma Yagi, Ryosuke Furuta, Yoichi Sato

Abstract

Temporally localizing the presence of object states in videos is crucial for understanding human activities beyond actions and objects. This task has suffered from a lack of training data due to the inherent ambiguity and variety of object states. To avoid exhaustive annotation, learning from the transcribed narrations of instructional videos is an attractive alternative. However, object states are described far less often in narrations than actions are, which makes narrations alone less effective. In this work, we propose to extract object state information from the action information included in narrations, using large language models (LLMs). Our observation is that LLMs contain world knowledge about the relationship between actions and their resulting object states, and can infer the presence of object states from past action sequences. The proposed LLM-based framework offers the flexibility to generate plausible pseudo object-state labels for arbitrary categories. We evaluate our method on our newly collected Multiple Object States Transition (MOST) dataset, which includes dense temporal annotations for 60 object state categories. Trained with the generated pseudo-labels, our model achieves an improvement of over 29% in mAP against strong zero-shot vision-language models, demonstrating the effectiveness of explicitly extracting object state information from actions through LLMs.

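To make the idea concrete, the pipeline the abstract describes (prompting an LLM with the narrated action history to judge whether a given object state currently holds, then turning those judgments into temporal pseudo-labels) can be sketched roughly as follows. This is a minimal illustration of the idea, not the authors' implementation: the prompt wording, the query_llm stub, and the rule that a decision persists until the next narrated action are all assumptions made for the example.

# Minimal sketch of LLM-based pseudo-label generation for object states.
# Assumptions (not taken from the paper): the prompt wording, the query_llm
# stub, and the rule that a "yes" decision persists until the next narration.

from typing import Callable, List, Tuple
import numpy as np

def build_prompt(actions: List[str], state: str) -> str:
    """Ask whether an object state holds after the narrated actions so far."""
    history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(actions))
    return (
        "You are given the actions a person has performed so far:\n"
        f"{history}\n"
        f"Question: is the object state '{state}' present now? Answer yes or no."
    )

def pseudo_labels(
    narrations: List[Tuple[float, str]],   # (timestamp in seconds, action sentence)
    state: str,                            # e.g. "egg | whisked"
    video_length: float,                   # seconds
    query_llm: Callable[[str], str],       # any LLM backend returning free-form text
    fps: float = 1.0,
) -> np.ndarray:
    """Return a per-frame {0,1} pseudo-label track for one object-state category."""
    n_frames = int(video_length * fps)
    labels = np.zeros(n_frames, dtype=np.int64)
    actions_so_far: List[str] = []
    for idx, (t, sentence) in enumerate(narrations):
        actions_so_far.append(sentence)
        answer = query_llm(build_prompt(actions_so_far, state)).strip().lower()
        if answer.startswith("yes"):
            # Assume the state persists until the next narrated action (or video end).
            t_next = narrations[idx + 1][0] if idx + 1 < len(narrations) else video_length
            labels[int(t * fps): int(t_next * fps)] = 1
    return labels

# Toy usage with a hard-coded stand-in for the LLM:
if __name__ == "__main__":
    demo_narrations = [(3.0, "crack two eggs into a bowl"), (10.0, "whisk the eggs")]
    track = pseudo_labels(
        demo_narrations, "egg | whisked", video_length=20.0,
        query_llm=lambda prompt: "yes" if "whisk the eggs" in prompt else "no",
    )
    print(track)  # 1s after the whisking action, 0s elsewhere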

URL

https://arxiv.org/abs/2405.01090

PDF

https://arxiv.org/pdf/2405.01090.pdf

