Abstract
Temporally localizing the presence of object states in videos is crucial for understanding human activities beyond actions and objects. This task has suffered from a lack of training data due to object states' inherent ambiguity and variety. To avoid exhaustive annotation, learning from transcribed narrations in instructional videos is an attractive alternative. However, object states are described in narrations far less often than actions, making narrations less effective as direct supervision. In this work, we propose to extract object state information from the action information included in narrations, using large language models (LLMs). Our observation is that LLMs encode world knowledge about the relationship between actions and their resulting object states, and can infer the presence of object states from past action sequences. The proposed LLM-based framework offers the flexibility to generate plausible pseudo-object-state labels for arbitrary categories. We evaluate our method on our newly collected Multiple Object States Transition (MOST) dataset, which includes dense temporal annotation of 60 object state categories. Our model, trained on the generated pseudo-labels, demonstrates a significant improvement of over 29% in mAP against strong zero-shot vision-language models, showing the effectiveness of explicitly extracting object state information from actions through LLMs.
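The pseudo-labeling idea described above can be sketched as prompt construction: given the sequence of actions narrated so far, ask an LLM whether a candidate object state currently holds. The prompt format, function name, and example actions below are illustrative assumptions, not the paper's actual implementation, and the LLM call itself is omitted.

```python
def build_state_prompt(actions, obj, state):
    """Format a past-action sequence into a yes/no question about an
    object state, suitable for querying an LLM for pseudo-labels.
    (Hypothetical prompt format, not the paper's exact one.)"""
    history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(actions))
    return (
        "The following actions were performed in order:\n"
        f"{history}\n"
        f"After these actions, is the {obj} in the state '{state}'? "
        "Answer yes or no."
    )

# Example: actions transcribed from an instructional cooking video.
prompt = build_state_prompt(
    ["crack the egg into a bowl", "whisk the egg", "pour it into the hot pan"],
    obj="egg",
    state="cooked",
)
print(prompt)
```

Querying such prompts over a sliding window of narrated actions would yield dense temporal pseudo-labels for each object state category, which is the kind of supervision the abstract describes.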
URL
https://arxiv.org/abs/2405.01090