Abstract
There has recently been growing sentiment that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension, and as a result both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short, natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model, GPT-4o, obtains only ~50% on our text and video scores, a large gap from the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform far worse, mostly at random-chance level. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.
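
For reference, the text and video scores mentioned above are Winoground-style metrics computed over counterfactual pairs. Below is a minimal sketch of how such pair-level scoring is commonly implemented; the sim matching function and the exact scoring rules here are illustrative assumptions (e.g. for CLIP-style models), not the paper's released evaluation code.

# Minimal sketch of Winoground-style pair scoring (illustrative; not the
# paper's released evaluation code). `sim` is a hypothetical stand-in for a
# model's video-text matching score, e.g. a CLIP-style cosine similarity.
from typing import Callable, Sequence, Tuple

# One counterfactual pair: (video_0, video_1, caption_0, caption_1), where
# caption_i describes video_i and the two captions differ only in the
# temporal order of the described events.
Pair = Tuple[str, str, str, str]
Sim = Callable[[str, str], float]  # sim(video, caption) -> matching score

def text_correct(sim: Sim, pair: Pair) -> bool:
    v0, v1, c0, c1 = pair
    # Text score: for each video, the matching caption must outscore the other.
    return sim(v0, c0) > sim(v0, c1) and sim(v1, c1) > sim(v1, c0)

def video_correct(sim: Sim, pair: Pair) -> bool:
    v0, v1, c0, c1 = pair
    # Video score: for each caption, the matching video must outscore the other.
    return sim(v0, c0) > sim(v1, c0) and sim(v1, c1) > sim(v0, c1)

def evaluate(sim: Sim, pairs: Sequence[Pair]) -> dict:
    n = len(pairs)
    return {
        "text_score": sum(text_correct(sim, p) for p in pairs) / n,
        "video_score": sum(video_correct(sim, p) for p in pairs) / n,
    }

Under this sketch's definition, a pair counts only when the model is right on both halves, so random guessing yields 25% per metric rather than 50%.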
URL
https://arxiv.org/abs/2410.02763