Abstract
Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals who appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that have usually suffer from limited generalization and depend on point-wise matching, which limits their overall effectiveness. In this paper, unlike previous approaches, we build upon large vision-language models (LVLMs) to leverage their powerful priors. We aim to unlock the inherent ID-Matching capabilities of LVLMs themselves to improve the ID-Matching performance of captions. Specifically, we first introduce a new benchmark for assessing the ID-Matching capabilities of video captions. Using this benchmark, we investigate LVLMs, including GPT-4o, and find that ID-Matching performance can be improved in two ways: 1) making better use of image information and 2) increasing the amount of information in the descriptions of individuals. Based on these insights, we propose a novel video captioning method called Recognizing Identities for Captioning Effectively (RICE). Extensive experiments, including assessments of caption quality and ID-Matching performance, demonstrate the superiority of our approach. Notably, when implemented on GPT-4o, RICE improves the precision of ID-Matching from 50% to 90% and the recall from 15% to 80% compared to the baseline. RICE makes it possible to continuously track different individuals in the captions of long videos.
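The abstract reports ID-Matching precision and recall but does not say how the benchmark computes them. As a rough illustration only, the sketch below uses a common pairwise formulation: every pair of person mentions that a caption assigns to the same identity counts as a predicted match, and it is correct if the ground truth also assigns both mentions to the same person. The function names, the label format, and the example data are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def same_identity_pairs(labels):
    """Return the set of mention-index pairs (i, j), i < j, sharing a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_precision_recall(true_ids, pred_ids):
    """Pairwise ID-matching precision and recall (assumed formulation).

    true_ids[k] is the ground-truth identity of mention k; pred_ids[k] is the
    identity the caption assigns to the same mention. Label names need not
    agree between the two lists; only the grouping matters.
    """
    if len(true_ids) != len(pred_ids):
        raise ValueError("true_ids and pred_ids must have the same length")

    true_pairs = same_identity_pairs(true_ids)
    pred_pairs = same_identity_pairs(pred_ids)
    correct = len(true_pairs & pred_pairs)

    # Convention chosen here: report 0.0 when there are no pairs to score.
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical example: five person mentions across frames.
# Ground truth: person A appears in mentions 0, 2, 3; person B in 1, 4.
true_ids = ["A", "B", "A", "A", "B"]
# The caption links mentions 0 and 2, and 1 and 4, but treats mention 3
# as a new person.
pred_ids = ["p1", "p2", "p1", "p3", "p2"]

precision, recall = pairwise_precision_recall(true_ids, pred_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case both predicted links are correct, so precision is 1.0, while only two of the four true links are recovered, so recall is 0.5. A captioner that rarely re-identifies people would show the same pattern of acceptable precision and low recall.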
URL
https://arxiv.org/abs/2510.06973