Abstract
The burgeoning field of video-text retrieval has witnessed significant advancements with the advent of deep learning. However, matching text and video remains challenging owing to inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders a comprehensive understanding of videos, resulting in ambiguous retrieval results. While rewriting methods based on large language models have been proposed to broaden textual expression, carefully crafted prompts are essential to ensure the reasonableness and completeness of the rewritten texts. This paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning. Additionally, an expertized caption selection mechanism is designed to customize augmented captions for each video, facilitating video-text matching. Our method is entirely data-driven: it not only dispenses with heavy data collection and computation workloads but also improves self-adaptability by circumventing lexicon dependence and introducing personalized matching. The superiority of our method is validated by state-of-the-art results on various benchmarks, specifically Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo.
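To make the caption selection idea concrete, below is a minimal, hypothetical sketch of how augmented captions could be scored and softly selected per video in embedding space. The function names (`cosine_sim`, `select_captions`) and the fixed softmax-with-temperature heuristic are illustrative assumptions, not the paper's actual learned mechanism.

```python
# Hypothetical illustration of per-video caption selection: score each
# augmented caption against the video embedding and derive soft weights
# that could gate the captions' contribution to video-text matching.
# All names and the temperature heuristic are assumptions for exposition.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector a (d,) and each row of b (n, d)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a

def select_captions(video_emb: np.ndarray,
                    caption_embs: np.ndarray,
                    temperature: float = 0.05) -> np.ndarray:
    """Return soft selection weights over a video's augmented captions."""
    sims = cosine_sim(video_emb, caption_embs)   # (n_captions,)
    logits = sims / temperature                  # sharper = closer to argmax
    weights = np.exp(logits - logits.max())      # numerically stable softmax
    return weights / weights.sum()

# Toy usage: 4 augmented captions for one video, 512-dim embeddings.
rng = np.random.default_rng(0)
video_emb = rng.standard_normal(512)
caption_embs = rng.standard_normal((4, 512))
print(select_captions(video_emb, caption_embs))
```

A soft weighting like this keeps the step differentiable, so in principle the selection could be trained end-to-end with the retrieval objective rather than hand-tuned.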
URL
https://arxiv.org/abs/2502.02885