Abstract
Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety, constrained by their uniform level of detail and fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has made it possible to generate diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text length and granularity) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks and covering three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We provide texts at four granularities in our benchmark, reflecting the extent and density of semantic information, and we expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse texts on tracking performance, and we hope that the performance bottlenecks identified in existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results, and toolkit will be released gradually at this http URL.
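To make the abstract's annotation strategy concrete, below is a minimal sketch of how one might prompt an LLM to produce several granularities of text for a single tracking sequence. The granularity names, prompt wording, and the `VideoMeta`, `annotate_video`, and `call_llm` names are hypothetical illustrations rather than the paper's released toolkit; any real LLM client could be dropped in as the `call_llm` callable.

```python
# A minimal, hypothetical sketch of a multi-granularity annotation pipeline
# of the kind described in the abstract -- not the authors' released toolkit.
# The granularity names, prompt wording, VideoMeta fields, and call_llm hook
# are all illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VideoMeta:
    """Lightweight metadata for one tracking sequence."""
    name: str                # sequence identifier, e.g. "airplane-1"
    target_class: str        # coarse object class, e.g. "zebra"
    initial_bbox: List[int]  # [x, y, w, h] of the target in the first frame


# Hypothetical granularity levels, ordered from terse to dense.
GRANULARITIES: Dict[str, str] = {
    "concise": "Describe the tracked target in at most ten words.",
    "detailed": "Describe the target's appearance and initial position in one or two sentences.",
    "dense_initial": "Describe the target, its surroundings, and its likely motion in the first frame.",
    "dense_dynamic": "Describe how the target and scene evolve, refreshing the description periodically.",
}


def annotate_video(video: VideoMeta, call_llm: Callable[[str], str]) -> Dict[str, str]:
    """Generate one text annotation per granularity level for a single video."""
    annotations: Dict[str, str] = {}
    for level, instruction in GRANULARITIES.items():
        prompt = (
            f"{instruction}\n"
            f"Sequence: {video.name}\n"
            f"Target class: {video.target_class}\n"
            f"Initial box (x, y, w, h): {video.initial_bbox}"
        )
        annotations[level] = call_llm(prompt)
    return annotations


if __name__ == "__main__":
    def demo_llm(prompt: str) -> str:
        # Stub so the sketch runs without any external LLM service.
        return f"[LLM output for a {len(prompt)}-character prompt]"

    meta = VideoMeta(name="demo_seq", target_class="zebra", initial_bbox=[120, 80, 60, 40])
    for level, text in annotate_video(meta, demo_llm).items():
        print(f"{level}: {text}")
```

Passing the LLM as a plain callable keeps the sketch independent of any particular provider API; in practice the per-sequence metadata and prompt templates would be derived from the underlying SOT and VLT benchmarks the paper builds on.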
Abstract (translated)
Visual language tracking (VLT) has become a cutting-edge research area that leverages linguistic data to strengthen algorithms with multi-modal inputs and extends the scope of traditional single object tracking (SOT) to cover video understanding applications. However, most VLT benchmarks still rely on concise, human-written text descriptions for each video. These descriptions often fail to capture the nuances of video content dynamics and lack stylistic variety in language, limited by their level of detail and fixed annotation frequency. Consequently, algorithms tend to fall back on a "memorize the answer" strategy, deviating from the core goal of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has made it possible to generate diverse text. This work uses LLMs to generate diverse semantic annotations of varying text length and granularity, thereby building a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark named DTVLT, built on five prominent VLT and SOT benchmarks and covering three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We provide four granularities of text in the benchmark, accounting for the extent and density of semantic information, and expect this multi-granular generation strategy to create a favorable environment for VLT and video understanding research. (3) We conduct a comprehensive experimental analysis on DTVLT, evaluate the impact of diverse texts on tracking performance, and hope that the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results, and toolkit will be released gradually at the URL below.
URL
https://arxiv.org/abs/2410.02492