DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

2024-10-03 13:57:07
Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang

Abstract

Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety, constrained by a uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has made the generation of diverse text feasible. This work utilizes LLMs to generate varied semantic annotations (in terms of text length and granularity) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, built on five prominent VLT and SOT benchmarks and covering three sub-tasks: short-term tracking, long-term tracking, and global instance tracking; (2) provide texts at four levels of granularity in our benchmark, varying the extent and density of semantic information, and expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research; (3) conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance, in the hope that the performance bottlenecks identified in existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results, and toolkit will be released gradually at this http URL.
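The abstract does not spell out the generation pipeline, but the core idea, producing text annotations at several granularities with an LLM, can be sketched in a few lines of Python. Everything below is a hypothetical illustration rather than the authors' implementation: the four granularity names, the prompt templates, the query_llm placeholder, and the 100-frame dense-annotation interval are all assumptions made for the example.

# Hypothetical granularity scheme; the paper's exact definitions may differ.
GRANULARITY_PROMPTS = {
    "initial_concise": "In one short sentence, describe the target object in the first frame.",
    "initial_detailed": "In two or three sentences, describe the target object's appearance, position, and surroundings in the first frame.",
    "dense_concise": "In one short sentence, describe the target's current state and motion in this frame.",
    "dense_detailed": "In two or three sentences, describe the target's appearance, motion, and interactions in this frame.",
}


def query_llm(prompt: str, frame_context: str) -> str:
    """Stand-in for a real LLM call (hosted API or local captioning model).

    The signature and behavior are purely illustrative; swap in an actual
    client here.
    """
    return f"[generated text | prompt={prompt!r} | context={frame_context!r}]"


def annotate_sequence(frame_contexts: list, dense_interval: int = 100) -> dict:
    """Generate multi-granularity text for one tracking sequence.

    `frame_contexts` stands in for whatever per-frame visual information the
    generator consumes. Initial texts describe frame 0 only; dense texts are
    regenerated every `dense_interval` frames (the interval is an arbitrary
    choice for this sketch).
    """
    annotations = {
        "initial_concise": query_llm(GRANULARITY_PROMPTS["initial_concise"], frame_contexts[0]),
        "initial_detailed": query_llm(GRANULARITY_PROMPTS["initial_detailed"], frame_contexts[0]),
        "dense_concise": [],
        "dense_detailed": [],
    }
    for i in range(0, len(frame_contexts), dense_interval):
        for level in ("dense_concise", "dense_detailed"):
            annotations[level].append((i, query_llm(GRANULARITY_PROMPTS[level], frame_contexts[i])))
    return annotations


if __name__ == "__main__":
    # Toy usage: 300 dummy frame descriptions standing in for one video.
    frames = [f"frame {i}" for i in range(300)]
    result = annotate_sequence(frames)
    print(result["initial_concise"])
    print(len(result["dense_concise"]), "dense concise annotations")

Keeping the prompt templates separate from the sequence loop makes the granularity scheme easy to extend, which is presumably what a benchmark of this kind needs when adding new text styles or annotation densities.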

URL

https://arxiv.org/abs/2410.02492

PDF

https://arxiv.org/pdf/2410.02492.pdf

