Paper Reading AI Learner

TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos

2025-03-09 09:11:26
Chen-Lin Zhang, Lin Sui, Shuming Liu, Fangzhou Mu, Zhangcheng Wang, Bernard Ghanem

Abstract

Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at this https URL.
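The temporal chunking mentioned in the abstract — splitting a very long frame sequence into fixed-size chunks, encoding each chunk separately, and reassembling the per-frame features for the localization model — can be sketched as below. This is a minimal illustration under assumed names (chunk_video_frames, encode_long_video, video_encoder, chunk_size); the paper's actual chunking and end-to-end training details may differ.

import torch
import torch.nn as nn

def chunk_video_frames(frames: torch.Tensor, chunk_size: int = 512):
    # Split a (T, C, H, W) frame tensor into consecutive chunks of at most
    # chunk_size frames along the temporal axis.
    return [frames[i:i + chunk_size] for i in range(0, frames.shape[0], chunk_size)]

def encode_long_video(frames: torch.Tensor, video_encoder: nn.Module,
                      chunk_size: int = 512) -> torch.Tensor:
    # Encode a long video chunk by chunk so that sequences with tens of
    # thousands of frames fit in memory, then concatenate the per-frame
    # features back along time for the downstream localization model.
    features = [video_encoder(chunk) for chunk in chunk_video_frames(frames, chunk_size)]
    return torch.cat(features, dim=0)  # (T, D) feature sequence

In this sketch, gradients still flow through each chunk's forward pass, which is what would allow the video encoder to be trained jointly with the localization head; memory-saving measures such as gradient checkpointing are omitted for brevity.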

URL

https://arxiv.org/abs/2503.06526

PDF

https://arxiv.org/pdf/2503.06526.pdf

