Video Editing for Video Retrieval

2024-02-04 04:13:31
Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen

Abstract

Though pre-trained vision-language models have demonstrated significant benefits in boosting video-text retrieval performance by learning from large-scale web videos, fine-tuning still plays a critical role and relies on clips manually annotated with start and end times, which requires considerable human effort. To address this issue, we explore an alternative, cheaper source of annotation for video-text retrieval: single timestamps. We initialise clips from timestamps in a heuristic way to warm up a retrieval model, then propose a video clip editing method that refines the initial rough boundaries to improve retrieval performance. A student-teacher network is introduced for video clip editing: the teacher model edits the clips in the training set, whereas the student model trains on the edited clips, and the teacher's weights are updated from the student's whenever the student's performance improves. Our method is model-agnostic and applicable to any retrieval model. We conduct experiments with three state-of-the-art retrieval models, COOT, VideoCLIP and CLIP4Clip, on three video retrieval datasets, YouCook2, DiDeMo and ActivityNet-Captions. The results show that our edited clips consistently improve retrieval performance over the initial clips across all three retrieval models.
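
To make the training scheme concrete, here is a minimal sketch of the student-teacher clip-editing loop described above. It assumes PyTorch-style models; the warm-up window in `init_clips` and the helpers `edit_clips`, `train_one_epoch` and `evaluate_retrieval` are hypothetical placeholders standing in for the paper's components, not the authors' implementation.

```python
import copy

import torch.nn as nn


def init_clips(timestamps, half_window=5.0):
    """Warm-up heuristic: expand each single timestamp into a rough clip.
    A fixed window centred on the annotated moment is an assumption for
    illustration; the paper's heuristic may differ."""
    return [(max(0.0, t - half_window), t + half_window) for t in timestamps]


def edit_clips(teacher, videos, captions, clips):
    """Placeholder: the teacher rescores candidate boundaries around each
    clip against its caption and keeps the best-matching ones."""
    return clips


def train_one_epoch(student, videos, captions, clips):
    """Placeholder: one epoch of video-text retrieval training on the
    current clips."""


def evaluate_retrieval(student):
    """Placeholder: retrieval metric (e.g. R@1) on a validation set."""
    return 0.0


def train_with_clip_editing(student: nn.Module, videos, captions,
                            timestamps, rounds=5, epochs_per_round=1):
    clips = init_clips(timestamps)       # rough clips from single timestamps
    teacher = copy.deepcopy(student)     # teacher edits, student trains
    best = float("-inf")
    for _ in range(rounds):
        # Teacher refines the boundaries of every clip in the training set.
        clips = edit_clips(teacher, videos, captions, clips)
        # Student trains on the freshly edited clips.
        for _ in range(epochs_per_round):
            train_one_epoch(student, videos, captions, clips)
        # Teacher weights are refreshed from the student's only when the
        # student's retrieval performance improves.
        score = evaluate_retrieval(student)
        if score > best:
            best = score
            teacher.load_state_dict(student.state_dict())
    return student, clips
```

The only method-specific pieces are the teacher's boundary-editing pass over the training clips and the improvement-gated weight transfer from student to teacher; the underlying retrieval model (COOT, VideoCLIP or CLIP4Clip) is interchangeable, which is what makes the approach model-agnostic.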

URL

https://arxiv.org/abs/2402.02335

PDF

https://arxiv.org/pdf/2402.02335.pdf

