Abstract
Although pre-trained vision-language models have shown significant benefits for video-text retrieval by learning from large-scale web videos, fine-tuning still relies on manually annotated clips with start and end times, which requires considerable human effort. To address this issue, we explore an alternative, cheaper source of annotation for video-text retrieval: single timestamps. We heuristically initialise clips from these timestamps to warm up a retrieval model, and then propose a video clip editing method that refines the initial rough boundaries to improve retrieval performance. A student-teacher network performs the clip editing: the teacher model edits the clips in the training set while the student model trains on the edited clips, and the teacher's weights are updated from the student's once the student's performance improves. Our method is model-agnostic and applicable to any retrieval model. We conduct experiments with three state-of-the-art retrieval models: COOT, VideoCLIP and CLIP4Clip. Experiments on three video retrieval datasets (YouCook2, DiDeMo and ActivityNet-Captions) show that our edited clips consistently improve retrieval performance over the initial clips across all three retrieval models.
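The abstract outlines an iterative procedure: initialise rough clips from single timestamps, let a teacher model refine their boundaries, train a student on the refined clips, and copy student weights into the teacher only when the student improves. The sketch below illustrates that loop under stated assumptions; it is not the authors' code, and all names (init_clips_from_timestamps, edit_fn, train_fn, eval_fn, the fixed half-width) are hypothetical, with PyTorch-style state_dict calls assumed for the weight copy.

import copy

def init_clips_from_timestamps(timestamps, half_width=2.0):
    # Heuristic warm-up: expand each single timestamp t into a rough clip
    # [t - half_width, t + half_width]. The width is an illustrative choice.
    return [(max(0.0, t - half_width), t + half_width) for t in timestamps]

def train_with_clip_editing(student, timestamps, train_fn, edit_fn, eval_fn, epochs):
    clips = init_clips_from_timestamps(timestamps)
    teacher = copy.deepcopy(student)          # teacher starts as a copy of the student
    best_score = eval_fn(student, clips)      # e.g. retrieval recall on a validation split
    for _ in range(epochs):
        clips = edit_fn(teacher, clips)       # teacher refines the rough clip boundaries
        train_fn(student, clips)              # student trains on the edited clips
        score = eval_fn(student, clips)
        if score > best_score:                # teacher adopts student weights only on improvement
            best_score = score
            teacher.load_state_dict(student.state_dict())
    return student, clips

Because the editing and training steps only exchange clips and weights, any retrieval backbone (e.g. COOT, VideoCLIP or CLIP4Clip) could, in principle, be plugged in as the student and teacher, which is consistent with the model-agnostic claim above.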
URL
https://arxiv.org/abs/2402.02335