Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Abstract

In recent years, text-to-video retrieval methods based on CLIP have developed rapidly. The primary direction of evolution is to exploit a much wider gamut of visual and textual cues to achieve alignment. Concretely, the methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring that the model captures visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This solution balances the coarse and fine granularity of retrieval content, and it also strikes a balance between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, our method achieves performance comparable to the current state-of-the-art methods while being nearly 50 times faster.
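
The coarse-to-fine retrieval described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the TIB formula, so the code assumes that "parameter-free text-gated" pooling means weighting frame features by a softmax over text-frame cosine similarity, and that the coarse video representation is a plain mean over frames. The function names (`text_gated_pool`, `two_stage_retrieve`), the `temperature` value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(frames):
    """Coarse video representation (assumption): average of frame features."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def text_gated_pool(text, frames, temperature=0.1):
    """Fine video representation (assumption): frames weighted by a softmax
    over text-frame cosine similarity. No learned parameters are involved."""
    t = l2_normalize(text)
    normed = [l2_normalize(f) for f in frames]
    sims = [dot(t, f) / temperature for f in normed]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * f[d] for w, f in zip(weights, normed)) for d in range(len(t))]
    return l2_normalize(pooled)


def two_stage_retrieve(text, videos, k=2):
    """Return candidate video indices, recalled coarsely and reranked finely."""
    t = l2_normalize(text)
    # Stage 1: cheap recall of the top-k videos using coarse representations.
    # In practice these would be precomputed offline, independent of the query.
    coarse = [dot(t, l2_normalize(mean_pool(frames))) for frames in videos]
    candidates = sorted(range(len(videos)), key=lambda i: coarse[i], reverse=True)[:k]
    # Stage 2: rerank only the k candidates with the text-conditioned features.
    fine = {i: dot(t, text_gated_pool(text, videos[i])) for i in candidates}
    return sorted(candidates, key=lambda i: fine[i], reverse=True)


# Toy data: three videos, two frames each, 2-dimensional features.
query = [1.0, 0.0]
videos = [
    [[1.0, 0.0], [0.0, 1.0]],  # one frame matches the query exactly
    [[0.9, 0.1], [0.9, 0.1]],  # every frame is a near match
    [[0.0, 1.0], [0.0, 1.0]],  # no frame matches
]
ranking = two_stage_retrieve(query, videos, k=2)
```

In this toy case the coarse stage prefers the second video, because averaging dilutes the single exactly matching frame of the first, while the text-gated stage moves the first video to the top. The expensive text-conditioned step runs on only k videos rather than the whole collection, which is the source of the efficiency gain the abstract describes.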

URL

https://arxiv.org/abs/2401.00701

PDF

https://arxiv.org/pdf/2401.00701.pdf
