Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval


Abstract

The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may not be expressive enough to match the video embedding and support retrieval. In this study, we propose a new stochastic text modeling method, T-MASS, in which text is modeled as a stochastic embedding to enrich the text embedding with a flexible and resilient semantic range, yielding a text mass. Specifically, we introduce a similarity-aware radius module that adapts the scale of the text mass to the given text-video pair. We also design a support text regularization to further control the text mass during training. The inference pipeline is likewise tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over the baseline (3% to 6.3% in R@1). T-MASS also achieves state-of-the-art performance on five benchmark datasets: MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.
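
The idea of a text mass can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the radius is taken as the exponential of the mean text-frame cosine similarity, the stochastic embedding is the text embedding plus radius-scaled Gaussian noise, and inference keeps the sampled embedding closest to the video. The function names (`similarity_aware_radius`, `sample_text_mass`, `best_similarity`) are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def similarity_aware_radius(text_emb, frame_embs):
    """Assumed radius: exp of the mean text-frame cosine similarity.

    text_emb: (dim,), frame_embs: (num_frames, dim). Returns a scalar.
    This is an illustrative choice, not the paper's exact module.
    """
    sims = l2_normalize(frame_embs) @ l2_normalize(text_emb)  # (num_frames,)
    return np.exp(sims.mean())


def sample_text_mass(text_emb, frame_embs, num_samples, rng):
    """Draw stochastic text embeddings around text_emb.

    Each sample is text_emb + radius * noise, with noise ~ N(0, I).
    Returns an array of shape (num_samples, dim).
    """
    radius = similarity_aware_radius(text_emb, frame_embs)
    noise = rng.standard_normal((num_samples, text_emb.shape[0]))
    return text_emb + radius * noise


def best_similarity(text_emb, frame_embs, video_emb, num_samples, rng):
    """Assumed inference rule: keep the highest cosine similarity
    between any sampled text embedding and the video embedding."""
    samples = sample_text_mass(text_emb, frame_embs, num_samples, rng)
    sims = l2_normalize(samples) @ l2_normalize(video_emb)  # (num_samples,)
    return float(sims.max())


# Toy usage with random features (no real encoder involved).
rng = np.random.default_rng(0)
text = rng.standard_normal(16)
frames = rng.standard_normal((4, 16))
video = frames.mean(axis=0)  # mean pooling, also an assumption
score = best_similarity(text, frames, video, num_samples=8, rng=rng)
```

In this sketch a text that agrees more strongly with the video frames receives a larger radius, so its samples cover a wider region of the embedding space; how the actual method learns and regularizes the radius is described in the paper itself.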

URL

https://arxiv.org/abs/2403.17998

PDF

https://arxiv.org/pdf/2403.17998.pdf
