Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

2024-03-26 17:59:52
Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao

Abstract

The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may not be expressive enough to capture the video embedding and support retrieval. In this study, we propose a new stochastic text modeling method, T-MASS, in which text is modeled as a stochastic embedding to enrich the text embedding with a flexible and resilient semantic range, yielding a text mass. Specifically, we introduce a similarity-aware radius module to adapt the scale of the text mass to the given text-video pairs. In addition, we design a support text regularization to further control the text mass during training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over the baseline (3% to 6.3% in R@1). T-MASS also achieves state-of-the-art performance on five benchmark datasets: MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.
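The idea of treating a text as a "mass" rather than a single point can be illustrated with a short NumPy sketch. This is not the authors' implementation: the abstract does not give the radius formula, so the function `similarity_aware_radius` below (the exponential of the mean text-frame cosine similarity), the mean-pooled video embedding, and the "keep the best-matching sample" scoring rule are all assumptions chosen only to make the concept concrete.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def similarity_aware_radius(text_emb, frame_embs):
    """Hypothetical radius: exp of the mean text-frame cosine similarity.

    Assumption for illustration only; the abstract says the radius
    adapts to the text-video pair but does not specify how.
    """
    sims = l2_normalize(frame_embs) @ l2_normalize(text_emb)  # (num_frames,)
    return float(np.exp(sims.mean()))  # always positive


def sample_text_mass(text_emb, frame_embs, num_samples, rng):
    """Draw stochastic text embeddings around the original text embedding."""
    radius = similarity_aware_radius(text_emb, frame_embs)
    noise = rng.standard_normal((num_samples, text_emb.shape[0]))
    return text_emb + radius * noise  # (num_samples, dim)


def best_sample_score(text_emb, frame_embs, num_samples=20, seed=0):
    """Score a text-video pair by the best-matching sample (assumed rule)."""
    rng = np.random.default_rng(seed)
    video_emb = frame_embs.mean(axis=0)  # assumed mean pooling over frames
    samples = sample_text_mass(text_emb, frame_embs, num_samples, rng)
    sims = l2_normalize(samples) @ l2_normalize(video_emb)
    return float(sims.max())


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    text = rng.standard_normal(16)          # toy text embedding
    frames = rng.standard_normal((8, 16))   # toy per-frame video embeddings
    print("radius:", similarity_aware_radius(text, frames))
    print("score:", best_sample_score(text, frames))
```

In this toy setup the radius grows when the text and the video frames agree, so the sampled embeddings cover a wider semantic range for well-matched pairs. A learned radius module and the support text regularization mentioned in the abstract would replace these fixed choices during training.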

URL

https://arxiv.org/abs/2403.17998

PDF

https://arxiv.org/pdf/2403.17998.pdf

