Paper Reading AI Learner

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

2023-03-17 10:07:19
Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

Abstract

Existing text-video retrieval solutions are, in essence, discriminative models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives, with the generator being optimized by a generation loss and the feature extractor trained with a contrastive loss. In this way, DiffusionRet cleverly leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo, show superior performance and justify the efficacy of our method. More encouragingly, without any modification, DiffusionRet even performs well in out-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code will be available at this https URL.
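The key shift is from the conditional to the joint distribution: since p(candidates, query) = p(candidates|query) p(query), modeling the joint also accounts for the query distribution itself, which is what helps with out-of-distribution inputs. The abstract describes a two-part training objective, so below is a minimal, hypothetical sketch of how such an objective could be combined: an InfoNCE contrastive loss on text/video features (discriminative branch) plus a DDPM-style epsilon-prediction loss on a noised relevance map (generative branch). The ToyDenoiser module, the fixed noise level, and all shapes are illustrative assumptions, not the authors' actual implementation.

# Hypothetical sketch of a combined contrastive + diffusion objective
# (not the official DiffusionRet code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Predicts the noise added to one query's row of relevance scores."""
    def __init__(self, num_candidates, time_dim=32):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, time_dim), nn.SiLU())
        self.net = nn.Sequential(
            nn.Linear(num_candidates + time_dim, 128), nn.SiLU(),
            nn.Linear(128, num_candidates),
        )

    def forward(self, noisy_row, t):
        # noisy_row: (B, N) noised relevance scores; t: (B, 1) noise level
        return self.net(torch.cat([noisy_row, self.time_embed(t)], dim=-1))

def info_nce(text_feat, video_feat, temperature=0.07):
    # Symmetric contrastive loss: matched text/video pairs sit on the diagonal.
    text_feat = F.normalize(text_feat, dim=-1)
    video_feat = F.normalize(video_feat, dim=-1)
    logits = text_feat @ video_feat.t() / temperature
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def training_step(text_feat, video_feat, denoiser, alpha_bar=0.5):
    # Discriminative branch: contrastive loss aligns the two feature spaces.
    loss_contrast = info_nce(text_feat, video_feat)

    # Generative branch: noise the ground-truth relevance map and train the
    # denoiser to recover the injected noise (epsilon prediction). A single
    # fixed noise level alpha_bar is used here purely for brevity.
    with torch.no_grad():
        target = torch.eye(text_feat.size(0))  # identity = matched pairs
    noise = torch.randn_like(target)
    noisy = alpha_bar ** 0.5 * target + (1 - alpha_bar) ** 0.5 * noise
    t = torch.full((text_feat.size(0), 1), alpha_bar)
    loss_gen = F.mse_loss(denoiser(noisy, t), noise)

    return loss_contrast + loss_gen

# Usage with random features standing in for real text/video encoders.
B = 4
denoiser = ToyDenoiser(num_candidates=B)
loss = training_step(torch.randn(B, 512), torch.randn(B, 512), denoiser)
loss.backward()

At retrieval time, the generative branch would start from noise and iteratively denoise a relevance map conditioned on the query, while the contrastively trained encoders supply the features; the sketch only illustrates how the two losses can coexist in one training step.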

Abstract (translated)

Existing text-video retrieval solutions are essentially discriminative models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it difficult to identify out-of-distribution data. To address this limitation, we creatively approach the task from a generative perspective and model the correlation between text and video as their joint probability p(candidates,query). This is achieved through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generative and discriminative perspectives: the generator is optimized with a generation loss, and the feature extractor is trained with a contrastive loss. In this way, DiffusionRet cleverly leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo, show superior performance and demonstrate the effectiveness of our method. More encouragingly, without any modification, DiffusionRet also performs well in cross-domain retrieval settings. We believe this work brings fundamental insights to the related fields. Code will be available at this https URL.

URL

https://arxiv.org/abs/2303.09867

PDF

https://arxiv.org/pdf/2303.09867.pdf

