Abstract
Existing text-video retrieval solutions are, in essence, discriminative models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates, query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives, with the generator optimized by a generation loss and the feature extractor trained with a contrastive loss. In this way, DiffusionRet cleverly leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks (MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo) show superior performance, justifying the efficacy of our method. More encouragingly, without any modification, DiffusionRet even performs well in out-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code will be available at this https URL.
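The abstract describes training from two perspectives: a contrastive loss for the feature extractor (the discriminative side) and a generation loss for the diffusion generator (the generative side). The following is a minimal NumPy sketch of what those two objectives commonly look like, assuming a symmetric InfoNCE contrastive loss and a standard noise-prediction (MSE) diffusion objective; the function names, shapes, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce_loss(sim, temperature=0.07):
    """Symmetric InfoNCE contrastive loss over a text-video similarity matrix.

    sim[i, j] is the similarity between text i and video j; matched
    text-video pairs lie on the diagonal.
    """
    logits = sim / temperature

    def ce_diag(l):
        # Cross-entropy of each row's softmax against the diagonal target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))

def diffusion_generation_loss(pred_noise, true_noise):
    """Denoising objective: MSE between the noise the generator predicts
    and the noise actually added at the sampled diffusion step."""
    return np.mean((pred_noise - true_noise) ** 2)

# Toy batch: strong diagonal similarities and a near-perfect denoiser.
rng = np.random.default_rng(0)
sim = np.eye(4) * 5 + rng.normal(scale=0.1, size=(4, 4))
pred = rng.normal(size=(4, 8))
true = pred + rng.normal(scale=0.01, size=(4, 8))

total = info_nce_loss(sim) + diffusion_generation_loss(pred, true)
```

In practice the two losses would be combined (possibly with a weighting coefficient) and backpropagated through a neural feature extractor and denoiser; this sketch only illustrates the shape of the joint objective.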
URL
https://arxiv.org/abs/2303.09867