Abstract
Composed image retrieval aims to find the image that best matches a given multi-modal user query, consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and, at test time, compare these to a reference image embedding modified by the query text. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector-distance metric to quickly prune the candidate set. Our second stage then employs a dual-encoder architecture, which effectively attends to the input reference-text-candidate triplet and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
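The retrieve-then-rerank pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names, the cosine-similarity pruning metric, and the `rerank_score` callback (standing in for the paper's dual-encoder triplet scorer) are assumptions made for the example.

```python
# Hedged sketch of a two-stage retrieve-then-rerank pipeline.
# Names and the toy scorer are illustrative, not the paper's API.
import numpy as np

def two_stage_retrieval(query_emb, candidate_embs, rerank_score, k=5):
    """Stage 1: fast vector-distance pruning over pre-computed embeddings.
    Stage 2: expensive triplet scoring over only the top-k survivors."""
    # Stage 1: cosine similarity against pre-computed candidate embeddings.
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q
    top_k = np.argsort(-sims)[:k]  # indices of the k most similar candidates

    # Stage 2: re-rank the shortlist with a (slow) scorer that could attend
    # jointly to the reference image, query text, and candidate image.
    scores = np.array([rerank_score(i) for i in top_k])
    order = np.argsort(-scores)
    return [int(top_k[j]) for j in order]

# Toy usage: 100 random candidates; candidate 42 is the planted match,
# and a dummy scorer prefers it among the shortlist.
rng = np.random.default_rng(0)
cands = rng.normal(size=(100, 16))
query = cands[42] + 0.01 * rng.normal(size=16)
ranking = two_stage_retrieval(query, cands, rerank_score=lambda i: -abs(i - 42))
print(ranking[0])  # candidate 42 survives pruning and wins re-ranking
```

The design point is that the expensive scorer runs on only k candidates instead of the full corpus, so the overall cost stays close to that of the fast first stage.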
URL
https://arxiv.org/abs/2305.16304