Abstract
Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap: even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Finally, an in-domain training experiment demonstrates the utility of our benchmark: it clarifies the task challenges by distinguishing categories that are solvable with targeted data from those that expose intrinsic limitations of current model architectures.
URL
https://arxiv.org/abs/2601.16125