Foundation models (FMs) have shown transformative potential in radiology by performing diverse, complex tasks across imaging modalities. Here, we developed CT-FM, a large-scale 3D image-based pre-trained model designed explicitly for various radiological tasks. CT-FM was pre-trained using 148,000 computed tomography (CT) scans from the Imaging Data Commons through label-agnostic contrastive learning. We evaluated CT-FM across four categories of tasks, namely, whole-body and tumor segmentation, head CT triage, medical image retrieval, and semantic understanding, showing superior performance against state-of-the-art models. Beyond quantitative success, CT-FM demonstrated the ability to cluster regions anatomically and identify similar anatomical and structural concepts across scans. Furthermore, it remained robust across test-retest settings and indicated reasonable salient regions attached to its embeddings. This study demonstrates the value of large-scale medical imaging foundation models and by open-sourcing the model weights, code, and data, aims to support more adaptable, reliable, and interpretable AI solutions in radiology.
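The abstract does not detail the pretraining objective beyond label-agnostic contrastive learning; the sketch below shows what such an objective could look like on 3D CT patches, using a SimCLR-style NT-Xent loss. The tiny encoder, patch size, and noise "augmentation" are placeholders, not CT-FM's actual architecture or recipe.

```python
# Minimal sketch of label-agnostic contrastive pretraining on 3D CT patches
# (SimCLR-style NT-Xent loss). Encoder and augmentations are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Small3DEncoder(nn.Module):
    """Tiny 3D CNN standing in for the CT backbone (hypothetical)."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(32, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return F.normalize(self.proj(h), dim=-1)

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Two augmented views of the same patch are positives; all other patches are negatives."""
    b = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                                   # (2B, D), L2-normalized
    sim = z @ z.t() / tau                                            # cosine similarities / temperature
    sim = sim.masked_fill(torch.eye(2 * b, dtype=torch.bool, device=z.device), float("-inf"))
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)

if __name__ == "__main__":
    enc = Small3DEncoder()
    view1 = torch.randn(4, 1, 32, 32, 32)                            # batch of 1-channel CT patches
    view2 = view1 + 0.05 * torch.randn_like(view1)                   # stand-in for real CT augmentations
    loss = nt_xent(enc(view1), enc(view2))
    loss.backward()
    print(f"contrastive loss: {loss.item():.3f}")
```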
https://arxiv.org/abs/2501.09001
We study image segmentation in the biological domain, particularly trait and part segmentation from specimen images (e.g., butterfly wing stripes or beetle body parts). This is a crucial, fine-grained task that aids in understanding the biology of organisms. The conventional approach involves hand-labeling masks, often for hundreds of images per species, and training a segmentation model to generalize these labels to other images, which can be exceedingly laborious. We present a label-efficient method named Static Segmentation by Tracking (SST). SST is built upon the insight: while specimens of the same species have inherent variations, the traits and parts we aim to segment show up consistently. This motivates us to concatenate specimen images into a "pseudo-video" and reframe trait and part segmentation as a tracking problem. Concretely, SST generates masks for unlabeled images by propagating annotated or predicted masks from the "pseudo-preceding" images. Powered by Segment Anything Model 2 (SAM 2) initially developed for video segmentation, we show that SST can achieve high-quality trait and part segmentation with merely one labeled image per species -- a breakthrough for analyzing specimen images. We further develop a cycle-consistent loss to fine-tune the model, again using one labeled image. Additionally, we highlight the broader potential of SST, including one-shot instance segmentation on images taken in the wild and trait-based image retrieval.
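For readers who want to try the pseudo-video trick, here is a minimal sketch that seeds SAM 2's video predictor with one annotated mask and propagates it across unlabeled specimen images. The calls follow the interface of the official sam2 repository; the checkpoint path, config name, frame directory, and mask file are placeholders rather than the authors' exact pipeline.

```python
# Sketch of Static Segmentation by Tracking (SST): treat a folder of specimen images
# as a "pseudo-video" and let SAM 2's video predictor propagate one labeled mask.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

CHECKPOINT = "checkpoints/sam2_hiera_large.pt"        # placeholder path
MODEL_CFG = "sam2_hiera_l.yaml"                       # placeholder config name
FRAME_DIR = "pseudo_video/monarch_dorsal"             # JPEG frames: 1 labeled + N unlabeled specimens

predictor = build_sam2_video_predictor(MODEL_CFG, CHECKPOINT)

with torch.inference_mode():
    state = predictor.init_state(video_path=FRAME_DIR)

    # Frame 0 is the single hand-labeled specimen; seed it with the annotated trait mask.
    seed_mask = np.load("annotations/frame0_forewing_band.npy")     # (H, W) bool, placeholder
    predictor.add_new_mask(inference_state=state, frame_idx=0, obj_id=1, mask=seed_mask)

    # "Track" the trait across the pseudo-video, i.e. segment every unlabeled specimen.
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0, 0] > 0).cpu().numpy()

print(f"propagated trait masks for {len(masks)} specimens")
```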
https://arxiv.org/abs/2501.06749
Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
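A minimal sketch of the proxy-supervision idea follows: a small composition network maps the reference-image embedding and the modification-text embedding to a composed embedding, which is contrastively matched against the target caption's text embedding as a stand-in for the unavailable target image. The composer architecture and tensor names are illustrative assumptions, not SCOT's exact design.

```python
# Sketch of the SCOT idea: compose(image_emb, modification_text_emb) should land near
# the TEXT embedding of the target caption, used as a proxy for the target image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Composer(nn.Module):
    """Hypothetical embedding-composition network (MLP over concatenated embeddings)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb: torch.Tensor, mod_emb: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(torch.cat([img_emb, mod_emb], dim=-1)), dim=-1)

def info_nce(pred: torch.Tensor, target: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: row i of `pred` should match row i of `target`."""
    logits = pred @ target.t() / tau
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    dim, batch = 512, 8
    composer = Composer(dim)
    # Frozen CLIP-style embeddings (random stand-ins here):
    query_image_emb = F.normalize(torch.randn(batch, dim), dim=-1)     # reference image
    modification_emb = F.normalize(torch.randn(batch, dim), dim=-1)    # LLM-generated edit text
    target_caption_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # proxy target (text, not image)

    loss = info_nce(composer(query_image_emb, modification_emb), target_caption_emb)
    loss.backward()
    print(f"proxy-supervised contrastive loss: {loss.item():.3f}")
```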
https://arxiv.org/abs/2501.08347
Current methods for searching brain MR images rely on text-based approaches, highlighting a significant need for content-based image retrieval (CBIR) systems. Directly applying 3D brain MR images to machine learning models offers the benefit of effectively learning the brain's structure; however, building the generalized model necessitates a large amount of training data. While models that consider depth direction and utilize continuous 2D slices have demonstrated success in segmentation and classification tasks involving 3D data, concerns remain. Specifically, using general 2D slices may lead to the oversight of pathological features and discontinuities in depth direction information. Furthermore, to the best of the authors' knowledge, there have been no attempts to develop a practical CBIR system that preserves the entire brain's structural information. In this study, we propose an interpretable CBIR method for brain MR images, named iCBIR-Sli (Interpretable CBIR with 2D Slice Embedding), which, for the first time globally, utilizes a series of 2D slices. iCBIR-Sli addresses the challenges associated with using 2D slices by effectively aggregating slice information, thereby achieving low-dimensional representations with high completeness, usability, robustness, and interoperability, which are qualities essential for effective CBIR. In retrieval evaluation experiments utilizing five publicly available brain MR datasets (ADNI2/3, OASIS3/4, AIBL) for Alzheimer's disease and cognitively normal subjects, iCBIR-Sli demonstrated top-1 retrieval performance (macro F1 = 0.859), comparable to existing deep learning models explicitly designed for classification, without the need for an external classifier. Additionally, the method provided high interpretability by clearly identifying the brain regions indicative of the searched-for disease.
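The sketch below illustrates the slice-series idea in a stripped-down form: each 2D slice is encoded independently, the slice embeddings are aggregated into one low-dimensional brain descriptor, and retrieval is nearest-neighbour search over those descriptors. The encoder and mean-pooling aggregation are assumptions, not iCBIR-Sli's exact mechanism.

```python
# Sketch: encode a 3D brain MR volume as a series of 2D slices, pool the slice
# embeddings into one descriptor, and retrieve by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceSeriesEncoder(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.slice_encoder = nn.Sequential(               # stand-in 2D encoder shared across slices
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, embed_dim),
        )

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        # volume: (B, S, H, W) -> treat every slice as an independent 2D image.
        b, s, h, w = volume.shape
        slice_emb = self.slice_encoder(volume.reshape(b * s, 1, h, w)).reshape(b, s, -1)
        return F.normalize(slice_emb.mean(dim=1), dim=-1)  # aggregate slices into one brain descriptor

if __name__ == "__main__":
    enc = SliceSeriesEncoder()
    gallery = enc(torch.randn(20, 32, 96, 96))             # 20 pre-encoded brain volumes (stand-in)
    query = enc(torch.randn(1, 32, 96, 96))
    scores = query @ gallery.t()
    print("top-1 retrieved volume:", scores.argmax(dim=1).item())
```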
https://arxiv.org/abs/2501.01642
When conducting large-scale studies that collect brain MR images from multiple facilities, the impact of differences in imaging equipment and protocols at each site cannot be ignored, and this domain gap has become a significant issue in recent years. In this study, we propose a new low-dimensional representation (LDR) acquisition method called style encoder adversarial domain adaptation (SE-ADA) to realize content-based image retrieval (CBIR) of brain MR images. SE-ADA reduces domain differences while preserving pathological features by separating domain-specific information from LDR and minimizing domain differences using adversarial learning. In evaluation experiments comparing SE-ADA with recent domain harmonization methods on eight public brain MR datasets (ADNI1/2/3, OASIS1/2/3/4, PPMI), SE-ADA effectively removed domain information while preserving key aspects of the original brain structure and demonstrated the highest disease search accuracy.
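A minimal sketch of the adversarial separation follows: the encoder splits features into a pathology-bearing LDR and a style code, a site discriminator tries to recover the acquisition site from the LDR, and the encoder is trained to defeat it. Architectures, dimensions, and loss weights are illustrative assumptions, and the reconstruction term that recombines LDR and style is omitted.

```python
# Sketch of the SE-ADA idea: separate domain/style information from the LDR and
# adversarially remove site information from the retrieval embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoders(nn.Module):
    """Splits input features into a pathology-bearing LDR and a site/scanner style code."""
    def __init__(self, in_dim: int = 2048, ldr_dim: int = 64, style_dim: int = 16):
        super().__init__()
        self.content = nn.Linear(in_dim, ldr_dim)      # LDR used for CBIR
        self.style = nn.Linear(in_dim, style_dim)      # domain-specific information

    def forward(self, x):
        return self.content(x), self.style(x)

enc = Encoders()
site_discriminator = nn.Linear(64, 8)                  # predicts which of 8 sites the LDR came from
opt_enc = torch.optim.Adam(enc.parameters(), lr=1e-4)
opt_disc = torch.optim.Adam(site_discriminator.parameters(), lr=1e-4)

features = torch.randn(32, 2048)                       # stand-in for per-scan image features
site_labels = torch.randint(0, 8, (32,))

# 1) Discriminator step: learn to recognise the acquisition site from the LDR.
ldr, _ = enc(features)
disc_loss = F.cross_entropy(site_discriminator(ldr.detach()), site_labels)
opt_disc.zero_grad(); disc_loss.backward(); opt_disc.step()

# 2) Encoder step: fool the discriminator (push its prediction toward uniform) so the
#    LDR sheds site-specific style while keeping pathology. A reconstruction term that
#    recombines LDR and style is omitted from this sketch.
ldr, _ = enc(features)
adv_loss = -F.log_softmax(site_discriminator(ldr), dim=-1).mean()
opt_enc.zero_grad(); adv_loss.backward(); opt_enc.step()
print(f"disc loss {disc_loss.item():.3f} | adversarial loss {adv_loss.item():.3f}")
```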
https://arxiv.org/abs/2501.01326
As working with large datasets becomes standard, the task of accurately retrieving images containing objects of interest by an open set textual query gains practical importance. The current leading approach utilizes a pre-trained CLIP model without any adaptation to the target domain, balancing accuracy and efficiency through additional post-processing. In this work, we propose FOR: Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows finetuning on a target dataset using closed-set labels while keeping the visual-language association crucial for open vocabulary retrieval. FOR is based on two design elements: a specialized decoder variant of the CLIP head customized for the intended task, and its coupling within a multi-objective training framework. Together, these design choices result in a significant increase in accuracy, showcasing improvements of up to 8 mAP@50 points over SoTA across three datasets. Additionally, we demonstrate that FOR is also effective in a semi-supervised setting, achieving impressive results even when only a small portion of the dataset is labeled.
https://arxiv.org/abs/2412.18806
ERVD: An Efficient and Robust ViT-Based Distillation Framework for Remote Sensing Image Retrieval
https://arxiv.org/abs/2412.18136
Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OpenStreetMap (OSM) data. However, most studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data based on scene text. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization information. Building on this, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at this https URL.
https://arxiv.org/abs/2412.17007
Growing labor shortages are increasing the demand for domestic service robots (DSRs) to assist in various settings. In this study, we develop a DSR that transports everyday objects to specified pieces of furniture based on open-vocabulary instructions. Our approach focuses on retrieving images of target objects and receptacles from pre-collected images of indoor environments. For example, given an instruction "Please get the right red towel hanging on the metal towel rack and put it in the white washing machine on the left," the DSR is expected to carry the red towel to the washing machine based on the retrieved images. This is challenging because the correct images should be retrieved from thousands of collected images, which may include many images of similar towels and appliances. To address this, we propose RelaX-Former, which learns diverse and robust representations from among positive, unlabeled positive, and negative samples. We evaluated RelaX-Former on a dataset containing real-world indoor images and human annotated instructions including complex referring expressions. The experimental results demonstrate that RelaX-Former outperformed existing baseline models across standard image retrieval metrics. Moreover, we performed physical experiments using a DSR to evaluate the performance of our approach in a zero-shot transfer setting. The experiments involved the DSR carrying objects to specific receptacles based on open-vocabulary instructions, achieving an overall success rate of 75%.
https://arxiv.org/abs/2412.16576
Compositional understanding allows visual language models to interpret complex relationships between objects, attributes, and relations in images and text. However, most existing methods often rely on hard negative examples and fine-tuning, which can overestimate improvements and are limited by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. We propose introducing "no" logical regularization to address the issue of token interaction in inversion. Additionally, we suggest using knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms the existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark, and also achieves significant improvements in image retrieval tasks.
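The toy sketch below shows the mechanics behind the textual-inversion step and the distillation shortcut: a pseudo-token embedding is optimized so that the text encoding of a prompt containing it matches the image embedding, and a feed-forward mapper is then distilled to predict that pseudo-token directly. The tiny text encoder is a stand-in for frozen CLIP, and the "no" logical regularization from the paper is not modeled.

```python
# Toy sketch of textual inversion plus distillation: fit a pseudo-token to an image
# embedding, then distill the per-image optimization into a feed-forward mapper.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256

class ToyTextEncoder(nn.Module):
    """Stand-in for CLIP's frozen text encoder: pools prompt-token + pseudo-token embeddings."""
    def __init__(self):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(4, DIM), requires_grad=False)  # "a photo of" tokens
        self.pool = nn.Linear(DIM, DIM)

    def forward(self, pseudo_token: torch.Tensor) -> torch.Tensor:            # pseudo_token: (B, DIM)
        tokens = torch.cat([self.prompt.expand(pseudo_token.size(0), -1, -1),
                            pseudo_token.unsqueeze(1)], dim=1)                # (B, 5, DIM)
        return F.normalize(self.pool(tokens.mean(dim=1)), dim=-1)

text_enc = ToyTextEncoder().requires_grad_(False)
image_emb = F.normalize(torch.randn(1, DIM), dim=-1)       # frozen CLIP image embedding (stand-in)

# 1) Per-image textual inversion: fit one pseudo-token to this image.
pseudo = nn.Parameter(torch.zeros(1, DIM))
opt = torch.optim.Adam([pseudo], lr=0.05)
for _ in range(200):
    loss = 1 - F.cosine_similarity(text_enc(pseudo), image_emb).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Knowledge distillation: a mapper predicts the pseudo-token directly from the
#    image embedding, avoiding the slow per-image optimization at inference time.
mapper = nn.Linear(DIM, DIM)
opt_m = torch.optim.Adam(mapper.parameters(), lr=1e-3)
for _ in range(200):
    distill_loss = F.mse_loss(mapper(image_emb), pseudo.detach())
    opt_m.zero_grad(); distill_loss.backward(); opt_m.step()

print(f"inversion loss {loss.item():.3f} | distillation loss {distill_loss.item():.4f}")
```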
https://arxiv.org/abs/2412.15632
Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning -- where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke learned architectures to directly address the shortcomings in compositional learning. In this work, we focus on simple, and scalable approaches. In particular, we demonstrate that by substantially improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and surpasses all bespoke architectures. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.
https://arxiv.org/abs/2412.15396
Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70$\times$ more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. At this stage, we have produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.
https://arxiv.org/abs/2412.14475
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of ''Maybe you are looking for''. To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although still relatively far from human performance, both LLM-based and captioning-based methods achieve competitive results on CroQS in our experiments, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods and the notebooks containing our experiments are available here: this https URL
https://arxiv.org/abs/2412.13834
This paper addresses supervised deep metric learning for open-set image retrieval, focusing on three key aspects: the loss function, mixup regularization, and model initialization. In deep metric learning, optimizing the retrieval evaluation metric, recall@k, via gradient descent is desirable but challenging due to its non-differentiable nature. To overcome this, we propose a differentiable surrogate loss that is computed on large batches, nearly equivalent to the entire training set. This computationally intensive process is made feasible through an implementation that bypasses the GPU memory limitations. Additionally, we introduce an efficient mixup regularization technique that operates on pairwise scalar similarities, effectively increasing the batch size even further. The training process is further enhanced by initializing the vision encoder using foundational models, which are pre-trained on large-scale datasets. Through a systematic study of these components, we demonstrate that their synergy enables large models to nearly solve popular benchmarks.
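A compact sketch of a sigmoid-relaxed recall@k surrogate is given below: the positive's rank is softened with sigmoids over pairwise similarity differences, and the top-k indicator is softened the same way, yielding a loss that can be backpropagated. The temperatures and the large-batch, memory-bypassing machinery described in the abstract are not reproduced here.

```python
# Sketch of a differentiable recall@k surrogate: relax the hard "is the positive
# ranked in the top k" indicator with sigmoids so it can be optimized by gradient descent.
import torch
import torch.nn.functional as F

def recall_at_k_surrogate(query: torch.Tensor, gallery: torch.Tensor,
                          positive_idx: torch.Tensor, k: int = 1,
                          tau_rank: float = 0.01, tau_k: float = 0.1) -> torch.Tensor:
    """query: (B, D) embeddings, gallery: (N, D), positive_idx: (B,) index of each query's positive."""
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery, dim=-1)
    sims = query @ gallery.t()                                     # (B, N)
    pos_sim = sims.gather(1, positive_idx.unsqueeze(1))            # (B, 1)

    # Soft rank of the positive: 1 + how many gallery items score (softly) above it.
    above = torch.sigmoid((sims - pos_sim) / tau_rank)             # ~1 where s_ij > s_i,pos
    above = above.scatter(1, positive_idx.unsqueeze(1), 0.0)       # don't count the positive itself
    soft_rank = 1.0 + above.sum(dim=1)                             # (B,)

    # Soft indicator that the positive lands in the top k.
    soft_recall = torch.sigmoid((k + 0.5 - soft_rank) / tau_k)
    return 1.0 - soft_recall.mean()                                # loss to minimize

if __name__ == "__main__":
    q = torch.randn(16, 128, requires_grad=True)
    g = torch.randn(100, 128)
    pos = torch.randint(0, 100, (16,))
    loss = recall_at_k_surrogate(q, g, pos, k=5)
    loss.backward()
    print(f"recall@5 surrogate loss: {loss.item():.3f}")
```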
https://arxiv.org/abs/2412.12432
Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code will be available at this https URL.
https://arxiv.org/abs/2412.11077
Visual Place Recognition (VPR) aims to robustly identify locations by leveraging image retrieval based on descriptors encoded from environmental images. However, drastic appearance changes of images captured from different viewpoints at the same location pose incoherent supervision signals for descriptor learning, which severely hinder the performance of VPR. Previous work proposes classifying images based on manually defined rules or ground truth labels for viewpoints, followed by descriptor training based on the classification results. However, not all datasets have ground truth labels of viewpoints and manually defined rules may be suboptimal, leading to degraded descriptor quality. To address these challenges, we introduce the mutual learning of viewpoint self-classification and VPR. Starting from coarse classification based on geographical coordinates, we progress to finer classification of viewpoints using simple clustering techniques. The dataset is partitioned in an unsupervised manner while simultaneously training a descriptor extractor for place recognition. Experimental results show that this approach almost perfectly partitions the dataset based on viewpoints, thus achieving mutually reinforcing effects. Our method even surpasses state-of-the-art (SOTA) methods that partition datasets using ground truth labels.
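The sketch below illustrates the unsupervised partitioning step: images are first bucketed coarsely by geographic coordinates, then the current descriptors inside each bucket are clustered into viewpoint pseudo-labels that can supervise further descriptor training. The grid size, cluster count, and use of k-means are assumptions, not necessarily the paper's clustering choice.

```python
# Sketch of viewpoint self-classification: coarse grouping by coordinates, then
# clustering of current descriptors into viewpoint pseudo-labels.
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def coarse_place_id(utm_east: float, utm_north: float, cell_m: float = 25.0) -> tuple:
    """Coarse classification: bucket images into ~25 m grid cells by coordinates."""
    return (int(utm_east // cell_m), int(utm_north // cell_m))

def viewpoint_pseudo_labels(descriptors: np.ndarray, coords: np.ndarray,
                            n_viewpoints: int = 3) -> np.ndarray:
    """Finer classification: cluster descriptors inside each place cell into viewpoints."""
    labels = np.zeros(len(descriptors), dtype=np.int64)
    cells = defaultdict(list)
    for i, (e, n) in enumerate(coords):
        cells[coarse_place_id(e, n)].append(i)
    for idx in cells.values():
        idx = np.asarray(idx)
        k = min(n_viewpoints, len(idx))
        labels[idx] = KMeans(n_clusters=k, n_init=10).fit_predict(descriptors[idx])
    return labels  # used as pseudo-labels while the descriptor extractor keeps training

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descs = rng.normal(size=(200, 64)).astype(np.float32)   # current VPR descriptors (stand-in)
    coords = rng.uniform(0, 100, size=(200, 2))             # UTM-like coordinates (stand-in)
    pseudo = viewpoint_pseudo_labels(descs, coords)
    print("viewpoint label counts:", np.bincount(pseudo))
```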
https://arxiv.org/abs/2412.09199
Deep supervised hashing has become a pivotal technique in large-scale image retrieval, offering significant benefits in terms of storage and search efficiency. However, existing deep supervised hashing models predominantly focus on generating fixed-length hash codes. This approach fails to address the inherent trade-off between efficiency and effectiveness when using hash codes of varying lengths. To determine the optimal hash code length for a specific task, multiple models must be trained for different lengths, leading to increased training time and computational overhead. Furthermore, the current paradigm overlooks the potential relationships between hash codes of different lengths, limiting the overall effectiveness of the models. To address these challenges, we propose the Nested Hash Layer (NHL), a plug-and-play module designed for existing deep supervised hashing models. The NHL framework introduces a novel mechanism to simultaneously generate hash codes of varying lengths in a nested manner. To tackle the optimization conflicts arising from the multiple learning objectives associated with different code lengths, we further propose an adaptive weights strategy that dynamically monitors and adjusts gradients during training. Additionally, recognizing that the structural information in longer hash codes can provide valuable guidance for shorter hash codes, we develop a long-short cascade self-distillation method within the NHL to enhance the overall quality of the generated hash codes. Extensive experiments demonstrate that NHL not only accelerates the training process but also achieves superior retrieval performance across various deep hashing models. Our code is publicly available at this https URL.
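A minimal sketch of the nested idea follows: a single head emits the longest code and shorter codes are read off as prefixes, so one forward pass yields hash codes of every length. The adaptive gradient weighting is simplified to fixed weights and the long-to-short cascade self-distillation is shown as a similarity-matching term; this is a simplified reading, not the NHL implementation.

```python
# Sketch of a nested hash layer: shorter codes are prefixes of the longest code,
# with a toy supervised loss per length and a long-to-short distillation term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedHashHead(nn.Module):
    def __init__(self, feat_dim: int = 512, code_lengths=(16, 32, 64)):
        super().__init__()
        self.code_lengths = sorted(code_lengths)
        self.fc = nn.Linear(feat_dim, self.code_lengths[-1])

    def forward(self, features: torch.Tensor):
        full = torch.tanh(self.fc(features))                  # relaxed bits in [-1, 1]
        return {L: full[:, :L] for L in self.code_lengths}    # nested prefixes

def pairwise_hash_loss(codes: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Toy supervised objective: code similarity should match label agreement."""
    sim = codes @ codes.t() / codes.size(1)                   # in [-1, 1]
    target = (labels.unsqueeze(0) == labels.unsqueeze(1)).float() * 2 - 1
    return F.mse_loss(sim, target)

def cascade_distillation(codes_by_len: dict) -> torch.Tensor:
    """Longer codes guide shorter ones: match their pairwise similarity structures."""
    lengths = sorted(codes_by_len)
    loss = codes_by_len[lengths[0]].new_zeros(())
    for short, long in zip(lengths[:-1], lengths[1:]):
        s = codes_by_len[short] @ codes_by_len[short].t() / short
        t = (codes_by_len[long] @ codes_by_len[long].t() / long).detach()
        loss = loss + F.mse_loss(s, t)
    return loss

if __name__ == "__main__":
    head = NestedHashHead()
    feats = torch.randn(32, 512)                              # backbone features (stand-in)
    labels = torch.randint(0, 10, (32,))
    codes = head(feats)
    loss = sum(pairwise_hash_loss(c, labels) for c in codes.values())
    loss = loss + 0.1 * cascade_distillation(codes)
    loss.backward()
    # At inference, binarize the prefix of the desired length:
    binary_32 = torch.sign(codes[32]).detach()
    print(f"loss {loss.item():.3f} | 32-bit codes {tuple(binary_32.shape)}")
```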
https://arxiv.org/abs/2412.08922
Image retrieval methods rely on metric learning to train backbone feature extraction models that can extract discriminant queries and reference (gallery) feature representations for similarity matching. Although state-of-the-art accuracy has improved considerably with the advent of deep learning (DL) models trained on large datasets, image retrieval remains challenging in many real-world video analytics and surveillance applications, e.g., person re-identification. Using the Euclidean space for matching limits the performance in real-world applications due to the curse of dimensionality, overfitting, and sensitivity to noisy data. We argue that the feature dissimilarity space is more suitable for similarity matching, and propose a dichotomy transformation to project query and reference embeddings into a single embedding in the dissimilarity space. We also advocate for end-to-end training of a backbone and binary classification models for pair-wise matching. As opposed to comparing the distance between queries and reference embeddings, we show the benefits of classifying the single dissimilarity space embedding (as similar or dissimilar), especially when trained end-to-end. We propose a method to train the max-margin classifier together with the backbone feature extractor by applying constraints to the L2 norm of the classifier weights along with the hinge loss. Our extensive experiments on challenging image retrieval datasets and using diverse feature extraction backbones highlight the benefits of similarity matching in the dissimilarity space. In particular, when jointly training the feature extraction backbone and regularised classifier for matching, the dissimilarity space provides a higher level of accuracy.
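The sketch below shows the dichotomy transformation and the constrained max-margin pair classifier in miniature: a query/reference pair is mapped to a single dissimilarity-space embedding d = |f(q) - f(g)|, which a binary classifier labels as similar or dissimilar under a hinge loss with a soft penalty on the classifier's L2 weight norm. The backbone and the constraint value are stand-ins, not the paper's exact configuration.

```python
# Sketch of similarity matching in the dissimilarity space with a max-margin
# pair classifier trained end-to-end with the backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 128))  # stand-in extractor
classifier = nn.Linear(128, 1, bias=True)                                       # max-margin pair classifier
opt = torch.optim.SGD(list(backbone.parameters()) + list(classifier.parameters()), lr=1e-2)

queries = torch.randn(64, 1024)                       # query images' raw features (stand-in)
gallery = torch.randn(64, 1024)                       # reference images' raw features (stand-in)
same_id = torch.randint(0, 2, (64,)) * 2 - 1          # +1 = same identity, -1 = different

# Dichotomy transformation: project the pair into ONE dissimilarity-space embedding.
d = (backbone(queries) - backbone(gallery)).abs()

# Hinge loss on the pair label, plus a penalty keeping ||w||_2 bounded (soft constraint).
score = classifier(d).squeeze(1)
hinge = F.relu(1.0 - same_id * score).mean()
w_norm_penalty = F.relu(classifier.weight.norm(p=2) - 1.0) ** 2
loss = hinge + 0.1 * w_norm_penalty

opt.zero_grad(); loss.backward(); opt.step()
print(f"hinge {hinge.item():.3f} | ||w|| {classifier.weight.norm().item():.3f}")
```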
https://arxiv.org/abs/2412.08618
Purpose: Intraoperative ultrasound (US) can enhance real-time visualization in transoral robotic surgery. The surgeon creates a mental map with a pre-operative scan. Then, a surgical assistant performs freehand US scanning during the surgery while the surgeon operates at the remote surgical console. Communicating the target scanning plane in the surgeon's mental map is difficult. Automatic image retrieval can help match intraoperative images to preoperative scans, guiding the assistant to adjust the US probe toward the target plane. Methods: We propose a self-supervised contrastive learning approach to match intraoperative US views to a preoperative image database. We introduce a novel contrastive learning strategy that leverages intra-sweep similarity and US probe location to improve feature encoding. Additionally, our model incorporates a flexible threshold to reject unsatisfactory matches. Results: Our method achieves 92.30% retrieval accuracy on simulated data and outperforms state-of-the-art temporal-based contrastive learning approaches. Our ablation study demonstrates that using probe location in the optimization goal improves image representation, suggesting that semantic information can be extracted from probe location. We also present our approach on real patient data to show the feasibility of the proposed US probe localization system despite tissue deformation from tongue retraction. Conclusion: Our contrastive learning method, which utilizes intra-sweep similarity and US probe location, enhances US image representation learning. We also demonstrate the feasibility of using our image retrieval method to provide neck US localization on real patient US after tongue retraction.
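A small sketch of how intra-sweep similarity and probe location could enter a contrastive objective is shown below: frames from the same sweep act as soft positives, weighted by a Gaussian kernel over the tracked probe positions. The kernel, its bandwidth, and the toy encoder are assumptions rather than the paper's exact formulation.

```python
# Sketch of contrastive learning with intra-sweep similarity and probe location:
# soft positives within a sweep, weighted by probe-pose distance.
import torch
import torch.nn as nn
import torch.nn.functional as F

def probe_weighted_contrastive(emb: torch.Tensor, probe_xyz: torch.Tensor,
                               tau: float = 0.1, sigma_mm: float = 10.0) -> torch.Tensor:
    """emb: (N, D) embeddings of frames from one sweep; probe_xyz: (N, 3) probe positions (mm)."""
    emb = F.normalize(emb, dim=-1)
    logits = emb @ emb.t() / tau                                    # (N, N)
    dist = torch.cdist(probe_xyz, probe_xyz)                        # probe-location distances
    weights = torch.exp(-(dist ** 2) / (2 * sigma_mm ** 2))         # soft positive weights
    weights.fill_diagonal_(0.0)                                     # exclude self-pairs
    weights = weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-8)
    mask = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    log_prob = F.log_softmax(logits.masked_fill(mask, -1e9), dim=1) # self-similarity masked out
    return -(weights * log_prob).sum(dim=1).mean()                  # soft-target InfoNCE

if __name__ == "__main__":
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128))  # stand-in for the US image encoder
    frames = torch.randn(16, 1, 64, 64)                             # frames from one freehand sweep
    poses = torch.cumsum(torch.randn(16, 3), dim=0) * 2.0           # tracked probe positions (stand-in)
    loss = probe_weighted_contrastive(encoder(frames), poses)
    loss.backward()
    print(f"intra-sweep contrastive loss: {loss.item():.3f}")
```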
https://arxiv.org/abs/2412.07741
Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference. CIR is inherently an instruction-following task, as the model needs to interpret and apply modifications to the image. In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable. While existing ZS-CIR models based on CLIP have shown promising results, their capability in interpreting and following modification instructions remains limited. Some research attempts to address this by incorporating Large Language Models (LLMs). However, these approaches still face challenges in effectively integrating multimodal information and instruction understanding. To tackle the above challenges, we propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate composed representations, which significantly enhances instruction-following capability for a comprehensive integration of images and instructions. Nevertheless, directly applying MLLMs introduces a new challenge since MLLMs are primarily designed for text generation rather than embedding extraction as required in CIR. To address this, we introduce a two-stage training strategy to efficiently learn a joint multimodal embedding space and further refine the ability to follow modification instructions by tuning the model on a triplet dataset similar to the CIR format. Extensive experiments on four public datasets (FashionIQ, CIRR, GeneCIS, and CIRCO) demonstrate the superior performance of our model, outperforming state-of-the-art baselines by a significant margin. Code is available in the GitHub repository.
https://arxiv.org/abs/2412.05756