This work addresses composed image retrieval in the context of domain conversion, where the content of a query image is retrieved in the domain specified by the query text. We show that a strong vision-language model provides sufficient descriptive power without additional training. The query image is mapped to the text input space using textual inversion. Unlike the common practice of inverting into the continuous space of text tokens, we use the discrete word space via a nearest-neighbor search in a text vocabulary. With this inversion, the image is softly mapped across the vocabulary and made more robust through retrieval-based augmentation. Database images are retrieved by a weighted ensemble of text queries combining the mapped words with the domain text. Our method outperforms prior art by a large margin on standard and newly introduced benchmarks. Code: this https URL
https://arxiv.org/abs/2412.03297
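A minimal sketch of the nearest-neighbor textual inversion and weighted query ensemble described above, assuming CLIP-style unit-normalized embeddings; the random vectors stand in for encoder outputs, and the temperature, top-k, and prompt handling are illustrative choices rather than the paper's exact settings:

```python
# Minimal sketch of nearest-neighbor textual inversion with a weighted
# ensemble of text queries (assumed details; not the authors' exact code).
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Placeholders for CLIP-style embeddings: in practice these come from the
# image encoder (query) and text encoder (vocabulary words, composed queries).
d, vocab_size, n_db = 512, 20000, 1000
img_feat   = l2norm(rng.normal(size=d))                # query image embedding
vocab_feat = l2norm(rng.normal(size=(vocab_size, d)))  # word embeddings
db_feat    = l2norm(rng.normal(size=(n_db, d)))        # database image embeddings

# 1) Invert the image into the discrete word space: nearest vocabulary words.
sims = vocab_feat @ img_feat
top_k = 10
idx = np.argsort(-sims)[:top_k]
weights = np.exp(sims[idx] / 0.01)                     # soft mapping over words
weights /= weights.sum()

# 2) Build one text query per word, combined with the target domain text,
#    e.g. "a sculpture of a {word}", and embed each (placeholder here).
query_feats = l2norm(rng.normal(size=(top_k, d)))      # stand-in for text encoder

# 3) Weighted ensemble of text queries, then retrieve by cosine similarity.
ensemble = l2norm((weights[:, None] * query_feats).sum(axis=0))
ranking = np.argsort(-(db_feat @ ensemble))
print("top-5 database images:", ranking[:5])
```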
Active Learning (AL) is a user-interactive approach aimed at reducing annotation costs by selecting the most crucial examples to label. Although AL has been extensively studied for image classification tasks, the specific scenario of interactive image retrieval has received relatively little attention. This scenario presents unique characteristics, including an open-set and class-imbalanced binary classification, starting with very few labeled samples. We introduce a novel batch-mode Active Learning framework named GAL (Greedy Active Learning) that better copes with this application. It incorporates a new acquisition function for sample selection that measures the impact of each unlabeled sample on the classifier. We further embed this strategy in a greedy selection approach, better exploiting the samples within each batch. We evaluate our framework with both linear (SVM) and non-linear MLP/Gaussian Process classifiers. For the Gaussian Process case, we show a theoretical guarantee on the greedy approximation. Finally, we assess our performance for the interactive content-based image retrieval task on several benchmarks and demonstrate its superiority over existing approaches and common baselines. Code is available at this https URL.
https://arxiv.org/abs/2412.02310
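A hedged sketch of greedy batch selection with an impact-style acquisition function, in the spirit of GAL; the SVM refitting strategy and pseudo-labeling step are illustrative assumptions rather than the paper's exact acquisition function:

```python
# Greedy batch-mode selection: score each candidate by how much adding it
# changes the classifier, then pick greedily within the batch.
import numpy as np
from sklearn.svm import SVC

def classifier_impact(X_lab, y_lab, X_unlab, candidate):
    """Score a candidate by how much adding it (with its predicted label)
    shifts the classifier's decision values on the unlabeled pool."""
    base = SVC(kernel="linear").fit(X_lab, y_lab)
    before = base.decision_function(X_unlab)
    x = X_unlab[candidate:candidate + 1]
    y_hat = base.predict(x)
    aug = SVC(kernel="linear").fit(np.vstack([X_lab, x]), np.append(y_lab, y_hat))
    after = aug.decision_function(X_unlab)
    return np.abs(after - before).mean()

def greedy_batch(X_lab, y_lab, X_unlab, batch_size=5):
    selected = []
    for _ in range(batch_size):
        pool = [i for i in range(len(X_unlab)) if i not in selected]
        scores = {i: classifier_impact(X_lab, y_lab, X_unlab, i) for i in pool}
        best = max(scores, key=scores.get)
        selected.append(best)
        # Greedy step: treat the chosen sample as (pseudo-)labeled so the next
        # pick accounts for the samples already in the batch.
        x = X_unlab[best:best + 1]
        y_hat = SVC(kernel="linear").fit(X_lab, y_lab).predict(x)
        X_lab = np.vstack([X_lab, x]); y_lab = np.append(y_lab, y_hat)
    return selected

rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(8, 16)), np.array([0, 1] * 4)
X_unlab = rng.normal(size=(50, 16))
print(greedy_batch(X_lab, y_lab, X_unlab))
```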
Approximate Nearest Neighbor (ANN) search is one of the keys to large-scale data retrieval performance in many applications. This work bridges feature extraction and ANN indexing by pairing a fine-tuned ResNet50 model with two ANN methods: FAISS and Annoy. We evaluate the systems with respect to indexing time, memory usage, query time, precision, recall, F1-score, and Recall@5 on a custom image dataset. FAISS's Product Quantization achieves a precision of 98.40% with low memory usage at a 0.24 MB index size, while Annoy is the fastest, with average query times of 0.00015 seconds, at a slight cost to accuracy. These results reveal trade-offs among speed, accuracy, and memory efficiency and offer actionable insights into the optimization of feature-based image retrieval systems. This study serves as a blueprint for constructing practical retrieval pipelines built on fine-tuned deep learning networks and associated ANN methods.
https://arxiv.org/abs/2412.01555
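The indexing side of such a pipeline can be sketched directly with the public FAISS and Annoy APIs; the feature matrix below is a random stand-in for fine-tuned ResNet50 descriptors, and the PQ/tree parameters are illustrative:

```python
# Hedged sketch of indexing ResNet50-style features with FAISS (Product
# Quantization) and Annoy; dimensions and parameters are illustrative.
import numpy as np
import faiss
from annoy import AnnoyIndex

d, n = 2048, 10000                      # feature dim (ResNet50 pool) and DB size
feats = np.random.rand(n, d).astype("float32")   # stand-in for extracted features

# FAISS Product Quantization index: compact codes, approximate distances.
pq = faiss.IndexPQ(d, 16, 8)            # 16 sub-quantizers, 8 bits each
pq.train(feats)
pq.add(feats)
D, I = pq.search(feats[:1], 5)          # query with the first vector
print("FAISS PQ top-5 ids:", I[0])

# Annoy index: random projection trees, very fast queries.
ann = AnnoyIndex(d, "euclidean")
for i, v in enumerate(feats):
    ann.add_item(i, v)
ann.build(10)                           # number of trees
print("Annoy top-5 ids:", ann.get_nns_by_vector(feats[0], 5))
```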
In this paper, we present a Neuron Abandoning Attention Flow (NAFlow) method to address the open problem of visually explaining the attention evolution dynamics inside CNNs when making their classification decisions. A novel cascading neuron abandoning back-propagation algorithm is designed to trace neurons in all layers of a CNN that involve in making its prediction to address the problem of significant interference from abandoned neurons. Firstly, a Neuron Abandoning Back-Propagation (NA-BP) module is proposed to generate Back-Propagated Feature Maps (BPFM) by using the inverse function of the intermediate layers of CNN models, on which the neurons not used for decision-making are abandoned. Meanwhile, the cascading NA-BP modules calculate the tensors of importance coefficients which are linearly combined with the tensors of BPFMs to form the NAFlow. Secondly, to be able to visualize attention flow for similarity metric-based CNN models, a new channel contribution weights module is proposed to calculate the importance coefficients via Jacobian Matrix. The effectiveness of the proposed NAFlow is validated on nine widely-used CNN models for various tasks of general image classification, contrastive learning classification, few-shot image classification, and image retrieval.
https://arxiv.org/abs/2412.01202
Text-to-image retrieval is a critical task for managing diverse visual content, but common benchmarks for the task rely on small, single-domain datasets that fail to capture real-world complexity. Pre-trained vision-language models tend to perform well with easy negatives but struggle with hard negatives--visually similar yet incorrect images--especially in open-domain scenarios. To address this, we introduce Episodic Few-Shot Adaptation (EFSA), a novel test-time framework that adapts pre-trained models dynamically to a query's domain by fine-tuning on top-k retrieved candidates and synthetic captions generated for them. EFSA improves performance across diverse domains while preserving generalization, as shown in evaluations on queries from eight highly distinct visual domains and an open-domain retrieval pool of over one million images. Our work highlights the potential of episodic few-shot adaptation to enhance robustness in the critical and understudied task of open-domain text-to-image retrieval.
https://arxiv.org/abs/2412.00139
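A rough, self-contained sketch of episodic test-time adaptation in the spirit of EFSA: retrieve top-k candidates, pair them with synthetic captions, fine-tune a small residual adapter contrastively, and re-rank. The frozen encoders and captioner are replaced by random placeholder embeddings, and the adapter design is an assumption:

```python
import torch, torch.nn as nn, torch.nn.functional as F

d, n_db, k = 256, 500, 16
torch.manual_seed(0)
db_img = F.normalize(torch.randn(n_db, d), dim=-1)   # frozen image embeddings
query_txt = F.normalize(torch.randn(d), dim=-1)      # frozen text-query embedding

# 1) Initial retrieval: top-k candidates for this query.
topk = (db_img @ query_txt).topk(k).indices

# 2) Synthetic captions for the candidates, embedded by the frozen text
#    encoder; random stand-ins for captioner + encoder outputs here.
cap_txt = F.normalize(torch.randn(k, d), dim=-1)

# 3) Episodic fine-tuning of a tiny residual adapter on (caption, image) pairs.
adapter = nn.Linear(d, d)
nn.init.zeros_(adapter.weight); nn.init.zeros_(adapter.bias)
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
for _ in range(20):
    t = F.normalize(cap_txt + adapter(cap_txt), dim=-1)
    logits = 100.0 * t @ db_img[topk].T               # contrastive over the episode
    loss = F.cross_entropy(logits, torch.arange(k))
    opt.zero_grad(); loss.backward(); opt.step()

# 4) Re-rank the top-k with the adapted query representation.
q = F.normalize(query_txt + adapter(query_txt), dim=-1)
rerank = topk[(db_img[topk] @ q).argsort(descending=True)]
print(rerank[:5].tolist())
```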
Sketch-based image retrieval (SBIR) relies on free-hand sketches to retrieve natural photos within the same class. However, its practical application is limited by its inability to retrieve classes absent from the training set. To address this limitation, the task has evolved into Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR), where model performance is evaluated on unseen categories. Traditional SBIR primarily focuses on narrowing the domain gap between the photo and sketch modalities. However, in the zero-shot setting, the model not only needs to address this cross-modal discrepancy but also requires a strong generalization capability to transfer knowledge to unseen categories. To this end, we propose a novel framework for ZS-SBIR that employs a pair-based relation-aware quadruplet loss to bridge feature gaps. By incorporating two negative samples from different modalities, the approach prevents positive features from becoming disproportionately distant from one modality while remaining close to another, thus enhancing inter-class separability. We also propose a Relation-Aware Meta-Learning Network (RAMLN) to obtain the margin, a hyper-parameter of the cross-modal quadruplet loss, to improve the generalization ability of the model. RAMLN leverages external memory to store feature information, which it utilizes to assign optimal margin values. Experimental results obtained on the extended Sketchy and TU-Berlin datasets show a sharp improvement over existing state-of-the-art methods in ZS-SBIR.
https://arxiv.org/abs/2412.00120
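A minimal sketch of a quadruplet loss with negatives drawn from both modalities, as described above; the fixed margin stands in for the value that RAMLN would predict:

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, neg_photo, neg_sketch, margin=0.3):
    """anchor: sketch feature; positive: photo of the same class;
    neg_photo / neg_sketch: negatives from the photo and sketch modalities."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an_photo = F.pairwise_distance(anchor, neg_photo)
    d_an_sketch = F.pairwise_distance(anchor, neg_sketch)
    # Push the positive closer than negatives from *both* modalities so that
    # positives do not drift far from one modality while hugging the other.
    loss = (F.relu(d_ap - d_an_photo + margin)
            + F.relu(d_ap - d_an_sketch + margin))
    return loss.mean()

a, p, np_, ns = (torch.randn(8, 128) for _ in range(4))
print(quadruplet_loss(a, p, np_, ns).item())
```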
Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
https://arxiv.org/abs/2411.17490
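An illustrative sketch of contrastive hierarchy learning in the Poincaré ball; the norm-based ordering term below is a simplified surrogate for the paper's pairwise entailment metric, not its exact formulation:

```python
# Children should sit close to their parents but farther from the origin;
# parents sit closer to the origin. A simplified entailment surrogate.
import torch

def rand_ball(n, d, r=0.5):
    x = torch.randn(n, d)
    return r * x / (1 + x.norm(dim=-1, keepdim=True))   # points inside the ball

def poincare_dist(u, v, eps=1e-6):
    uu = (u * u).sum(-1).clamp(max=1 - eps)
    vv = (v * v).sum(-1).clamp(max=1 - eps)
    uv = ((u - v) ** 2).sum(-1)
    x = 1 + 2 * uv / ((1 - uu) * (1 - vv))
    return torch.acosh(x.clamp(min=1 + eps))

def hierarchy_loss(child, parent, negative, margin=0.5):
    pos = poincare_dist(child, parent)
    neg = poincare_dist(child, negative)
    contrastive = torch.relu(pos - neg + margin)          # pull parent, push negative
    order = torch.relu(parent.norm(dim=-1) - child.norm(dim=-1))  # ordering term
    return (contrastive + order).mean()

c, p, n = rand_ball(16, 32), rand_ball(16, 32), rand_ball(16, 32)
print(hierarchy_loss(c, p, n).item())
```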
Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. Unlike quality issues, diversity limitations in generative models are hard to detect visually, requiring specific metrics for assessment. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure it. We do so by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model's output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art model surpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves the diversity of unconditional diffusion models without loss of image quality. We achieve this by disentangling diversity from image quality using a diversity-aware module that takes pseudo-unconditional features as input. We provide a Python package offering unified feature extraction and metric computation to further facilitate the evaluation of generative models: this https URL.
https://arxiv.org/abs/2411.16171
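A minimal sketch of the coverage idea underlying IRS: use synthetic samples as queries against real features and measure how much of the real set is ever retrieved. The actual IRS statistic and its confidence estimate are more involved than this:

```python
import numpy as np

def retrieval_coverage(real_feats, synth_feats, k=1):
    real = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
    synth = synth_feats / np.linalg.norm(synth_feats, axis=1, keepdims=True)
    sims = synth @ real.T                       # (n_synth, n_real)
    topk = np.argsort(-sims, axis=1)[:, :k]     # retrieved real images per query
    covered = np.unique(topk)
    return len(covered) / len(real)             # fraction of real data reached

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 256))
# A low-diversity generator re-samples a narrow region of feature space:
synth = rng.normal(size=(2000, 256)) * 0.05 + real[:50].mean(axis=0)
print(f"coverage: {retrieval_coverage(real, synth):.3f}")
```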
Zero-shot Composed Image Retrieval (ZSCIR) requires retrieving images that match a query image together with its relative caption. Current methods focus on projecting the query image into the text feature space and then combining it with the query text features for retrieval. However, retrieving images with text features alone cannot guarantee detailed alignment due to the natural gap between images and text. In this paper, we introduce Imagined Proxy for CIR (IP-CIR), a training-free method that creates a proxy image aligned with the query image and text description, enhancing the query representation in the retrieval process. We first leverage the large language model's generalization capability to generate an image layout, and then apply both the query text and image for conditional generation. The robust query features are enhanced by merging the proxy image, query image, and text semantic perturbation. Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image while incorporating image-side information into the process. Experiments on three public datasets demonstrate that our method significantly improves retrieval performance. We achieve state-of-the-art (SOTA) results on the CIRR dataset with a Recall@10 of 70.07, improve Recall@10 on the FashionIQ dataset from 45.11 to 45.74, and raise the baseline mAP@10 on CIRCO from 32.24 to 34.26.
https://arxiv.org/abs/2411.16752
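The final scoring step can be sketched as a blend of text-based and proxy-image similarities; the convex weighting below is a simplification of IP-CIR's balancing metric:

```python
import numpy as np

def combined_scores(db_feats, text_query_feat, proxy_image_feat, alpha=0.5):
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    t = text_query_feat / np.linalg.norm(text_query_feat)
    p = proxy_image_feat / np.linalg.norm(proxy_image_feat)
    # Text similarity carries the instruction, proxy similarity carries
    # image-side detail; blend the two for the final ranking.
    return alpha * (db @ t) + (1 - alpha) * (db @ p)

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 512))
text_q, proxy_q = rng.normal(size=512), rng.normal(size=512)
ranking = np.argsort(-combined_scores(db, text_q, proxy_q))
print(ranking[:10])
```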
Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, synthetic frameworks are usually designed with meticulous human effort for each task due to varying requirements on image layout, content, and annotation formats, restricting the application of synthetic data to more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and the layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality, controllable synthetic images based on the generated layouts. In addition, user-specific reference images and style images can be incorporated into the generation to meet task requirements. Finally, the Task-Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The specific data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.
https://arxiv.org/abs/2411.16749
Remote sensing cross-modal text-image retrieval (RSCTIR) has gained attention for its utility in information mining. However, challenges remain in effectively integrating global and local information due to variations in remote sensing imagery and ensuring proper feature pre-alignment before modal fusion, which affects retrieval accuracy and efficiency. To address these issues, we propose CMPAGL, a cross-modal pre-aligned method leveraging global and local information. Our Gswin transformer block combines local window self-attention and global-local window cross-attention to capture multi-scale features. A pre-alignment mechanism simplifies modal fusion training, improving retrieval performance. Additionally, we introduce a similarity matrix reweighting (SMR) algorithm for reranking, and enhance the triplet loss function with an intra-class distance term to optimize feature learning. Experiments on four datasets, including RSICD and RSITMD, validate CMPAGL's effectiveness, achieving up to 4.65% improvement in R@1 and 2.28% in mean Recall (mR) over state-of-the-art methods.
https://arxiv.org/abs/2411.14704
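A short sketch of the enhanced triplet objective described above, with an added intra-class distance term; the weighting coefficient is an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def triplet_with_intra_class(anchor, positive, negative,
                             margin=0.2, lambda_intra=0.1):
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    triplet = F.relu(d_ap - d_an + margin)        # standard ranking term
    intra = d_ap                                  # pull same-class pairs tighter
    return (triplet + lambda_intra * intra).mean()

a, p, n = (torch.randn(32, 256) for _ in range(3))
print(triplet_with_intra_class(a, p, n).item())
```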
Hard negative generation aims to generate informative negative samples that help to determine the decision boundaries and thus facilitate advancing deep metric learning. Current works select pair/triplet samples, learn their correlations, and fuse them to generate hard negatives. However, these works merely consider the local correlations of selected samples, ignoring global sample correlations that would provide more significant information to generate more informative negatives. In this work, we propose a Globally Correlation-Aware Hard Negative Generation (GCA-HNG) framework, which first learns sample correlations from a global perspective and exploits these correlations to guide generating hardness-adaptive and diverse negatives. Specifically, this approach begins by constructing a structured graph to model sample correlations, where each node represents a specific sample and each edge represents the correlations between corresponding samples. Then, we introduce an iterative graph message propagation to propagate the messages of node and edge through the whole graph and thus learn the sample correlations globally. Finally, with the guidance of the learned global correlations, we propose a channel-adaptive manner to combine an anchor and multiple negatives for HNG. Compared to current methods, GCA-HNG allows perceiving sample correlations with numerous negatives from a global and comprehensive perspective and generates the negatives with better hardness and diversity. Extensive experiment results demonstrate that the proposed GCA-HNG is superior to related methods on four image retrieval benchmark datasets. Codes and trained models are available at \url{this https URL}.
https://arxiv.org/abs/2411.13145
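A simplified sketch of iterative message propagation over a fully connected sample graph, in the spirit of GCA-HNG; the MLP update functions and mean aggregation are illustrative and do not reproduce the paper's architecture:

```python
import torch
import torch.nn as nn

class GraphPropagation(nn.Module):
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

    def forward(self, nodes):                     # nodes: (N, dim) sample features
        N, d = nodes.shape
        edges = torch.zeros(N, N, d)
        for _ in range(self.steps):
            # Edge update from the two endpoint nodes it connects.
            pair = torch.cat([nodes[:, None].expand(N, N, d),
                              nodes[None, :].expand(N, N, d)], dim=-1)
            edges = self.edge_mlp(pair) + edges
            # Node update from aggregated incoming edge messages.
            msg = edges.mean(dim=1)
            nodes = self.node_mlp(torch.cat([nodes, msg], dim=-1)) + nodes
        return nodes, edges                       # globally informed correlations

feats = torch.randn(8, 128)                       # a batch of sample embeddings
nodes, edges = GraphPropagation(128)(feats)
print(nodes.shape, edges.shape)
```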
With the widespread adoption of digital devices equipped with cameras and the rapid development of Internet technology, numerous content-based image retrieval systems and novel image feature extraction techniques have emerged in recent years. This paper introduces a saliency map-based image retrieval approach using invariant Krawtchouk moments (SM-IKM) to enhance retrieval speed and accuracy. The proposed method applies a global contrast-based salient region detection algorithm to create a saliency map that effectively isolates the foreground from the background. It then combines multiple orders of invariant Krawtchouk moments (IKM) with local binary patterns (LBPs) and color histograms to comprehensively represent the foreground and background. Additionally, it incorporates LBPs derived from the saliency map to improve discriminative power, facilitating more precise image differentiation. A bag-of-visual-words (BoVW) model is employed to generate a codebook for classification and discrimination. By using compact IKMs in the BoVW framework and integrating a range of region-based features, including color histograms, LBPs, and saliency map-enhanced LBPs, our proposed SM-IKM achieves efficient and accurate image retrieval. Extensive experiments on publicly available datasets, such as Caltech 101 and Wang, demonstrate that SM-IKM outperforms recent state-of-the-art retrieval methods. The source code for SM-IKM is available at this http URL.
https://arxiv.org/abs/2411.08567
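The bag-of-visual-words step can be sketched with a k-means codebook over local descriptors; the random descriptors below stand in for the IKM/LBP/color features the paper combines:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_words=64):
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)

def bovw_histogram(codebook, image_descriptors):
    words = codebook.predict(image_descriptors)
    hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
    return hist / max(hist.sum(), 1)              # L1-normalized image signature

rng = np.random.default_rng(0)
all_desc = rng.normal(size=(5000, 32))            # pooled training descriptors
codebook = build_codebook(all_desc)
img_desc = rng.normal(size=(120, 32))             # descriptors of one image
print(bovw_histogram(codebook, img_desc)[:8])
```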
Remote sensing has evolved from simple image acquisition to complex systems capable of integrating and processing visual and textual data. This review examines the development and application of multi-modal language models (MLLMs) in remote sensing, focusing on their ability to interpret and describe satellite imagery using natural language. We cover the technical underpinnings of MLLMs, including dual-encoder architectures, Transformer models, self-supervised and contrastive learning, and cross-modal integration. The unique challenges of remote sensing data--varying spatial resolutions, spectral richness, and temporal changes--are analyzed for their impact on MLLM performance. Key applications such as scene description, object detection, change detection, text-to-image retrieval, image-to-text generation, and visual question answering are discussed to demonstrate their relevance in environmental monitoring, urban planning, and disaster response. We review significant datasets and resources supporting the training and evaluation of these models. Challenges related to computational demands, scalability, data quality, and domain adaptation are highlighted. We conclude by proposing future research directions and technological advancements to further enhance MLLM utility in remote sensing.
https://arxiv.org/abs/2411.05826
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating ``hard'' negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark on an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Our code, models, and data are available at: this https URL
https://arxiv.org/abs/2411.02545
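A rough sketch of CLIP-style contrastive training with one hard negative caption per image (TripletCLIP alternates with hard negative images on other steps); encoders are omitted and the embeddings below are placeholders:

```python
import torch
import torch.nn.functional as F

def clip_loss_with_hard_neg_text(img, txt, hard_txt, temperature=0.07):
    img, txt, hard_txt = (F.normalize(x, dim=-1) for x in (img, txt, hard_txt))
    # image->text logits include one hard negative caption per image.
    logits_i2t = torch.cat([img @ txt.T,
                            (img * hard_txt).sum(-1, keepdim=True)],
                           dim=1) / temperature
    logits_t2i = (txt @ img.T) / temperature
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits_i2t, targets)
                  + F.cross_entropy(logits_t2i, targets))

B, d = 16, 512
img, txt, hard_txt = (torch.randn(B, d) for _ in range(3))
print(clip_loss_with_hard_neg_text(img, txt, hard_txt).item())
```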
We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically-motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at this https URL
https://arxiv.org/abs/2411.02537
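A minimal sketch of mAP@K evaluation of the kind used for the full-ranking task; exact protocol details for INQUIRE may differ:

```python
import numpy as np

def average_precision_at_k(ranked_ids, relevant_ids, k=50):
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank                 # precision at this rank
    return score / min(len(relevant), k) if relevant else 0.0

def mean_ap_at_k(all_rankings, all_relevant, k=50):
    return np.mean([average_precision_at_k(r, rel, k)
                    for r, rel in zip(all_rankings, all_relevant)])

# Toy example: two queries over a database of integer image ids.
rankings = [[3, 7, 1, 9, 4], [2, 5, 8, 0, 6]]
relevant = [[7, 9], [5, 1]]
print(mean_ap_at_k(rankings, relevant, k=5))
```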
Objects in the real world rarely occur in isolation and exhibit typical arrangements governed by their independent utility and their expected interactions with humans and other objects in the context. For example, a chair is expected near a table, and a computer is expected on top of it. Humans use this spatial context and relative placement as an important cue for visual recognition in case of ambiguities. Similar to humans, DNNs exploit contextual information from data to learn representations. Our research focuses on harnessing the contextual aspects of visual data to optimize data annotation and enhance the training of deep networks. Our contributions can be summarized as follows: (1) we introduce the notion of Contextual Diversity for Active Learning (CDAL) and show its applicability in three different visual tasks: semantic segmentation, object detection, and image classification; (2) we propose a data repair algorithm to curate contextually fair data to reduce model bias, enabling the model to detect objects outside their obvious context; (3) we propose class-based annotation, where contextually relevant classes that are complementary for model training under domain shift are selected. Understanding the importance of well-curated data, we also emphasize the necessity of involving humans in the loop to achieve accurate annotations and to develop novel interaction strategies that allow humans to serve as fact-checkers. In line with this, we are working on developing an image retrieval system for wildlife camera-trap images and a reliable warning system for poor-quality rural roads. For large-scale annotation, we employ a strategic combination of human expertise and zero-shot models, while also integrating human input at various stages for continuous feedback.
https://arxiv.org/abs/2411.01925
The datasets used in today's research are especially vast in the medical field. Different types of medical images, such as X-rays, MRI, and CT scans, take up large amounts of space. This volume of data introduces challenges such as accessing and retrieving specific images, given the size of the database. An efficient image retrieval system is essential as the database continues to grow, to save time and resources. In this paper, we propose an approach to medical image retrieval using DenseNet for feature extraction and FAISS for similarity search. DenseNet is well-suited for feature extraction in complex medical images, and FAISS enables efficient handling of high-dimensional data in large-scale datasets. Unlike existing methods focused solely on classification accuracy, our method prioritizes both retrieval speed and diagnostic relevance, addressing a critical gap in real-time case comparison for radiologists. We applied our approach to the classification of breast cancer images using the BIRADS system. We utilized DenseNet's powerful feature representation and FAISS's efficient indexing capabilities to achieve high precision and recall in retrieving relevant images for diagnosis. We experimented on a dataset of 2,006 images from the Categorized Digital Database for Low Energy and Subtracted Contrast Enhanced Spectral Mammography (CDD-CESM) available on The Cancer Imaging Archive (TCIA). Our method outperforms conventional retrieval techniques, achieving a precision of 80% at k=5 for BIRADS classification. The dataset includes annotated CESM images and medical reports, providing a comprehensive foundation for our research.
https://arxiv.org/abs/2411.01473
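A hedged sketch of the DenseNet-plus-FAISS pipeline: pooled DenseNet121 features indexed with a flat inner-product index. The inputs are random stand-ins, and the paper's preprocessing and fine-tuning are not reproduced:

```python
import torch, torch.nn.functional as F
import faiss
from torchvision import models

model = models.densenet121(weights="DEFAULT").eval()   # pretrained backbone

@torch.no_grad()
def densenet_feature(x):                     # x: (B, 3, 224, 224), normalized
    f = model.features(x)
    f = F.adaptive_avg_pool2d(F.relu(f), 1).flatten(1)
    return F.normalize(f, dim=-1).numpy().astype("float32")   # (B, 1024)

index = faiss.IndexFlatIP(1024)              # cosine similarity on unit vectors
db = densenet_feature(torch.randn(8, 3, 224, 224))     # stand-in gallery images
index.add(db)
query = densenet_feature(torch.randn(1, 3, 224, 224))  # stand-in query image
D, I = index.search(query, 5)
print("top-5 similar cases:", I[0])
```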
Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
https://arxiv.org/abs/2410.24114
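A sketch of a nearest-neighbor score correction in the spirit of NNN: each gallery item's score is debiased by the mean of its top-k similarities to a reference query set. The bias weighting below approximates, but does not reproduce, the paper's exact recipe:

```python
import numpy as np

def nnn_scores(query_feats, gallery_feats, ref_query_feats, k=16, alpha=0.5):
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, g, r = norm(query_feats), norm(gallery_feats), norm(ref_query_feats)
    raw = q @ g.T                                   # (n_query, n_gallery)
    ref = r @ g.T                                   # (n_ref, n_gallery)
    topk = np.sort(ref, axis=0)[-k:]                # per-gallery top-k ref sims
    bias = topk.mean(axis=0)                        # per-gallery hubness bias
    return raw - alpha * bias                       # corrected scores, no training

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 512))
g = rng.normal(size=(1000, 512))
ref = rng.normal(size=(256, 512))
print(np.argsort(-nnn_scores(q, g, ref), axis=1)[:, :5])
```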
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images. Despite the impressive performance of supervised CIR, the dependence on costly, manually-labeled triplets limits its scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) is presented along with projection-based approaches. However, such methods face two major problems, i.e., task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image+text $\rightarrow$ image), and modality discrepancy. The latter pertains to approaches based on text-only projection training due to the necessity of feature extraction from the reference image during inference. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Subsequently, we put forward Modality-Task Dual Alignment (MoTaDual) as the second stage, where large-language models (LLMs) generate triplet data for fine-tuning, and additionally, prompt learning is introduced in a multi-modal context to effectively alleviate both modality and task discrepancies. The experimental results show that our MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost. The code will be released soon.
https://arxiv.org/abs/2410.23736