Objects in the real world rarely occur in isolation; they exhibit typical arrangements governed by their independent utility and by their expected interactions with humans and other objects in the scene. For example, a chair is expected near a table, and a computer is expected on top of it. Humans use this spatial context and relative placement as an important cue for visual recognition in case of ambiguities. Similarly to humans, DNNs exploit contextual information from data to learn representations. Our research focuses on harnessing the contextual aspects of visual data to optimize data annotation and enhance the training of deep networks. Our contributions can be summarized as follows: (1) We introduce the notion of contextual diversity for active learning (CDAL) and show its applicability in three different visual tasks: semantic segmentation, object detection, and image classification. (2) We propose a data repair algorithm to curate contextually fair data and reduce model bias, enabling the model to detect objects outside their obvious context. (3) We propose class-based annotation, where contextually relevant classes that are complementary for model training under domain shift are selected. Understanding the importance of well-curated data, we also emphasize the necessity of involving humans in the loop to achieve accurate annotations and to develop novel interaction strategies that allow humans to serve as fact-checkers. In line with this, we are working on an image retrieval system for wildlife camera-trap images and a reliable warning system for poor-quality rural roads. For large-scale annotation, we employ a strategic combination of human expertise and zero-shot models, while also integrating human input at various stages for continuous feedback.
https://arxiv.org/abs/2411.01925
The datasets used in today's research are especially vast in the medical field. Different types of medical images, such as X-rays, MRIs, and CT scans, take up large amounts of space, and this volume of data makes accessing and retrieving specific images challenging. As the database continues to grow, an efficient image retrieval system is essential to save time and resources. In this paper, we propose an approach to medical image retrieval that uses DenseNet for feature extraction and FAISS for similarity search. DenseNet is well-suited for feature extraction in complex medical images, and FAISS enables efficient handling of high-dimensional data in large-scale datasets. Unlike existing methods focused solely on classification accuracy, our method prioritizes both retrieval speed and diagnostic relevance, addressing a critical gap in real-time case comparison for radiologists. We applied our approach to the classification of breast cancer images using the BIRADS system. We utilized DenseNet's powerful feature representation and FAISS's efficient indexing capabilities to achieve high precision and recall in retrieving relevant images for diagnosis. We experimented on a dataset of 2006 images from the Categorized Digital Database for Low Energy and Subtracted Contrast Enhanced Spectral Mammography (CDD-CESM), available on The Cancer Imaging Archive (TCIA). Our method outperforms conventional retrieval techniques, achieving a precision of 80% at k=5 for BIRADS classification. The dataset includes annotated CESM images and medical reports, providing a comprehensive foundation for our research.
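A minimal sketch of how such a retrieval pipeline could be assembled, assuming DenseNet-121 pooled features and an inner-product FAISS index; the preprocessing, feature dimensionality, and file names are illustrative rather than the paper's exact configuration:

```python
import torch
import faiss
from PIL import Image
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
backbone.classifier = torch.nn.Identity()       # keep the 1024-d pooled features
backbone.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)
    feats = backbone(batch).cpu().numpy().astype("float32")
    faiss.normalize_L2(feats)                   # cosine similarity via inner product
    return feats

gallery_paths = ["case_001.png", "case_002.png"]     # hypothetical mammogram files
index = faiss.IndexFlatIP(1024)
index.add(embed(gallery_paths))

scores, ids = index.search(embed(["query.png"]), 5)  # top-5 most similar cases
print([gallery_paths[i] for i in ids[0] if i != -1])
```

For larger galleries, an approximate index such as faiss.IndexIVFFlat could replace the exact flat index without changing the rest of the pipeline.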
https://arxiv.org/abs/2411.01473
Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present Nearest Neighbor Normalization (NNN), a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training. We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
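A hedged sketch of the bias-correction idea as described in the abstract: each gallery item's raw score is adjusted using its similarity to a reference set of queries, with no training involved. The exact bias definition, the neighborhood size k, and the scaling factor alpha below are assumptions for illustration:

```python
import numpy as np

def nnn_scores(sim, ref_sim, k=16, alpha=0.75):
    """sim: (n_queries, n_gallery) raw similarities for the test queries.
    ref_sim: (n_ref, n_gallery) similarities between reference queries and the gallery."""
    # Bias of a gallery item = mean similarity to its k closest reference queries.
    topk = np.sort(ref_sim, axis=0)[-k:, :]          # (k, n_gallery)
    bias = topk.mean(axis=0, keepdims=True)          # (1, n_gallery)
    return sim - alpha * bias                        # debiased scores used for ranking

rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 100)).astype(np.float32)        # toy test-query scores
ref_sim = rng.normal(size=(500, 100)).astype(np.float32)  # toy reference-query scores
ranked = np.argsort(-nnn_scores(sim, ref_sim), axis=1)[:, :5]
print(ranked)
```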
https://arxiv.org/abs/2410.24114
Composed Image Retrieval (CIR) is a challenging vision-language task that uses bi-modal (image+text) queries to retrieve target images. Despite the impressive performance of supervised CIR, its dependence on costly, manually labeled triplets limits its scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) has been introduced, along with projection-based approaches. However, such methods face two major problems: the task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image+text $\rightarrow$ image), and the modality discrepancy. The latter affects approaches based on text-only projection training, because features must be extracted from the reference image during inference. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Subsequently, we put forward Modality-Task Dual Alignment (MoTaDual) as the second stage, in which large language models (LLMs) generate triplet data for fine-tuning; additionally, prompt learning is introduced in a multi-modal context to effectively alleviate both the modality and task discrepancies. The experimental results show that our MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost. The code will be released soon.
https://arxiv.org/abs/2410.23736
What representations do deep neural networks learn? How similar are images to each other for neural networks? Despite the overwhelming success of deep learning methods, key questions about their internal workings remain largely unanswered, due to their high dimensionality and complexity. To address this, one approach is to measure the similarity of activation responses to various inputs. Representational Similarity Matrices (RSMs) distill this similarity into scalar values for each input pair. These matrices encapsulate the entire similarity structure of a system, indicating which inputs lead to similar responses. While the similarity between images is ambiguous, we argue that the spatial location of semantic objects influences neither human perception nor deep learning classifiers. Thus, this should be reflected in the definition of similarity between image responses for computer vision systems. Revisiting the established similarity calculations for RSMs, we expose their sensitivity to spatial alignment. In this paper, we propose to solve this through semantic RSMs, which are invariant to spatial permutation. We measure semantic similarity between input responses by formulating it as a set-matching problem. Further, we quantify the superiority of semantic RSMs over spatio-semantic RSMs through image retrieval and by comparing the similarity between representations to the similarity between predicted class probabilities.
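An illustrative sketch (not necessarily the authors' exact formulation) of a spatially permutation-invariant similarity: two activation maps are compared as sets of spatial tokens via optimal matching, and the matched similarities populate a semantic RSM. Cosine similarity and mean aggregation are assumptions here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def semantic_similarity(feat_a, feat_b):
    """feat_a, feat_b: (num_locations, channels) activations for two inputs."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    cos = a @ b.T                                   # pairwise token similarities
    rows, cols = linear_sum_assignment(-cos)        # maximize total matched similarity
    return cos[rows, cols].mean()

def semantic_rsm(features):
    """features: list of (num_locations, channels) arrays, one per input image."""
    n = len(features)
    rsm = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            rsm[i, j] = rsm[j, i] = semantic_similarity(features[i], features[j])
    return rsm

rng = np.random.default_rng(0)
feats = [rng.normal(size=(49, 512)) for _ in range(3)]   # e.g. a 7x7 spatial grid per image
print(semantic_rsm(feats))
```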
https://arxiv.org/abs/2410.23107
Large Language Models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval Augmented Generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. In this paper we describe a series of experiments aimed at determining how to best integrate multimodal models into RAG systems for the industrial domain. The purpose of the experiments is to determine whether including images alongside text from documents within the industrial domain increases RAG performance and to find the optimal configuration for such a multimodal RAG system. Our experiments include two approaches for image processing and retrieval, as well as two LLMs (GPT4-Vision and LLaVA) for answer synthesis. These image processing strategies involve the use of multimodal embeddings and the generation of textual summaries from images. We evaluate our experiments with an LLM-as-a-Judge approach. Our results reveal that multimodal RAG can outperform single-modality RAG settings, although image retrieval poses a greater challenge than text retrieval. Additionally, leveraging textual summaries from images presents a more promising approach compared to the use of multimodal embeddings, providing more opportunities for future advancements.
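A schematic sketch of the "textual summaries from images" variant described above: figures are converted to text and indexed alongside document text, so a single text retriever serves both modalities. summarize_image is a hypothetical stand-in for a captioning/VLM call (e.g. GPT4-Vision or LLaVA in the paper), and the sentence-transformers embedding model is an arbitrary choice for illustration:

```python
from sentence_transformers import SentenceTransformer, util

def summarize_image(image_path: str) -> str:
    # Hypothetical: in practice, call a multimodal model to describe the figure/diagram.
    return f"Textual summary of the figure stored at {image_path}."

text_chunks = ["The pump must be serviced every 500 operating hours."]
image_chunks = [summarize_image("manual_page12_diagram.png")]      # hypothetical file
corpus = text_chunks + image_chunks

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = encoder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query = "How often should the pump be serviced?"
query_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
context = "\n".join(corpus[h["corpus_id"]] for h in hits)
print(context)   # retrieved text and image summaries would then be passed to the answer LLM
```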
https://arxiv.org/abs/2410.21943
In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the correct image in the database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept/produce interleaved image-text inputs/outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at this https URL.
https://arxiv.org/abs/2410.18715
Zero-Shot Composed Image Retrieval (ZS-CIR) supports diverse tasks with a broad range of visual content manipulation intentions that can relate to domain, scene, object, and attribute. A key challenge for ZS-CIR is to accurately map the image representation to a pseudo-word token that captures the manipulation-intention-relevant image information for generalized CIR. However, in existing methods, the mismatch between the retrieval and pre-training stages leads to significant redundancy in the pseudo-word tokens. In this paper, we propose a novel denoising image-to-word mapping approach, named Denoise-I2W, which maps images into denoised pseudo-word tokens that exclude intention-irrelevant visual information and thereby enhance accurate ZS-CIR. Specifically, a pseudo-triplet construction module first automatically constructs pseudo-triplets (\textit{i.e.,} a pseudo-reference image, a pseudo-manipulation text, and a target image) for pre-training the denoising mapping network. Then, a pseudo-composed mapping module maps the pseudo-reference image to a pseudo-word token and combines it with the pseudo-manipulation text carrying the manipulation intention. This combination is aligned with the target image, facilitating the removal of intention-irrelevant visual information during mapping. Our proposed Denoise-I2W is a model-agnostic and annotation-free approach. It demonstrates strong generalization capabilities across three state-of-the-art ZS-CIR models on four benchmark datasets. By integrating Denoise-I2W with existing best models, we obtain consistent and significant performance boosts ranging from 1.45\% to 4.17\% over the best methods without increasing inference costs, and achieve new state-of-the-art results on ZS-CIR. Our code is available at \url{this https URL}.
https://arxiv.org/abs/2410.17393
Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently use simple cosine or complex distance metrics to transform pairwise features into a similarity score, which leaves them with an inadequate or inefficient capability for distance measurement. Consequently, we propose a Generalized Structural Sparse Function to dynamically capture thorough and powerful relationships across modalities for pairwise similarity learning while remaining concise and efficient. Specifically, the distance metric delicately encapsulates two formats of diagonal and block-diagonal terms, automatically distinguishing and highlighting cross-channel relevancy and dependency inside a structured and organized topology. It thereby adapts to the optimal matching patterns between paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, fine-grained image retrieval) have validated its superiority and flexibility over various popular retrieval frameworks. More importantly, we further find that it can be seamlessly incorporated into multiple application scenarios, demonstrating promising prospects, from attention mechanisms to knowledge distillation, in a plug-and-play manner. Our code is publicly available at: this https URL.
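A toy sketch of a similarity function with diagonal and block-diagonal structure, in the spirit of the metric described above; the parameterization, block size, and dimensions are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn

class StructuredSparseSimilarity(nn.Module):
    def __init__(self, dim=512, block=32):
        super().__init__()
        assert dim % block == 0
        self.block = block
        self.diag = nn.Parameter(torch.ones(dim))                                 # per-channel term
        self.blocks = nn.Parameter(torch.eye(block).repeat(dim // block, 1, 1))   # cross-channel term

    def forward(self, x, y):
        """x, y: (batch, dim) paired features -> (batch,) similarity scores."""
        diag_term = (x * self.diag * y).sum(-1)
        xb = x.view(x.size(0), -1, self.block)          # (batch, num_blocks, block)
        yb = y.view(y.size(0), -1, self.block)
        block_term = torch.einsum("nkb,kbc,nkc->n", xb, self.blocks, yb)
        return diag_term + block_term

sim = StructuredSparseSimilarity()
x, y = torch.randn(8, 512), torch.randn(8, 512)
print(sim(x, y).shape)   # torch.Size([8])
```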
https://arxiv.org/abs/2410.15266
Digital tools for text analysis have long been essential for the searchability and accessibility of digitised library collections. Recent computer vision advances have introduced similar capabilities for visual materials, with deep learning-based embeddings showing promise for analysing visual heritage. Given that many books feature visuals in addition to text, taking advantage of these breakthroughs is critical to making library collections open and accessible. In this work, we present a proof-of-concept image search application for exploring images in the National Library of Norway's pre-1900 books, comparing Vision Transformer (ViT), Contrastive Language-Image Pre-training (CLIP), and Sigmoid loss for Language-Image Pre-training (SigLIP) embeddings for image retrieval and classification. Our results show that the application performs well for exact image retrieval, with SigLIP embeddings slightly outperforming CLIP and ViT in both retrieval and classification tasks. Additionally, SigLIP-based image classification can aid in cleaning image datasets from a digitisation pipeline.
https://arxiv.org/abs/2410.14969
As we enter the era of big data, collecting high-quality data has become very important. However, collecting data manually is both time-consuming and expensive, so many researchers have devised methods to collect data automatically. Web crawling is one such method, but the authors found that it collects unintended data alongside the data the user intends to gather. Much of this can be filtered out with the object recognition model YOLOv10, yet some improperly filtered data still remains. The authors therefore reclassify images using the distance output of a Siamese network, recording higher performance than other classification models (average F1 score: YOLO+MobileNet 0.678 -> YOLO+SiameseNet 0.772). The user can specify a distance threshold to adjust the balance between data deficiency and noise robustness. The authors also found that the Siamese network can achieve higher performance with fewer resources because the cropped images from object recognition are used as its inputs (20-class mean F1 score: non-crop + Siamese (MobileNetV3-Small) 80.94 -> crop preprocessing + Siamese (MobileNetV3-Small) 82.31). In this way, an image retrieval system that chains two models to reduce errors can save users' time and effort, and build higher-quality data faster and with fewer resources than before.
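A hedged sketch of the second-stage filter described above: a Siamese-style encoder compares each YOLO crop against class reference images, and a user-chosen distance threshold decides whether the crop is kept. Using MobileNetV3-Small as the shared encoder mirrors the abstract, but the distance metric, threshold value, and file names are illustrative assumptions:

```python
import torch
from PIL import Image
from torchvision import models, transforms

encoder = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)
encoder.classifier = torch.nn.Identity()        # keep the 576-d pooled features
encoder.eval()

tf = transforms.Compose([
    transforms.Resize((224, 224)), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path):
    return encoder(tf(Image.open(path).convert("RGB")).unsqueeze(0)).squeeze(0)

def keep_crop(crop_path, reference_paths, threshold=0.8):
    """Keep a detected crop only if it lies close enough to some class reference image."""
    crop = embed(crop_path)
    dists = [torch.dist(crop, embed(r)).item() for r in reference_paths]
    return min(dists) < threshold               # larger threshold: less data loss, more noise

# keep_crop("yolo_crop_0001.jpg", ["class_ref_1.jpg", "class_ref_2.jpg"])  # hypothetical files
```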
https://arxiv.org/abs/2410.12561
Visual localization involves estimating a query image's 6-DoF (degrees of freedom) camera pose, which is a fundamental component in various computer vision and robotic tasks. This paper presents LoGS, a vision-based localization pipeline utilizing the 3D Gaussian Splatting (GS) technique as scene representation. This novel representation allows high-quality novel view synthesis. During the mapping phase, structure-from-motion (SfM) is applied first, followed by the generation of a GS map. During localization, the initial position is obtained through image retrieval, local feature matching coupled with a PnP solver, and then a high-precision pose is achieved through the analysis-by-synthesis manner on the GS map. Experimental results on four large-scale datasets demonstrate the proposed approach's SoTA accuracy in estimating camera poses and robustness under challenging few-shot conditions.
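A minimal sketch of the coarse localization step described above, assuming 2D-3D correspondences have already been obtained from retrieved mapping images and local feature matching; the subsequent analysis-by-synthesis refinement on the GS map is not shown, and the intrinsics and point sets below are toy placeholders:

```python
import cv2
import numpy as np

def coarse_pose(points_3d, points_2d, K):
    """points_3d: (N, 3) map points matched to (N, 2) query keypoints; K: 3x3 intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64), points_2d.astype(np.float64), K, None,
        iterationsCount=1000, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("PnP failed; gather more matches or loosen the threshold")
    R, _ = cv2.Rodrigues(rvec)                  # camera rotation as a 3x3 matrix
    return R, tvec, inliers

# Toy self-check with synthetic correspondences (placeholders, not real map data).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
pts3d = np.random.rand(50, 3) * 5 + np.array([0.0, 0.0, 10.0])
pts2d, _ = cv2.projectPoints(pts3d, np.array([0.1, 0.0, 0.0]), np.array([0.2, 0.0, 0.0]), K, None)
R, t, inliers = coarse_pose(pts3d, pts2d.reshape(-1, 2), K)
print(R.shape, t.ravel())
```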
https://arxiv.org/abs/2410.11505
A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.
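A simplified illustration of the token-reweighting idea (not the authors' CLIP-internal implementation): per-token features for a caption such as "a red vintage car" are scaled by importance weights, derived from data or user preference, before being pooled into a single text embedding. The softmax weighting and sum pooling are assumptions:

```python
import torch
import torch.nn.functional as F

def reweighted_text_embedding(token_features, weights):
    """token_features: (num_tokens, dim); weights: (num_tokens,) semantic importance."""
    w = F.softmax(weights, dim=0).unsqueeze(1)          # normalize emphasis across tokens
    return F.normalize((w * token_features).sum(0), dim=0)

feats = torch.randn(4, 512)                             # stand-in for per-token encoder features
neutral = reweighted_text_embedding(feats, torch.zeros(4))
color_focused = reweighted_text_embedding(feats, torch.tensor([0.0, 2.0, 0.0, 0.0]))  # emphasize "red"
print(torch.dot(neutral, color_focused).item())         # the embedding shifts with the chosen emphasis
```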
https://arxiv.org/abs/2410.08469
Supervised Person Re-identification (Person ReID) methods have achieved excellent performance when training and testing within one camera network. However, they usually suffer from considerable performance degradation when applied to different camera systems. In recent years, many Domain Adaptation Person ReID methods have been proposed, achieving impressive performance without requiring labeled data from the target domain. However, these approaches still need the unlabeled data of the target domain during the training process, making them impractical in many real-world scenarios. Our work focuses on the more practical Domain Generalized Person Re-identification (DG-ReID) problem. Given one or more source domains, it aims to learn a generalized model that can be applied to unseen target domains. One promising research direction in DG-ReID is the use of implicit deep semantic feature expansion, and our previous method, Domain Embedding Expansion (DEX), is one such example that achieves powerful results in DG-ReID. However, in this work we show that DEX and other similar implicit deep semantic feature expansion methods, due to limitations in their proposed loss function, fail to reach their full potential on large evaluation benchmarks as they have a tendency to saturate too early. Leveraging this analysis, we propose Unified Deep Semantic Expansion, a novel framework that unifies implicit and explicit semantic feature expansion techniques to mitigate this early over-fitting and achieve a new state-of-the-art (SOTA) in all DG-ReID benchmarks. Further, we apply our method to more general image retrieval tasks, also surpassing the current SOTA in all of these benchmarks by wide margins.
https://arxiv.org/abs/2410.08456
Recent advancements in Vision-Language Models (VLMs) have enabled complex multimodal tasks by processing text and image data simultaneously, significantly enhancing the field of artificial intelligence. However, these models often exhibit biases that can skew outputs towards societal stereotypes, thus necessitating debiasing strategies. Existing debiasing methods focus narrowly on specific modalities or tasks and require extensive retraining. To address these limitations, this paper introduces Selective Feature Imputation for Debiasing (SFID), a novel methodology that integrates feature pruning and low confidence imputation (LCI) to effectively reduce biases in VLMs. SFID is versatile and cost-effective: it maintains the semantic integrity of outputs and eliminates the need for retraining. Our experimental results demonstrate SFID's effectiveness across various VLM tasks, including zero-shot classification, text-to-image retrieval, image captioning, and text-to-image generation, by significantly reducing gender biases without compromising performance. This approach not only enhances the fairness of VLM applications but also preserves their efficiency and utility across diverse scenarios.
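A hedged sketch of the recipe as described in the abstract: rank feature dimensions by how predictive they are of the protected attribute, then overwrite the most predictive ones with values imputed from low-confidence (ambiguous) samples, with no retraining. The classifier choice, number of pruned dimensions, and confidence quantile are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sfid_debias(features, attribute, n_prune=32, low_conf_quantile=0.2):
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, attribute)
    biased_dims = np.argsort(clf.feature_importances_)[-n_prune:]       # most attribute-predictive dims
    conf = clf.predict_proba(features).max(axis=1)
    ambiguous = features[conf <= np.quantile(conf, low_conf_quantile)]  # low-confidence samples
    debiased = features.copy()
    debiased[:, biased_dims] = ambiguous[:, biased_dims].mean(axis=0)   # impute, no retraining
    return debiased

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 256)).astype(np.float32)
gender = (feats[:, 10] + 0.5 * rng.normal(size=1000) > 0).astype(int)   # toy attribute-leaking dimension
print(sfid_debias(feats, gender).shape)   # (1000, 256), with the most biased dimensions neutralized
```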
https://arxiv.org/abs/2410.07593
In this work, we present MedImageInsight, an open-source medical imaging embedding model. MedImageInsight is trained on medical images with associated text and labels across a diverse collection of domains, including X-Ray, CT, MRI, dermoscopy, OCT, fundus photography, ultrasound, histopathology, and mammography. Rigorous evaluations demonstrate MedImageInsight's ability to achieve state-of-the-art (SOTA) or human expert level performance across classification, image-image search, and fine-tuning tasks. Specifically, on public datasets, MedImageInsight achieves SOTA in CT 3D medical image retrieval, as well as SOTA in disease classification and search for chest X-ray, dermatology, and OCT imaging. Furthermore, MedImageInsight achieves human expert performance in bone age estimation (on both public and partner data), as well as AUC above 0.9 in most other domains. When paired with a text decoder, MedImageInsight achieves near SOTA level single image report findings generation with less than 10\% the parameters of other models. Compared to fine-tuning GPT-4o with only MIMIC-CXR data for the same task, MedImageInsight outperforms in clinical metrics, but underperforms on lexical metrics where GPT-4o sets a new SOTA. Importantly for regulatory purposes, MedImageInsight can generate ROC curves, adjust sensitivity and specificity based on clinical need, and provide evidence-based decision support through image-image search (which can also enable retrieval augmented generation). In an independent clinical evaluation of image-image search in chest X-ray, MedImageInsight outperformed every other publicly available foundation model evaluated by large margins (over 6 points AUC), and significantly outperformed other models in terms of AI fairness (across age and gender). We hope releasing MedImageInsight will help enhance collective progress in medical imaging AI research and development.
https://arxiv.org/abs/2410.06542
Multimodal models, which combine visual and textual information, have recently gained significant recognition. This paper addresses the multimodal challenge of Text-Image retrieval and introduces a novel task that extends the modalities to include temporal data. The Temporal Image Caption Retrieval Competition (TICRC) presented in this paper is based on the Chronicling America and Challenging America projects, which offer access to an extensive collection of digitized historic American newspapers spanning 274 years. In addition to the competition results, we provide an analysis of the delivered dataset and the process of its creation.
https://arxiv.org/abs/2410.06314
We present GSLoc: a new visual localization method that performs dense camera alignment using 3D Gaussian Splatting as a map representation of the scene. GSLoc backpropagates pose gradients over the rendering pipeline to align the rendered and target images, while it adopts a coarse-to-fine strategy by utilizing blurring kernels to mitigate the non-convexity of the problem and improve the convergence. The results show that our approach succeeds at visual localization in challenging conditions of relatively small overlap between initial and target frames inside textureless environments when state-of-the-art neural sparse methods provide inferior results. Using the byproduct of realistic rendering from the 3DGS map representation, we show how to enhance localization results by mixing a set of observed and virtual reference keyframes when solving the image retrieval problem. We evaluate our method both on synthetic and real-world data, discussing its advantages and application potential.
https://arxiv.org/abs/2410.06165
Vision-Language Models (VLMs) have transformed tasks requiring visual and reasoning abilities, such as image retrieval and Visual Question Answering (VQA). Despite their success, VLMs face significant challenges with tasks involving geometric reasoning, algebraic problem-solving, and counting. These limitations stem from difficulties in effectively integrating multiple modalities and in accurately interpreting geometry-related tasks. Various works claim that introducing a captioning pipeline before VQA tasks enhances performance. We incorporated this pipeline for tasks involving geometry, algebra, and counting. We found that captioning results do not generalize; in particular, larger VLMs primarily trained on downstream QA tasks show random performance on math-related challenges. However, we present a promising alternative: task-based prompting, which enriches the prompt with task-specific guidance. This approach shows promise and proves more effective than direct captioning methods for math-heavy problems.
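An illustrative sketch of task-based prompting as contrasted with a caption-then-answer pipeline: the prompt is enriched with task-specific guidance before the VLM is queried. The guidance strings and the query_vlm helper are hypothetical, not the paper's exact prompts:

```python
TASK_GUIDANCE = {
    "counting": "Count the objects one by one and state only the final number.",
    "geometry": "Identify the shapes involved, list the known angles and lengths, then reason step by step.",
    "algebra": "Write the equation implied by the figure, solve it symbolically, then report the value.",
}

def build_prompt(question: str, task: str) -> str:
    """Enrich the raw question with task-specific guidance before querying the VLM."""
    return f"{TASK_GUIDANCE[task]}\n\nQuestion: {question}"

def query_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for whichever VLM is being evaluated")

prompt = build_prompt("How many triangles are in the figure?", "counting")
print(prompt)
# answer = query_vlm("figure.png", prompt)   # hypothetical call
```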
https://arxiv.org/abs/2410.05928
Understanding long text is in great demand in practice but beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key reason for this issue is that the training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. Towards this problem, our initial attempt is to relabel the data with long captions; however, directly learning with these may lead to performance degradation in understanding short text (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we manage to help the model catch up to its original level of short-text understanding while greatly enhancing its capability for long-text understanding. We further look into whether the model can continuously benefit from longer captions and notice a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset consisting of 100M long-caption-oriented text-image pairs. Notably, on the task of long-text image retrieval, we beat the competitor using long captions by 11.1% (i.e., from 72.62% to 83.72%). We will release the code, the model, and the new dataset to facilitate reproducibility and further research. The project page is available at this https URL.
https://arxiv.org/abs/2410.05249