Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is especially true when attempting to predict 3D information from 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an \emph{Inverse Rendering (IR)} problem: we optimize, via a differentiable rendering pipeline, over the latent space of pre-trained 3D object representations and retrieve the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternate take on tracking, our method enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and evaluating camera-based 3D tracking on the nuScenes and Waymo datasets, both of which are completely unseen by our method and require no fine-tuning. Videos and code are available at this https URL.
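To make the optimization concrete, here is a minimal sketch of the test-time inverse-rendering loop the abstract describes, assuming a differentiable `decode_and_render` callable that stands in for the pre-trained generative object model and rendering pipeline (both placeholders, not the authors' code):

```python
import torch

def fit_object_latents(image_crop, decode_and_render, z_dim=256, steps=200, lr=5e-2):
    """Optimize disentangled shape/appearance latents and a pose so the
    rendered object matches the observed RGB crop (a sketch, not the paper's code)."""
    z_shape = torch.zeros(z_dim, requires_grad=True)
    z_app = torch.zeros(z_dim, requires_grad=True)
    pose = torch.zeros(6, requires_grad=True)  # translation + rotation (axis-angle)
    opt = torch.optim.Adam([z_shape, z_app, pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendering = decode_and_render(z_shape, z_app, pose)  # differentiable render
        loss = torch.nn.functional.l1_loss(rendering, image_crop)  # image loss
        loss.backward()
        opt.step()
    return z_shape.detach(), z_app.detach(), pose.detach()
```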
https://arxiv.org/abs/2404.12359
Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. However, despite this promise, conventional SSL methods encounter limitations, including challenges in achieving semantic alignment and capturing subtle details. This leads to suboptimal representations that fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce OPTiML, a novel SSL framework employing optimal transport (OT) to capture dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL in medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes variance and covariance regularizations within the OT framework to force the model to focus on clinically relevant information while discarding less informative features. Through these components, the proposed framework demonstrates its capacity to learn semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available chest X-ray datasets. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.
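As a rough illustration of the two ingredients named above, the sketch below pairs an entropic-OT (Sinkhorn) alignment loss over token features from two views with VICReg-style variance/covariance terms; the exact regularizer form is an assumption, not OPTiML's published formulation:

```python
import torch

def sinkhorn(cost, eps=0.05, iters=50):
    """Entropic-OT transport plan with uniform marginals (standard Sinkhorn loop)."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    a, b = torch.full((n,), 1 / n), torch.full((m,), 1 / m)
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return torch.diag(u) @ K @ torch.diag(v)

def ot_alignment_loss(f1, f2):
    """Dense semantic alignment: transport token features of view 1 onto view 2."""
    cost = 1 - torch.nn.functional.normalize(f1, dim=1) @ \
               torch.nn.functional.normalize(f2, dim=1).t()
    plan = sinkhorn(cost.detach())       # plan used as a soft correspondence
    return (plan * cost).sum()

def var_cov_regularizer(z, gamma=1.0):
    """VICReg-style variance/covariance terms (an assumed form of the regularizer)."""
    std = torch.sqrt(z.var(dim=0) + 1e-4)
    var_loss = torch.relu(gamma - std).mean()          # keep per-dim variance up
    zc = z - z.mean(dim=0)
    cov = (zc.t() @ zc) / (z.size(0) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.size(1)       # decorrelate dimensions
    return var_loss + cov_loss
```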
https://arxiv.org/abs/2404.11868
In the face of burgeoning image data, efficiently retrieving similar images poses a formidable challenge. Past research has focused on refining hash functions to distill images into compact indicators of resemblance. Initial attempts used shallow models, which evolved from Convolutional Neural Networks (CNNs) to attention-based architectures and more advanced models. Recognizing the limitations of gradient-based models for embedding spatial information, we propose an innovative image hashing method, NeuroHash, which leverages Hyperdimensional Computing (HDC). HDC symbolically encodes spatial information into high-dimensional vectors, reshaping image representation. Our approach combines pre-trained large vision models with HDC operations, enabling spatially encoded feature representations. Hashing with locality-sensitive hashing (LSH) ensures swift and efficient image retrieval. Notably, our framework allows dynamic hash manipulation for conditional image retrieval. Our work introduces a transformative image hashing framework enabling spatial-aware conditional retrieval. By seamlessly combining DNN-based neural and HDC-based symbolic models, our methodology breaks from traditional training, offering flexible and conditional image retrieval. Performance evaluations signify a paradigm shift in image-hashing methodologies, demonstrating enhanced retrieval accuracy.
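A toy sketch of the symbolic side, under the usual HDC conventions: patch features from a pre-trained vision model are bound to random position hypervectors, bundled into one scene vector, and hashed with sign-random-projection LSH. All dimensions and the random projection are illustrative assumptions:

```python
import numpy as np

D = 10_000  # hypervector dimensionality
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar hypervector, the basic HDC symbol."""
    return rng.choice([-1.0, 1.0], size=D)

def encode_image(patch_feats, pos_hvs, proj):
    """Bind each patch feature to its position hypervector, then bundle.
    `patch_feats` come from a pre-trained vision model; `proj` maps them
    into HDC space (a random projection here, as an assumption)."""
    bound = [np.sign(proj @ f) * p for f, p in zip(patch_feats, pos_hvs)]
    return np.sign(np.sum(bound, axis=0))  # bundled scene hypervector

def lsh_hash(hv, hyperplanes):
    """Sign-random-projection LSH: a compact binary code for fast retrieval."""
    return (hyperplanes @ hv > 0).astype(np.uint8)

# toy usage with 4 patches of 512-d features
proj = rng.standard_normal((D, 512))
pos_hvs = [random_hv() for _ in range(4)]
feats = [rng.standard_normal(512) for _ in range(4)]
code = lsh_hash(encode_image(feats, pos_hvs, proj), rng.standard_normal((64, D)))
```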
https://arxiv.org/abs/2404.11025
Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we attribute the prior underperformance of diffusion models to the absence of an effective latent space for image-text alignment and to the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.
https://arxiv.org/abs/2404.10763
Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate the ability of T2I models to capture intended subjects from news captions, we introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations. With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions. Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights. It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR. By releasing the ANCHOR dataset, we hope to motivate research in furthering the Natural Language Understanding (NLU) capabilities of T2I models.
https://arxiv.org/abs/2404.10141
The advent of Large Multimodal Models (LMMs) has sparked a surge in research aimed at harnessing their remarkable reasoning abilities. However, for understanding text-rich images, challenges persist in fully leveraging the potential of LMMs, and existing methods struggle with effectively processing high-resolution images. In this work, we propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding. TextCoT utilizes the captioning ability of LMMs to grasp the global context of the image and the grounding capability to examine local textual regions. This allows for the extraction of both global and local visual information, facilitating more accurate question-answering. Technically, TextCoT consists of three stages, including image overview, coarse localization, and fine-grained observation. The image overview stage provides a comprehensive understanding of the global scene information, and the coarse localization stage approximates the image area containing the answer based on the question asked. Then, integrating the obtained global image descriptions, the final stage further examines specific regions to provide accurate answers. Our method is free of extra training, offering immediate plug-and-play functionality. Extensive experiments are conducted on a series of text-rich image question-answering benchmark datasets based on several advanced LMMs, and the results demonstrate the effectiveness and strong generalization ability of our method. Code is available at this https URL.
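Since TextCoT is training-free, its three stages reduce to prompt orchestration. A minimal sketch with a generic `lmm(image, prompt)` callable; the prompt wording and the `crop_fn` helper are illustrative assumptions, not the authors' templates:

```python
def textcot_answer(image, question, lmm, crop_fn):
    # Stage 1: image overview — grasp the global scene context.
    overview = lmm(image, "Describe this image, including any visible text.")
    # Stage 2: coarse localization — ask where the answer likely resides.
    box_str = lmm(image, f"Question: {question}\n"
                         "Return the bounding box [x1,y1,x2,y2] of the image "
                         "region needed to answer it.")
    region = crop_fn(image, box_str)  # parse the box and crop (with some padding)
    # Stage 3: fine-grained observation — answer from the zoomed-in region,
    # conditioned on the global description from stage 1.
    return lmm(region, f"Global context: {overview}\nQuestion: {question}\n"
                       "Answer using the visible details in this region.")
```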
https://arxiv.org/abs/2404.09797
The emergence of Large Multimodal Models (LMMs) marks a significant milestone in the development of artificial intelligence. Insurance, as a vast and complex discipline, involves a wide variety of data forms in its operational processes, including text, images, and videos, thereby giving rise to diverse multimodal tasks. Despite this, there has been little systematic exploration of multimodal tasks specific to insurance and no thorough investigation into how LMMs can address these challenges. In this paper, we explore GPT-4V's capabilities in the insurance domain. We categorize multimodal tasks by focusing primarily on visual aspects based on types of insurance (e.g., auto, household/commercial property, health, and agricultural insurance) and insurance stages (e.g., risk assessment, risk monitoring, and claims processing). Our experiment reveals that GPT-4V exhibits remarkable abilities in insurance-related tasks, demonstrating not only a robust understanding of multimodal content in the insurance domain but also a comprehensive knowledge of insurance scenarios. However, there are notable shortcomings: GPT-4V struggles with detailed risk rating and loss assessment, suffers from hallucination in image understanding, and shows variable support for different languages. Through this work, we aim to bridge the insurance domain with cutting-edge LMM technology, facilitate interdisciplinary exchange and development, and provide a foundation for the continued advancement and evolution of future research endeavors.
https://arxiv.org/abs/2404.09690
Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and prioritize global image-level representations over crucial local pixel-level image-to-text alignment that is necessary for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. We propose a run-time prompt adaptation strategy, which first generates informative anomaly prompts to leverage the capabilities of a large language model (LLM). This strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner to fuse local pixel-level semantics for precise anomaly localization, by projecting the image-text alignment from global to local semantic spaces. Extensive evaluations on the challenging MVTec and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec AD and 8.9% on VisA compared to state-of-the-art zero-shot VAD approaches.
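The local image-to-text alignment can be pictured as scoring each patch token against normal/abnormal prompt embeddings; the sketch below shows that core idea only and is an assumption about (not a reproduction of) ALFA's fine-grained aligner:

```python
import torch
import torch.nn.functional as F

def anomaly_map(patch_feats, normal_emb, abnormal_emb, tau=0.07):
    """Project image-text alignment from global to local space: score each
    patch token against normal/abnormal prompt embeddings.
    Shapes: patch_feats (H*W, d); prompt embeddings (d,)."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    sims = torch.stack([patch_feats @ F.normalize(normal_emb, dim=-1),
                        patch_feats @ F.normalize(abnormal_emb, dim=-1)], dim=-1)
    probs = torch.softmax(sims / tau, dim=-1)
    return probs[..., 1]  # per-patch probability of "abnormal"
```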
https://arxiv.org/abs/2404.09654
This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge the two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assigning procedure as an optimal transport problem. We predict the assignment of one modality from the representation of the other, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved with visual data, since LLMs interpret and reason about linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible.
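A simplified sketch of the cross-modal consistency idea: a single linear layer projects vision features into the frozen word-embedding space, and paired data must yield matching soft assignments. The symmetric KL form here is an assumption; the paper casts the assignment as an optimal transport problem:

```python
import torch
import torch.nn.functional as F

def assignment(feats, word_emb, tau=0.1):
    """Soft-assign modality features to a set of frozen LLM word embeddings."""
    logits = F.normalize(feats, dim=-1) @ F.normalize(word_emb, dim=-1).t()
    return F.softmax(logits / tau, dim=-1)

def vlap_consistency_loss(vis_feats, txt_feats, word_emb, proj):
    """`proj` is the single linear layer mapping vision features into the
    word-embedding space; paired image/text data must produce consistent
    assignments (cross-prediction). A sketch under assumed interfaces."""
    p_vis = assignment(proj(vis_feats), word_emb)
    p_txt = assignment(txt_feats, word_emb)
    loss_v2t = F.kl_div(p_vis.log(), p_txt.detach(), reduction="batchmean")
    loss_t2v = F.kl_div(p_txt.log(), p_vis.detach(), reduction="batchmean")
    return 0.5 * (loss_v2t + loss_t2v)
```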
https://arxiv.org/abs/2404.09632
Few-shot learning has been successfully applied to medical image classification, as often only very few medical examples are available for training. Given the challenge posed by the limited number of annotated medical images, image representations should not be derived solely from a single image modality, which is insufficient for characterizing concept classes. In this paper, we propose PM2, a new prompting multi-modal model paradigm for medical image classification based on multi-modal foundation models. Besides the image modality, PM2 introduces a supplementary text input, known as a prompt, to further describe corresponding images or concept classes and facilitate few-shot learning across diverse modalities. To better explore the potential of prompt engineering, we empirically investigate five distinct prompt schemes under the new paradigm. Furthermore, linear probing in multi-modal models acts as a linear classification head that takes only the class token as input, completely ignoring the rich statistics inherent in high-level visual tokens. We therefore perform linear classification on the feature distribution of visual tokens and on the class token simultaneously. To effectively mine such rich statistics, global covariance pooling with efficient matrix power normalization is used to aggregate the visual tokens. We then study and combine two classification heads: one is shared between the class token of the image from the vision encoder and the prompt representation encoded by the text encoder; the other classifies the feature distribution of visual tokens from the vision encoder. Extensive experiments on three medical datasets show that our PM2 significantly outperforms counterparts regardless of prompt scheme and achieves state-of-the-art performance.
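For the second-order head, a sketch of global covariance pooling with matrix power normalization; the paper uses an efficient iterative normalization, whereas the eigendecomposition below is chosen for clarity:

```python
import torch

def covariance_pooling(tokens, alpha=0.5, eps=1e-5):
    """Global covariance pooling of visual tokens with matrix power
    normalization (alpha=0.5 gives the common matrix square root).
    tokens: (N, d) visual token features for one image."""
    z = tokens - tokens.mean(dim=0, keepdim=True)
    cov = (z.t() @ z) / (tokens.size(0) - 1)
    cov = cov + eps * torch.eye(cov.size(0))          # numerical stability
    evals, evecs = torch.linalg.eigh(cov)             # symmetric PSD matrix
    powered = evecs @ torch.diag(evals.clamp_min(0) ** alpha) @ evecs.t()
    return powered.flatten()                          # second-order descriptor
```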
https://arxiv.org/abs/2404.08915
Inference with Multimodal Large Language Models (MLLMs) is slow due to their large-language-model backbone, which suffers from a memory bandwidth bottleneck and generates tokens auto-regressively. In this paper, we explore the application of speculative decoding to enhance the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, bypassing the need for image tokens and their associated processing components in the draft model. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37$\times$ using a 115M parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results in other tasks.
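For readers unfamiliar with the technique, a greedy draft-and-verify sketch of speculative decoding is below; the callables and their shapes are assumptions, and how the language-only draft model skips image tokens is glossed over:

```python
import torch

@torch.no_grad()
def speculative_decode(target, draft, ids, n_new=64, k=5):
    """Greedy draft-and-verify speculative decoding, as a sketch. `target` is
    the full MLLM (its image tokens are already part of `ids`); `draft` is a
    small language-only model. Both are assumed callables mapping a token
    sequence (1, L) to per-position next-token logits (1, L, vocab)."""
    out = ids
    while out.size(1) - ids.size(1) < n_new:
        prop = out
        for _ in range(k):  # 1) draft k tokens cheaply, one at a time
            nxt = draft(prop)[:, -1].argmax(-1, keepdim=True)
            prop = torch.cat([prop, nxt], dim=1)
        # 2) verify all k drafted tokens with a single target-model pass;
        #    logits at position t predict the token at position t+1
        verified = target(prop)[:, -k - 1:-1].argmax(-1)
        drafted = prop[:, -k:]
        n_ok = int((verified == drafted).long().cumprod(-1).sum())
        # 3) keep the agreeing prefix, plus the target's correction on divergence
        out = torch.cat([out, drafted[:, :n_ok],
                         verified[:, n_ok:n_ok + 1]], dim=1)
    return out
```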
https://arxiv.org/abs/2404.08856
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content. Over the past few years, numerous neural architectures have been suggested for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at \url{this https URL}.
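The pipeline is simple enough to sketch end to end; all three callables and the prompt wording below are placeholders, not the paper's exact templates:

```python
def question_driven_vqa(image, question, keyword_model, captioner, llm):
    # 1) extract keywords from the question
    keywords = keyword_model(f"List the key visual words in: {question}")
    # 2) generate a caption for the image focused on those keywords
    caption = captioner(image, prompt=f"Describe the image, focusing on: {keywords}")
    # 3) answer from the caption alone — the LLM never sees the image
    return llm(f"Caption: {caption}\nQuestion: {question}\nShort answer:")
```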
https://arxiv.org/abs/2404.08589
Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where views with high alignment closely represent the object's characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.
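A heavily simplified sketch of the ranking step: if we assume the text-to-3D model exposes a per-view denoising error as an alignment proxy (lower is better, which is our assumption rather than the paper's exact scoring rule), ranking reduces to a sort:

```python
import torch

@torch.no_grad()
def rank_views(views, caption, denoise_error, top_k=6):
    """`denoise_error(view, caption)` is a placeholder returning the
    pre-trained model's denoising loss for a rendered view."""
    errors = torch.tensor([denoise_error(v, caption) for v in views])
    order = torch.argsort(errors)             # ascending: best-aligned first
    return [views[i] for i in order[:top_k]]  # feed these to GPT4-Vision
```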
https://arxiv.org/abs/2404.07984
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by a fixed, pre-trained visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
https://arxiv.org/abs/2404.07973
Deep neural networks that achieve remarkable performance in image classification have previously been shown to be easily fooled by tiny transformations such as a one-pixel translation of the input image. In order to address this problem, two approaches have been proposed in recent years. The first suggests using huge datasets together with data augmentation, in the hope that a highly varied training set will teach the network to be invariant. The second suggests architectural modifications based on sampling theory to deal explicitly with image translations. In this paper, we show that these approaches still fall short in robustly handling 'natural' image translations that simulate a subtle change in camera orientation. Our findings reveal that a mere one-pixel translation can result in a significant change in the predicted image representation for approximately 40% of the test images in state-of-the-art models (e.g., open-CLIP trained on LAION-2B, or DINO-v2), while models that are explicitly constructed to be robust to cyclic translations can still be fooled by one-pixel realistic (non-cyclic) translations 11% of the time. We present Robust Inference by Crop Selection: a simple method that can be proven to achieve any desired level of consistency, although with a modest tradeoff in the model's accuracy. Importantly, we demonstrate how employing this method reduces the ability to fool state-of-the-art models with a one-pixel translation to less than 5%, while suffering only a 1% drop in classification accuracy. Additionally, we show that our method can be easily adjusted to deal with circular shifts as well. In that case, we achieve 100% robustness to integer shifts with state-of-the-art accuracy and no need for any further training.
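One plausible instantiation of the crop-selection idea (an illustrative assumption, not the authors' exact selection rule): pick the crop window by a content-dependent score, so translating the input translates the selected window with it and the classifier sees (nearly) identical pixels:

```python
import numpy as np

def select_crop(img, crop_size, stride=1):
    """Anchor the crop at a content-dependent location — here, the window
    with maximal mean intensity. A sketch of the consistency mechanism only."""
    H, W = img.shape[:2]
    best_yx, best_score = (0, 0), -np.inf
    for y in range(0, H - crop_size + 1, stride):
        for x in range(0, W - crop_size + 1, stride):
            score = img[y:y + crop_size, x:x + crop_size].mean()
            if score > best_score:
                best_score, best_yx = score, (y, x)
    y, x = best_yx
    return img[y:y + crop_size, x:x + crop_size]
```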
https://arxiv.org/abs/2404.07153
Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained on large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average mIoU points and without requiring any training.
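The test-time matching step can be sketched as nearest-prototype classification of class-agnostic region features against the offline reference embeddings; the mean-top-k scoring below is a simplification of FreeDA's local-global similarity:

```python
import torch
import torch.nn.functional as F

def label_regions(region_feats, prototypes, class_names, k=5):
    """Training-free assignment sketch: match region features against
    pre-collected textual-visual reference embeddings (`prototypes`, one
    (n_i, d) tensor per class) and pick the class with the highest mean
    top-k cosine similarity. Interfaces are assumptions for illustration."""
    region_feats = F.normalize(region_feats, dim=-1)
    labels = []
    for r in region_feats:
        scores = []
        for protos in prototypes:
            sims = F.normalize(protos, dim=-1) @ r
            scores.append(sims.topk(min(k, sims.numel())).values.mean())
        labels.append(class_names[int(torch.stack(scores).argmax())])
    return labels
```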
https://arxiv.org/abs/2404.06542
Radiologists highly desire fully automated, versatile AI for medical imaging interpretation. However, the lack of extensively annotated, large-scale multi-disease datasets has hindered the achievement of this goal. In this paper, we explore the feasibility of leveraging language as naturally high-quality supervision for chest CT imaging. In light of the limited availability of image-report pairs, we bootstrap the understanding of 3D chest CT images by distilling chest-related diagnostic knowledge from an extensively pre-trained 2D X-ray expert model. Specifically, we propose a language-guided retrieval method to match each 3D CT image with its semantically closest 2D X-ray image, and perform pair-wise and semantic-relation knowledge distillation. Subsequently, we use contrastive learning to align images and reports within the same patient while distinguishing them from those of other patients. However, a challenge arises when patients have semantically similar diagnoses (e.g., healthy patients), whose pairs can be misleading if treated as negatives. We introduce a robust contrastive learning scheme that identifies and corrects these false negatives. We train our model with over 12,000 pairs of chest CT images and radiology reports. Extensive experiments across multiple scenarios, including zero-shot learning, report generation, and fine-tuning processes, demonstrate the model's feasibility in interpreting chest CT images.
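A sketch of the false-negative-corrected contrastive objective: a CLIP-style loss in which off-diagonal pairs with near-identical report embeddings are masked out rather than pushed apart. The cosine-threshold rule is an assumption about how false negatives are identified:

```python
import torch
import torch.nn.functional as F

def robust_contrastive_loss(img_emb, txt_emb, tau=0.07, fn_thresh=0.9):
    """Symmetric image-report contrastive loss with false-negative masking.
    Pairs whose report embeddings nearly coincide (e.g., two healthy
    patients) are dropped as negatives instead of being pushed apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau
    txt_sim = txt_emb @ txt_emb.t()
    n = logits.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    false_neg = (txt_sim > fn_thresh) & ~eye                # semantically-equal pairs
    logits = logits.masked_fill(false_neg, float("-inf"))   # remove as negatives
    targets = torch.arange(n)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```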
https://arxiv.org/abs/2404.04936
Current remote-sensing interpretation models often focus on a single task such as detection, segmentation, or captioning. However, task-specific models cannot achieve comprehensive, multi-level interpretation of images. The field also lacks datasets supporting multi-task joint interpretation. In this paper, we propose Panoptic Perception, a novel task, and a new fine-grained dataset (FineGrip) to achieve a more thorough and universal interpretation of remote sensing images (RSIs). The new task 1) integrates pixel-level, instance-level, and image-level information for universal image perception, 2) captures image information from coarse to fine granularity, achieving deeper scene understanding and description, and 3) enables various independent tasks to complement and enhance each other through multi-task learning. By emphasizing multi-task interactions and the consistency of perception results, this task enables the simultaneous processing of fine-grained foreground instance segmentation, background semantic segmentation, and global fine-grained image captioning. Concretely, the FineGrip dataset includes 2,649 remote sensing images, 12,054 fine-grained instance segmentation masks belonging to 20 foreground things categories, 7,599 background semantic masks for 5 stuff classes, and 13,245 captioning sentences. Furthermore, we propose a joint optimization-based panoptic perception model. Experimental results on FineGrip demonstrate the feasibility of the panoptic perception task and the beneficial effect of multi-task joint optimization on individual tasks. The dataset will be publicly available.
https://arxiv.org/abs/2404.04608
Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between text prompts and images is still challenging. The root cause behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL on two text-to-image alignment benchmarks and achieves state-of-the-art performance.
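The concept-matching mechanism can be sketched as scoring the training prompt under a frozen, differentiable captioning model applied to the generated image, so that ignored tokens incur a likelihood penalty; the interface below is assumed, not the authors' implementation:

```python
import torch

def concept_matching_loss(generated_image, prompt_token_ids, captioner):
    """Negative log-likelihood of the prompt under a frozen captioning model.
    `captioner(image, ids)` is assumed to return per-token logits and to be
    differentiable w.r.t. the image, so the gradient pushes the diffusion
    model to render the concepts it previously ignored."""
    logits = captioner(generated_image, prompt_token_ids[:, :-1])
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        prompt_token_ids[:, 1:].reshape(-1))
```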
https://arxiv.org/abs/2404.03653
We investigate the impact of deep generative models on potential social biases in upcoming computer vision models. As the internet witnesses an increasing influx of AI-generated images, concerns arise regarding inherent biases that may accompany them, potentially leading to the dissemination of harmful content. This paper explores whether a detrimental feedback loop, resulting in bias amplification, would occur if generated images were used as the training data for future models. We conduct simulations by progressively substituting original images in the COCO and CC3M datasets with images generated through Stable Diffusion. The modified datasets are used to train OpenCLIP and image captioning models, which we evaluate in terms of quality and bias. Contrary to expectations, our findings indicate that introducing generated images during training does not uniformly amplify bias. Instead, instances of bias mitigation across specific tasks are observed. We further explore the factors that may influence these phenomena, such as artifacts in image generation (e.g., blurry faces) or pre-existing biases in the original datasets.
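The simulation protocol itself is easy to sketch: swap a growing fraction of originals for generations of the same captions and retrain per ratio. All callables below are placeholders for the paper's actual pipeline:

```python
import random

def build_mixed_dataset(pairs, generate, ratio, seed=0):
    """pairs: list of (image, caption); `generate(caption)` stands in for
    Stable Diffusion. Swap a fixed random fraction of originals for generations."""
    rng = random.Random(seed)
    swapped = set(rng.sample(range(len(pairs)), int(ratio * len(pairs))))
    return [(generate(cap) if i in swapped else img, cap)
            for i, (img, cap) in enumerate(pairs)]

def run_simulation(pairs, generate, train_and_eval,
                   ratios=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    # train a fresh model per substitution ratio; record quality/bias metrics
    return {r: train_and_eval(build_mixed_dataset(pairs, generate, r)) for r in ratios}
```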
https://arxiv.org/abs/2404.03242