In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
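To make the notion of ranking similarity for paraphrased queries concrete, here is a minimal sketch (not the paper's exact metric) that retrieves a gallery with a CLIP-style dual encoder and compares the results of two paraphrased queries; the random embeddings, top-k cutoff, and Jaccard overlap below are illustrative assumptions.

```python
import numpy as np

def rank_images(query_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by descending cosine similarity."""
    sims = image_embs @ query_emb              # (N,) similarities (unit vectors)
    return np.argsort(-sims)

def topk_overlap(query_a: np.ndarray, query_b: np.ndarray,
                 image_embs: np.ndarray, top_k: int = 10) -> float:
    """Jaccard overlap of the top-k retrievals for two paraphrased queries."""
    top_a = set(rank_images(query_a, image_embs)[:top_k].tolist())
    top_b = set(rank_images(query_b, image_embs)[:top_k].tolist())
    return len(top_a & top_b) / len(top_a | top_b)

# Toy usage with random unit vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 512))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
q1, q2 = rng.normal(size=(2, 512))
q1, q2 = q1 / np.linalg.norm(q1), q2 / np.linalg.norm(q2)
print(topk_overlap(q1, q2, gallery))           # 1.0 means identical top-10 results
```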
https://arxiv.org/abs/2405.03190
Effective image classification hinges on discerning relevant features from both foreground and background elements, with the foreground typically holding the critical information. While humans adeptly classify images with limited exposure, artificial neural networks often struggle with feature selection from rare samples. To address this challenge, we propose a novel method for selecting class-relevant patch embeddings. Our approach involves splitting support and query images into patches, encoding them using a pre-trained Vision Transformer (ViT) to obtain class embeddings and patch embeddings, respectively. Subsequently, we filter patch embeddings using class embeddings to retain only the class-relevant ones. For each image, we calculate the similarity between class embedding and each patch embedding, sort the similarity sequence in descending order, and only retain top-ranked patch embeddings. By prioritizing similarity between the class embedding and patch embeddings, we select top-ranked patch embeddings to be fused with class embedding to form a comprehensive image representation, enhancing pattern recognition across instances. Our strategy effectively mitigates the impact of class-irrelevant patch embeddings, yielding improved performance in pre-trained models. Extensive experiments on popular few-shot classification benchmarks demonstrate the simplicity, efficacy, and computational efficiency of our approach, outperforming state-of-the-art baselines under both 5-shot and 1-shot scenarios.
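The patch-selection step described above can be sketched in a few lines; the top-k ratio and the mean-pooling fusion below are assumptions for illustration, not necessarily the paper's exact choices.

```python
import numpy as np

def select_and_fuse(class_emb: np.ndarray, patch_embs: np.ndarray,
                    top_ratio: float = 0.3) -> np.ndarray:
    """class_emb: (D,), patch_embs: (P, D) from a frozen, pre-trained ViT."""
    c = class_emb / np.linalg.norm(class_emb)
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    sims = p @ c                                  # (P,) cosine similarities
    k = max(1, int(top_ratio * len(sims)))
    top_idx = np.argsort(-sims)[:k]               # sort descending, keep top-k
    selected = patch_embs[top_idx]                # class-relevant patches only
    # Fuse the selected patch embeddings with the class embedding (simple average).
    return np.mean(np.vstack([class_emb[None, :], selected]), axis=0)

# Toy usage: 196 patch tokens of dimension 384.
rng = np.random.default_rng(0)
rep = select_and_fuse(rng.normal(size=384), rng.normal(size=(196, 384)))
print(rep.shape)  # (384,)
```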
https://arxiv.org/abs/2405.03722
Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at this https URL.
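A conceptual sketch of the textual-inversion step is shown below; the mapping function `phi` and the caption template are placeholders standing in for the learned components, not the authors' released interface.

```python
import numpy as np

def phi(image_feature: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy stand-in for the learned mapping from CLIP image space to the
    token-embedding space (here just a linear projection)."""
    return W @ image_feature

def compose_query(image_feature: np.ndarray, relative_caption: str,
                  W: np.ndarray):
    pseudo_token = phi(image_feature, W)       # pseudo-word "$" embedding
    # The composed query places the pseudo-word before the edit request,
    # e.g. "a photo of $ that <relative caption>"; at encoding time the "$"
    # token embedding is replaced by `pseudo_token`.
    template = f"a photo of $ that {relative_caption}"
    return template, pseudo_token

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))                # toy projection matrix
tmpl, tok = compose_query(rng.normal(size=512), "has a red roof instead", W)
print(tmpl, tok.shape)
```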
https://arxiv.org/abs/2405.02951
In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose a multi-modal music generation framework, Mozart's Touch. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and the results indicate that our model surpasses the performance of current state-of-the-art models. Our code and examples are available at: this https URL
https://arxiv.org/abs/2405.02801
Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.
https://arxiv.org/abs/2405.02793
The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution within datasets provides a comprehensive understanding of the datasets, standing as a powerful tool beneficial to various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, to our knowledge no prior work has systematically explored the subpopulation distribution of datasets. To address this limitation and solve all the mentioned tasks in a unified way, we introduce a novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs the world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows to address downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.
https://arxiv.org/abs/2405.02363
Large Vision-Language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present either in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.
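For concreteness, one possible representation of a single V-FLUTE instance is sketched below; the field names mirror the <image, claim, label, explanation> tuples, while the example values and label set are illustrative assumptions rather than actual dataset entries.

```python
from dataclasses import dataclass

@dataclass
class VFluteInstance:
    image_path: str          # premise image
    claim: str               # hypothesis to verify against the image
    label: str               # e.g. "entailment" or "contradiction"
    explanation: str         # textual justification of the predicted label
    phenomenon: str          # metaphor, simile, idiom, sarcasm, or humor

# Hypothetical example for illustration only.
example = VFluteInstance(
    image_path="images/0001.jpg",
    claim="The deadline is a distant thunderstorm rolling closer.",
    label="entailment",
    explanation="The image shows storm clouds approaching a calendar, "
                "matching the metaphor of a looming deadline.",
    phenomenon="metaphor",
)
print(example.label)
```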
https://arxiv.org/abs/2405.01474
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLM requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12 increase on GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800. We are the first to explore the efficient 3D-LLM, offering new insights to the community. Code and weights are available at this https URL.
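As a rough illustration of the mixture-of-query-experts idea (a generic sketch, not MiniGPT-3D's actual module), a gating vector computed from the point-cloud features can blend several expert query sets before attention-based aggregation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_query_experts(point_feats, expert_queries, gate_w):
    """point_feats: (N, D); expert_queries: (E, Q, D); gate_w: (D, E)."""
    pooled = point_feats.mean(axis=0)                        # (D,) global descriptor
    gate = softmax(pooled @ gate_w)                          # (E,) expert weights
    queries = np.einsum("e,eqd->qd", gate, expert_queries)   # (Q, D) blended queries
    attn = softmax(queries @ point_feats.T / np.sqrt(point_feats.shape[1]),
                   axis=-1)                                  # (Q, N) attention weights
    return attn @ point_feats                                # (Q, D) aggregated tokens

rng = np.random.default_rng(0)
out = mixture_of_query_experts(rng.normal(size=(1024, 256)),
                               rng.normal(size=(4, 32, 256)),
                               rng.normal(size=(256, 4)))
print(out.shape)  # (32, 256)
```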
https://arxiv.org/abs/2405.01413
This report presents the ECO (Ensembled Clip score and cOnsensus score) pipeline from team DSBA LAB, a new framework used to evaluate and rank captions for a given image. ECO selects the most accurate caption describing the image. It does so by combining an Ensembled CLIP score, which considers the semantic alignment between the image and the captions, with a Consensus score that accounts for the essentialness of the captions. Using this framework, we achieved notable success in the CVPR 2024 Workshop Challenge on Caption Re-ranking Evaluation at the New Frontiers for Zero-Shot Image Captioning Evaluation (NICE). Specifically, we secured third place on the CIDEr metric, second on both the SPICE and METEOR metrics, and first on the ROUGE-L and all BLEU score metrics. The code and configuration for the ECO framework are available at this https URL DSBA-Lab/ECO.
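A minimal sketch of the two ingredients is given below: an ensembled CLIP score averaged over several CLIP variants, and a consensus score measuring how typical each candidate caption is relative to the others. The mixing weight and the cosine-based consensus are illustrative assumptions rather than the team's exact formulation.

```python
import numpy as np

def eco_rank(image_embs_per_model, caption_embs_per_model, alpha=0.5):
    """
    image_embs_per_model: list of (D,) image embeddings, one per CLIP model.
    caption_embs_per_model: list of (C, D) caption embeddings, one per model.
    Returns candidate caption indices ranked best-first.
    """
    clip_scores, consensus = [], []
    for img, caps in zip(image_embs_per_model, caption_embs_per_model):
        img = img / np.linalg.norm(img)
        caps = caps / np.linalg.norm(caps, axis=1, keepdims=True)
        clip_scores.append(caps @ img)                 # (C,) image-text similarity
        # Consensus: mean similarity of each caption to all other candidates.
        sim = caps @ caps.T
        consensus.append((sim.sum(axis=1) - 1.0) / (len(caps) - 1))
    score = alpha * np.mean(clip_scores, axis=0) + \
            (1 - alpha) * np.mean(consensus, axis=0)
    return np.argsort(-score)

rng = np.random.default_rng(0)
ranking = eco_rank([rng.normal(size=512)], [rng.normal(size=(5, 512))])
print(ranking)  # best caption index first
```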
https://arxiv.org/abs/2405.01028
The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor, while existing automated approaches still necessitate extensive training to integrate multimodal inputs and tailor the output from a captioning style to an AD style. In this paper, we introduce an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision). Notably, our methodology employs readily available components, eliminating the need for additional training. It produces ADs that not only comply with established natural language AD production standards but also maintain contextually consistent character information across frames, courtesy of a tracking-based character recognition module. A thorough analysis on the MAD dataset reveals that our approach achieves a performance on par with learning-based methods in automated AD production, as substantiated by a CIDEr score of 20.5.
https://arxiv.org/abs/2405.00983
Communication is defined as "Who says what to whom with what effect." A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Yet despite carrying these signals, behavior data is often ignored when training large language models. We show that training LLMs on receiver behavior can actually help improve their content-understanding abilities. Specifically, we show that training LLMs to predict the receiver behavior of likes and comments improves their performance on a wide variety of downstream content understanding tasks. We demonstrate this performance increase across 40 video and image understanding tasks on 23 benchmark datasets in both 0-shot and fine-tuning settings, outperforming many supervised baselines. Moreover, since receiver behavior, such as likes and comments, is collected by default on the internet and does not need any human annotations to be useful, the performance improvement we get after training on this data is essentially a free lunch. We release the cleaned receiver-behavior data (comments and likes) for 750k images and videos collected from multiple platforms, along with our instruction-tuning data.
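A hypothetical record illustrating what such receiver-behavior instruction-tuning data could look like is shown below; the field names, like-count buckets, and prompt wording are assumptions, not the released dataset's actual schema.

```python
# Illustrative training record: the model is asked to predict receiver
# behavior (a like-count bucket and a plausible top comment) from content.
behavior_record = {
    "content_id": "video_000123",
    "caption": "Timelapse of a glacier calving into the sea",
    "instruction": "Given the video description, predict how viewers will "
                   "react: estimate the like-count bucket and write one "
                   "likely top comment.",
    "target": {
        "like_bucket": "10k-100k",
        "top_comment": "The scale of that ice wall is unreal.",
    },
}
print(behavior_record["instruction"])
```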
https://arxiv.org/abs/2405.00942
Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, a universal segmentation model by Meta AI, Segment Anything Model (SAM) shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is indubitably essential to test the performance of VLMs and foundation models such as SAM, on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable at isolating artefacts in a general sense. However, the performance is not close to that of a domain expert - the models are readily encumbered by the introduction of impurities, defects, artefact overlaps and diversity present in the images.
https://arxiv.org/abs/2405.00876
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually-annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text-side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of the scale of the training dataset and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, thereby highlighting its wider potential. The integration of the Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
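The Slerp operation at the core of the composed query is standard and can be sketched directly; the interpolation weight t is a hyperparameter (0 returns the image embedding, 1 the text embedding), and the embeddings below are random stand-ins for encoder outputs.

```python
import numpy as np

def slerp(v0: np.ndarray, v1: np.ndarray, t: float, eps: float = 1e-7) -> np.ndarray:
    """Spherical linear interpolation between two embeddings on the unit sphere."""
    v0 = v0 / np.linalg.norm(v0)
    v1 = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(v0, v1), -1.0, 1.0)
    theta = np.arccos(dot)                      # angle between the embeddings
    if theta < eps:                             # nearly parallel: fall back to lerp
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=(2, 512))
composed = slerp(img_emb, txt_emb, t=0.5)       # intermediate composed query
print(np.linalg.norm(composed))                 # stays on the unit sphere
```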
https://arxiv.org/abs/2405.00571
We introduce a formal information-theoretic framework for image captioning by regarding it as a representation learning task. Our framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building upon this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. This approach leverages the intuition that detailed examination of local patches can reduce error risks and address inaccuracies in global captions, whether by correcting hallucinations or adding missing details. Based on our theoretical framework, we formalize this intuition and provide a formal proof demonstrating the effectiveness of PoCa under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions while maintaining brevity and interpretability.
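A hedged sketch of the PoCa procedure is given below; `caption_model` and `llm` are placeholder callables for whichever captioner and language model are used, and the merge prompt is illustrative rather than the authors' exact wording.

```python
from typing import Callable, List

def pyramid_of_captions(image, patches: List[object],
                        caption_model: Callable[[object], str],
                        llm: Callable[[str], str]) -> str:
    """Caption the full image and zoomed-in patches, then merge with an LLM."""
    global_caption = caption_model(image)
    local_captions = [caption_model(p) for p in patches]
    prompt = (
        "Global caption:\n" + global_caption + "\n\n"
        "Captions of zoomed-in regions:\n" +
        "\n".join(f"- {c}" for c in local_captions) + "\n\n"
        "Merge these into one caption that keeps the correct global context, "
        "fixes details contradicted by the regions, and adds missing details."
    )
    return llm(prompt)
```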
https://arxiv.org/abs/2405.00485
There has been growing interest in audio-language retrieval research, where the objective is to establish the correlation between the audio and text modalities. However, most audio-text paired datasets lack rich textual expression compared to the audio samples. One of the significant challenges facing audio-text datasets is the presence of similar or identical captions for different audio samples. Under such many-to-one mapping conditions, audio-text datasets lead to poor retrieval performance. In this paper, we propose a novel approach to tackle this data imbalance problem in the audio-language retrieval task. To overcome the limitation, we introduce a distance sampling-based paraphraser leveraging ChatGPT, which uses a distance function to generate a controllable distribution of manipulated text data. For a set of sentences with the same context, the distance is used to calculate a degree of manipulation for any two sentences, and ChatGPT's few-shot prompting is performed using a text cluster of similar distance, where distance is defined by the Jaccard similarity. Therefore, ChatGPT, when applied to few-shot prompting with text clusters, can adjust the diversity of the manipulated text based on the distance. The proposed approach is shown to significantly enhance performance in audio-text retrieval, outperforming conventional text augmentation techniques.
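The distance computation can be made concrete with word-set Jaccard similarity, as sketched below; the thresholds used to group captions of similar distance for few-shot prompting are illustrative assumptions.

```python
def jaccard_distance(sent_a: str, sent_b: str) -> float:
    """Jaccard distance between the word sets of two captions."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return 1.0 - len(a & b) / len(a | b)

def cluster_by_distance(anchor: str, candidates: list[str],
                        low: float = 0.3, high: float = 0.6) -> list[str]:
    """Keep candidates whose distance to the anchor falls within [low, high]."""
    return [c for c in candidates
            if low <= jaccard_distance(anchor, c) <= high]

caps = ["a dog barks in the yard",
        "a dog is barking outside in the yard",
        "rain falls on a tin roof"]
print(cluster_by_distance(caps[0], caps[1:]))  # keeps only the near-paraphrase
```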
https://arxiv.org/abs/2405.00367
App developers use the Graphical User Interface (GUI) of other apps as an important source of inspiration to design and improve their own apps. In recent years, research has suggested various approaches to retrieve GUI designs that fit a certain text query from screenshot datasets acquired through automated GUI exploration. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements in the screenshots, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and often lack important app features, e.g., UI pages that require user authentication. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which we trained specifically for the app GUI domain. For this, we first collected app introduction images from Google Play, which usually display the most representative screenshots selected and often captioned (i.e., labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This finally results in a large dataset which we share with this paper: it includes 303k app screenshots, out of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of UIClip on other GUI tasks, including GUI classification and Sketch-to-GUI retrieval, with encouraging results.
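For reference, the retrieval metrics quoted above can be computed as sketched below for a CLIP-style text-to-GUI model, assuming pre-computed query and screenshot embeddings and a set of relevant screenshot indices per query.

```python
import numpy as np

def recall_and_hit_at_k(text_embs, screen_embs, relevant, k=10):
    """text_embs: (Q, D); screen_embs: (N, D); relevant: list of index sets."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    s = screen_embs / np.linalg.norm(screen_embs, axis=1, keepdims=True)
    topk = np.argsort(-(t @ s.T), axis=1)[:, :k]          # (Q, k) ranked screenshot ids
    recalls, hits = [], []
    for ranked, rel in zip(topk, relevant):
        found = len(set(ranked.tolist()) & rel)
        recalls.append(found / len(rel))                  # Recall@k per query
        hits.append(1.0 if found > 0 else 0.0)            # HIT@k per query
    return float(np.mean(recalls)), float(np.mean(hits))

rng = np.random.default_rng(0)
r10, h10 = recall_and_hit_at_k(rng.normal(size=(5, 256)),
                               rng.normal(size=(100, 256)),
                               [{i} for i in range(5)])
print(r10, h10)
```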
https://arxiv.org/abs/2405.00145
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
https://arxiv.org/abs/2404.19753
Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original image and the image reconstructed from the caption by a text-to-image model; 3) a human study on Amazon Mechanical Turk; and 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
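The CLIP-Image-Score metric reduces to a cosine similarity between two CLIP image embeddings, as sketched below; producing the embeddings (encoding the original image and the image regenerated from the caption) is left to whichever CLIP and text-to-image models are used, and the random vectors here are placeholders.

```python
import numpy as np

def clip_image_score(original_emb: np.ndarray, reconstructed_emb: np.ndarray) -> float:
    """Cosine similarity between the CLIP embeddings of the original image and
    the image reconstructed from the caption by a text-to-image model."""
    a = original_emb / np.linalg.norm(original_emb)
    b = reconstructed_emb / np.linalg.norm(reconstructed_emb)
    return float(a @ b)

# Usage outline with stand-in embeddings.
rng = np.random.default_rng(0)
print(clip_image_score(rng.normal(size=768), rng.normal(size=768)))
```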
https://arxiv.org/abs/2404.19752
Anomaly synthesis is an effective method to augment abnormal samples for training. However, current anomaly synthesis methods predominantly rely on texture information as input, which limits the fidelity of synthesized abnormal samples, because texture information is insufficient to correctly depict the pattern of anomalies, especially for logical anomalies. To surmount this obstacle, we present the AnomalyXFusion framework, designed to harness multi-modality information to enhance the quality of synthesized abnormal samples. The AnomalyXFusion framework comprises two distinct yet synergistic modules: the Multi-modal In-Fusion (MIF) module and the Dynamic Dif-Fusion (DDF) module. The MIF module refines modality alignment by aggregating and integrating various modality features into a unified embedding space, termed X-embedding, which includes image, text, and mask features. Concurrently, the DDF module facilitates controlled generation through an adaptive adjustment of the X-embedding conditioned on the diffusion steps. In addition, to reveal the multi-modality representational power of AnomalyXFusion, we propose a new dataset, called MVTec Caption. More precisely, MVTec Caption adds 2.2k accurate image-mask-text annotations for the MVTec AD and LOCO datasets. Comprehensive evaluations demonstrate the effectiveness of AnomalyXFusion, especially regarding the fidelity and diversity for logical anomalies. Project page: https://github.com/hujiecpp/MVTec-Caption
https://arxiv.org/abs/2404.19444
In patent prosecution, image-based retrieval systems for identifying similarities between current patent images and prior art are pivotal to ensure the novelty and non-obviousness of patent applications. Despite their growing popularity in recent years, existing attempts, while effective at recognizing images within the same patent, fail to deliver practical value due to their limited generalizability in retrieving relevant prior art. Moreover, this task inherently involves the challenges posed by the abstract visual features of patent images, the skewed distribution of image classifications, and the semantic information of image descriptions. Therefore, we propose a language-informed, distribution-aware multimodal approach to patent image feature learning, which enriches the semantic understanding of patent image by integrating Large Language Models and improves the performance of underrepresented classes with our proposed distribution-aware contrastive losses. Extensive experiments on DeepPatent2 dataset show that our proposed method achieves state-of-the-art or comparable performance in image-based patent retrieval with mAP +53.3%, Recall@10 +41.8%, and MRR@10 +51.9%. Furthermore, through an in-depth user analysis, we explore our model in aiding patent professionals in their image retrieval efforts, highlighting the model's real-world applicability and effectiveness.
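As a generic illustration of a distribution-aware contrastive loss (not the paper's exact formulation), an InfoNCE objective can be re-weighted by inverse class frequency so that underrepresented patent-image classes contribute more:

```python
import numpy as np

def weighted_info_nce(anchors, positives, class_ids, class_counts, tau=0.07):
    """anchors, positives: (B, D) paired embeddings; class_ids: (B,) labels;
    class_counts: dict mapping class id to its frequency in the dataset."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                         # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -np.diag(log_prob)                # positive pair sits on the diagonal
    weights = 1.0 / np.array([class_counts[c] for c in class_ids])
    weights = weights / weights.sum()              # inverse-frequency weighting
    return float((weights * per_sample).sum())

rng = np.random.default_rng(0)
loss = weighted_info_nce(rng.normal(size=(8, 128)), rng.normal(size=(8, 128)),
                         class_ids=[0, 0, 0, 1, 1, 2, 2, 3],
                         class_counts={0: 500, 1: 60, 2: 20, 3: 5})
print(loss)
```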
https://arxiv.org/abs/2404.19360