We introduce a formal information-theoretic framework for image captioning by regarding it as a representation learning task. Our framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building upon this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. This approach leverages the intuition that detailed examination of local patches can reduce error risks and address inaccuracies in global captions, either by correcting hallucinations or by adding missing details. Based on our theoretical framework, we formalize this intuition and provide a formal proof demonstrating the effectiveness of PoCa under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions while maintaining brevity and interpretability.
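A minimal sketch of how such a caption pyramid could be assembled, assuming hypothetical `caption_model` and `llm` callables (these names are illustrative, not from the paper's released code):

```python
from PIL import Image

def pyramid_caption(image: Image.Image, caption_model, llm, grid: int = 2) -> str:
    """PoCa-style sketch: caption the full image, caption each zoomed-in
    patch, then ask an LLM to merge local details into the global caption."""
    global_caption = caption_model(image)

    w, h = image.size
    local_captions = []
    for i in range(grid):
        for j in range(grid):
            patch = image.crop((i * w // grid, j * h // grid,
                                (i + 1) * w // grid, (j + 1) * h // grid))
            local_captions.append(caption_model(patch))

    prompt = (
        "Global caption: " + global_caption + "\n"
        "Local patch captions: " + "; ".join(local_captions) + "\n"
        "Merge these into one concise caption, correcting any detail in the "
        "global caption that conflicts with the local captions."
    )
    return llm(prompt)
```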
https://arxiv.org/abs/2405.00485
There has been growing interest in audio-language retrieval research, where the objective is to establish the correlation between audio and text modalities. However, most audio-text paired datasets lack rich expression in their text data compared to the audio samples. One of the significant challenges facing audio-text datasets is the presence of similar or identical captions for different audio samples. Under such many-to-one mapping conditions, audio-text datasets lead to poor performance on retrieval tasks. In this paper, we propose a novel approach to tackle the data imbalance problem in the audio-language retrieval task. To overcome this limitation, we introduce a method that employs a distance sampling-based paraphraser leveraging ChatGPT, utilizing a distance function to generate a controllable distribution of manipulated text data. For a set of sentences with the same context, the distance is used to calculate the degree of manipulation between any two sentences, and ChatGPT's few-shot prompting is performed using a text cluster with a similar distance, defined by the Jaccard similarity. ChatGPT, when applied to few-shot prompting with such text clusters, can therefore adjust the diversity of the manipulated text based on the distance. The proposed approach is shown to significantly enhance performance in audio-text retrieval, outperforming conventional text augmentation techniques.
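As an illustration of the distance used to group prompting exemplars, a word-level Jaccard distance and a simple binning of paraphrase pairs by manipulation degree might look as follows (function names and the bin width are assumptions, not taken from the paper):

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 - Jaccard similarity over word sets; 0 means identical vocabulary."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)

def cluster_by_distance(pairs, bin_width: float = 0.2):
    """Group (original, paraphrase) pairs into bins of similar manipulation
    degree, so few-shot prompts can be drawn from a chosen distance range."""
    clusters = {}
    for orig, para in pairs:
        d = jaccard_distance(orig, para)
        key = round(d / bin_width) * bin_width
        clusters.setdefault(key, []).append((orig, para))
    return clusters
```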
https://arxiv.org/abs/2405.00367
App developers use the Graphical User Interface (GUI) of other apps as an important source of inspiration to design and improve their own apps. In recent years, research has suggested various approaches to retrieve GUI designs that fit a certain text query from screenshot datasets acquired through automated GUI exploration. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements in the screenshots, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and often lack important app features, e.g. UI pages that require user authentication. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which we trained specifically for the app GUI domain. For this, we first collected app introduction images from Google Play, which usually display the most representative screenshots, selected and often captioned (i.e. labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This results in a large dataset, which we share with this paper, including 303k app screenshots, 135k of which have captions. We used this dataset to train a novel vision-language model which is, to the best of our knowledge, the first of its kind in GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of UIClip for other GUI tasks, including GUI classification and Sketch-to-GUI retrieval, with encouraging results.
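The reported Recall@10 and HIT@10 can be computed from CLIP-style text and image embeddings roughly as below (a sketch assuming cosine-similarity ranking; the variable names are illustrative):

```python
import numpy as np

def retrieval_metrics(text_emb, image_emb, relevant, k: int = 10):
    """Rank screenshots for each text query by cosine similarity and report
    Recall@k (share of a query's relevant items found in the top k) and
    HIT@k (share of queries with at least one relevant item in the top k).
    `relevant` is a list of sets of ground-truth image indices per query."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    top_k = np.argsort(-t @ v.T, axis=1)[:, :k]

    recalls, hits = [], []
    for q, rel in enumerate(relevant):
        found = len(rel & set(top_k[q].tolist()))
        recalls.append(found / len(rel))
        hits.append(1.0 if found > 0 else 0.0)
    return float(np.mean(recalls)), float(np.mean(hits))
```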
https://arxiv.org/abs/2405.00145
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
https://arxiv.org/abs/2404.19753
Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original image and a reconstructed image generated by a text-to-image model from the caption; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
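A sketch of the CLIP-Image-Score idea as described above, assuming hypothetical `clip_image_encoder` and `text_to_image` callables (the specific models are not part of this sketch):

```python
import torch.nn.functional as F

def clip_image_score(original_image, caption, clip_image_encoder, text_to_image):
    """Regenerate an image from the caption with a text-to-image model, then
    compare CLIP embeddings of the original and reconstructed images."""
    reconstructed = text_to_image(caption)
    e_orig = F.normalize(clip_image_encoder(original_image), dim=-1)
    e_rec = F.normalize(clip_image_encoder(reconstructed), dim=-1)
    return (e_orig * e_rec).sum(dim=-1)  # cosine similarity; higher is better
```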
https://arxiv.org/abs/2404.19752
Anomaly synthesis is an effective method for augmenting abnormal samples for training. However, current anomaly synthesis methods predominantly rely on texture information as input, which limits the fidelity of synthesized abnormal samples, because texture information is insufficient to correctly depict the pattern of anomalies, especially logical anomalies. To surmount this obstacle, we present the AnomalyXFusion framework, designed to harness multi-modality information to enhance the quality of synthesized abnormal samples. The AnomalyXFusion framework comprises two distinct yet synergistic modules: the Multi-modal In-Fusion (MIF) module and the Dynamic Dif-Fusion (DDF) module. The MIF module refines modality alignment by aggregating and integrating various modality features into a unified embedding space, termed X-embedding, which includes image, text, and mask features. Concurrently, the DDF module facilitates controlled generation through an adaptive adjustment of the X-embedding conditioned on the diffusion steps. In addition, to reveal the multi-modality representational power of AnomalyXFusion, we propose a new dataset, called MVTec Caption. More precisely, MVTec Caption adds 2.2k accurate image-mask-text annotations for the MVTec AD and LOCO datasets. Comprehensive evaluations demonstrate the effectiveness of AnomalyXFusion, especially regarding the fidelity and diversity of logical anomalies. Project page: http://github.com/hujiecpp/MVTec-Caption
https://arxiv.org/abs/2404.19444
In patent prosecution, image-based retrieval systems for identifying similarities between current patent images and prior art are pivotal to ensure the novelty and non-obviousness of patent applications. Despite their growing popularity in recent years, existing attempts, while effective at recognizing images within the same patent, fail to deliver practical value due to their limited generalizability in retrieving relevant prior art. Moreover, this task inherently involves the challenges posed by the abstract visual features of patent images, the skewed distribution of image classifications, and the semantic information of image descriptions. Therefore, we propose a language-informed, distribution-aware multimodal approach to patent image feature learning, which enriches the semantic understanding of patent images by integrating Large Language Models and improves the performance on underrepresented classes with our proposed distribution-aware contrastive losses. Extensive experiments on the DeepPatent2 dataset show that our proposed method achieves state-of-the-art or comparable performance in image-based patent retrieval with mAP +53.3%, Recall@10 +41.8%, and MRR@10 +51.9%. Furthermore, through an in-depth user analysis, we explore how our model aids patent professionals in their image retrieval efforts, highlighting the model's real-world applicability and effectiveness.
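The abstract does not spell out the distribution-aware contrastive losses; one plausible instantiation, shown purely as an assumption, is an InfoNCE loss whose per-sample weights grow as class frequency shrinks:

```python
import torch
import torch.nn.functional as F

def frequency_weighted_info_nce(img_emb, txt_emb, labels, class_counts,
                                temperature: float = 0.07):
    """InfoNCE over matched image/text pairs with inverse-class-frequency
    weights, so underrepresented classes contribute more to the loss.
    This is only an illustrative reading, not the paper's exact objective."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb), device=img_emb.device)
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = 1.0 / class_counts[labels].float()       # rare classes weigh more
    weights = weights * len(weights) / weights.sum()   # keep the loss scale
    return (weights * per_sample).mean()
```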
https://arxiv.org/abs/2404.19360
There are a thousand ways to caption an image. Contrastive Language Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks, even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% across zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate an improvement of 6.0% on zero-shot retrieval on MS-COCO. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.
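A rough sketch of the text-conditioned mixing described above: the text embedding forms an attention query that pools the set of visual tokens into a single representation (the module structure and dimensions are assumptions, not Llip's exact design):

```python
import torch
import torch.nn as nn

class TextConditionedPooling(nn.Module):
    """Mixes a set of visual tokens into one vector with attention weights
    derived from the text embedding (Llip-style mixing, illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # text embedding -> attention query
        self.key = nn.Linear(dim, dim)    # visual tokens  -> attention keys

    def forward(self, visual_tokens: torch.Tensor, text_emb: torch.Tensor):
        # visual_tokens: (B, N, D), text_emb: (B, D)
        q = self.query(text_emb).unsqueeze(1)                       # (B, 1, D)
        k = self.key(visual_tokens)                                 # (B, N, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return (attn @ visual_tokens).squeeze(1)                    # (B, D)
```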
https://arxiv.org/abs/2405.00740
Effective communication is paramount for the inclusion of deaf individuals in society. However, persistent communication barriers due to limited Sign Language (SL) knowledge hinder their full participation. In this context, Sign Language Recognition (SLR) systems have been developed to improve communication between signing and non-signing individuals. In particular, the problem of recognizing isolated signs (Isolated Sign Language Recognition, ISLR) is of great relevance to the development of vision-based SL search engines, learning tools, and translation systems. This work proposes an ISLR approach where body, hand, and facial landmarks are extracted throughout time and encoded as 2-D images. These images are processed by a convolutional neural network, which maps the visual-temporal information into a sign label. Experimental results demonstrate that our method surpassed the state-of-the-art in terms of performance metrics on two widely recognized datasets in Brazilian Sign Language (LIBRAS), the primary focus of this study. In addition to being more accurate, our method is more time-efficient and easier to train due to its reliance on a simpler network architecture and solely RGB data as input.
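One simple way to turn landmark trajectories into a CNN-friendly 2-D image, with time on one axis and landmark index on the other (the exact layout and normalization used by the paper may differ):

```python
import numpy as np

def landmarks_to_image(landmarks: np.ndarray) -> np.ndarray:
    """Encode a (T, K, 2) sequence of K landmarks over T frames as a
    (K, T, 3) array: channels hold normalized x, y and a zero padding
    channel, so a standard image CNN can consume it."""
    xy = landmarks - landmarks.min(axis=(0, 1), keepdims=True)
    xy = xy / (xy.max(axis=(0, 1), keepdims=True) + 1e-8)   # scale to [0, 1]
    T, K, _ = landmarks.shape
    img = np.zeros((K, T, 3), dtype=np.float32)
    img[..., 0] = xy[..., 0].T   # x coordinates
    img[..., 1] = xy[..., 1].T   # y coordinates
    return img
```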
https://arxiv.org/abs/2404.19148
Remote Sensing Image Change Captioning (RSICC) aims to identify surface changes in multi-temporal remote sensing images and describe them in natural language. Current methods typically rely on an encoder-decoder architecture and focus on designing a sophisticated neck to process the bi-temporal features extracted by the backbone. Recently, State Space Models (SSMs), especially Mamba, have demonstrated outstanding performance in many fields, owing to their efficient feature-selective modelling capability. However, their potential in the RSICC task remains unexplored. In this paper, we introduce Mamba into RSICC and propose a novel approach called RSCaMa (Remote Sensing Change Captioning Mamba). Specifically, we utilize Siamese backbones to extract bi-temporal features, which are then processed through multiple CaMa layers consisting of Spatial Difference-guided SSM (SD-SSM) and Temporal Traveling SSM (TT-SSM). SD-SSM uses differential features to enhance change perception, while TT-SSM promotes bi-temporal interactions in a token-wise cross-scanning manner. Experimental results validate the effectiveness of the CaMa layers and demonstrate the superior performance of RSCaMa, as well as the potential of Mamba in the RSICC task. Additionally, we systematically compare the effects of three language decoders: Mamba, a GPT-style decoder with a causal attention mechanism, and a Transformer decoder with a cross-attention mechanism. This provides valuable insights for future RSICC research. The code will be available at this https URL
https://arxiv.org/abs/2404.18895
Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries and retrieving the top-relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision-language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground-truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
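The multi-turn loop described above could be sketched as follows, with `retrieve`, `captioner`, `denoiser_llm`, and `get_feedback` standing in as placeholders for components the paper does not name this way:

```python
def interactive_retrieval(query, retrieve, captioner, denoiser_llm,
                          get_feedback, max_turns: int = 3):
    """Retrieve, collect user relevance feedback, caption the chosen image
    with a VLM, let an LLM denoise the expanded query, and retrieve again."""
    for _ in range(max_turns):
        results = retrieve(query)
        chosen = get_feedback(results)   # user picks the most relevant image
        if chosen is None:               # user is satisfied (or gives up)
            return results
        caption = captioner(chosen)
        query = denoiser_llm(
            f"Original query: {query}\n"
            f"Caption of a relevant image: {caption}\n"
            "Rewrite the query, keeping only details consistent with both."
        )
    return retrieve(query)
```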
https://arxiv.org/abs/2404.18746
Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when encountering the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions. MID-M achieves comparable or superior performance to task-specific fine-tuned LMMs and other general-domain models, without extensive domain-specific training or pre-training on multimodal data, and with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM development. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.
https://arxiv.org/abs/2405.01591
Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g. time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, reveal that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
https://arxiv.org/abs/2404.18532
Multimodal machine translation (MMT) is a challenging task that seeks to improve translation quality by incorporating visual information. However, recent studies have indicated that the visual information provided by existing MMT datasets is insufficient, causing models to disregard it and overestimate their capabilities. This issue presents a significant obstacle to the development of MMT research. This paper presents a novel solution to this issue by introducing 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese, each with corresponding images. Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets. We utilize a word sense disambiguation model to select ambiguous data from vision-and-language datasets, resulting in a more challenging dataset. We further benchmark several state-of-the-art MMT models on our proposed dataset. Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets. Our work provides a valuable resource for researchers in the field of multimodal learning and encourages further exploration in this area. The data, code and scripts are freely available at this https URL.
https://arxiv.org/abs/2404.18413
Vision Transformers are at the heart of the current surge of interest in foundation models for histopathology. They process images by breaking them into smaller patches following a regular grid, regardless of their content. Yet, not all parts of an image are equally relevant for its understanding. This is particularly true in computational pathology where background is completely non-informative and may introduce artefacts that could mislead predictions. To address this issue, we propose a novel method that explicitly masks background in Vision Transformers' attention mechanism. This ensures tokens corresponding to background patches do not contribute to the final image representation, thereby improving model robustness and interpretability. We validate our approach using prostate cancer grading from whole-slide images as a case study. Our results demonstrate that it achieves comparable performance with plain self-attention while providing more accurate and clinically meaningful attention heatmaps.
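A minimal sketch of how background patches could be excluded inside a Vision Transformer's self-attention, assuming a boolean per-token background mask is available (shapes and names are illustrative, not the paper's exact implementation):

```python
import torch

def background_masked_attention(q, k, v, background_mask):
    """Scaled dot-product attention that ignores background patches.
    q, k, v: (B, heads, N, d); background_mask: (B, N), True for background.
    Background keys get -inf logits, so they cannot contribute to any
    token's output or to the final image representation."""
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5          # (B, h, N, N)
    logits = logits.masked_fill(background_mask[:, None, None, :], float("-inf"))
    return torch.softmax(logits, dim=-1) @ v
```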
https://arxiv.org/abs/2404.18152
Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning. This poses a significant challenge in practice, as acquiring person images from surveillance videos is relatively easy, while obtaining annotated texts is difficult. This paper undertakes a pioneering initiative to explore TBPS under the semi-supervised setting, where only a limited number of person images are annotated with textual descriptions while the majority of images lack annotations. We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS. The generation stage enriches the annotated data by applying an image captioning model to generate pseudo-texts for unannotated images. The retrieval stage then performs fully-supervised retrieval learning using the augmented data. Significantly, considering the noise interference of the pseudo-texts on retrieval learning, we propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data. The framework integrates two key strategies: Hybrid Patch-Channel Masking (PC-Mask) to refine the model architecture, and Noise-Guided Progressive Training (NP-Train) to enhance the training process. PC-Mask performs masking on the input data at both the patch level and the channel level to prevent overfitting to noisy supervision. NP-Train introduces a progressive training schedule based on the noise level of the pseudo-texts to facilitate noise-robust learning. Extensive experiments on multiple TBPS benchmarks show that the proposed framework achieves promising performance under the semi-supervised setting.
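A hedged sketch of the patch- and channel-level masking idea: whole patch tokens and whole feature channels are randomly zeroed so the model cannot overfit noisy pseudo-text supervision (the masking ratios and tensor layout are assumptions):

```python
import torch

def pc_mask(patch_tokens: torch.Tensor, patch_ratio: float = 0.1,
            channel_ratio: float = 0.1) -> torch.Tensor:
    """Hybrid patch/channel masking on input features of shape (B, N, D):
    drop random patch tokens and random feature channels independently."""
    B, N, D = patch_tokens.shape
    keep_patch = (torch.rand(B, N, 1, device=patch_tokens.device) > patch_ratio).float()
    keep_chan = (torch.rand(B, 1, D, device=patch_tokens.device) > channel_ratio).float()
    return patch_tokens * keep_patch * keep_chan
```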
https://arxiv.org/abs/2404.18106
In today's world, image processing plays a crucial role across various fields, from scientific research to industrial applications. One particularly exciting application is image captioning. The potential impact of effective image captioning is vast: it can significantly boost the accuracy of search engines, making it easier to find relevant information, and it can greatly enhance accessibility for visually impaired individuals, providing them with a more immersive experience of digital content. However, despite its promise, image captioning presents several challenges. One major hurdle is extracting meaningful visual information from images and transforming it into coherent language. This requires bridging the gap between the visual and linguistic domains, a task that demands sophisticated algorithms and models. Our project addresses these challenges by developing an automatic image captioning architecture that combines the strengths of convolutional neural networks (CNNs) and encoder-decoder models. The CNN extracts visual features from images, and captions are then generated with the help of the encoder-decoder framework. We also carried out a performance comparison of pre-trained CNN models, experimenting with multiple architectures to understand their performance variations. In our quest for optimization, we explored the integration of frequency regularization techniques to compress the AlexNet and EfficientNetB0 models, aiming to see whether the compressed models could maintain their effectiveness in generating image captions while being more resource-efficient.
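A minimal skeleton of such a CNN-encoder plus sequence-decoder captioner (using a ResNet-18 backbone and an LSTM decoder for brevity; the project's actual AlexNet/EfficientNetB0 variants and hyperparameters would differ):

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Illustrative CNN encoder + LSTM decoder for image captioning."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop fc head
        self.img_proj = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor):
        feats = self.encoder(images).flatten(1)        # (B, 512) visual features
        img_tok = self.img_proj(feats).unsqueeze(1)    # (B, 1, E) image "token"
        words = self.embed(captions)                   # (B, L, E) word embeddings
        seq = torch.cat([img_tok, words], dim=1)       # prepend image token
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                        # (B, L+1, vocab) logits
```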
https://arxiv.org/abs/2404.18062
Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that do not accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, thereby reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LLaVA-Med and BiomedGPT, achieving SoTA performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves to be robust against noisy images. A qualitative case study emphasizes the significant advancements towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.
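The combined objective could look roughly like the sketch below: a standard causal-LM loss plus an alignment term between the pooled image representation and the pooled contextual representation of the generated report (the pooling choice and weight `alpha` are assumptions, not the paper's exact loss):

```python
import torch.nn.functional as F

def self_refining_loss(lm_logits, target_ids, image_repr, text_repr, alpha: float = 0.5):
    """Causal-LM loss plus a cosine-alignment term between pooled image
    features (B, P, D) and pooled generated-text features (B, L, D)."""
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    img = F.normalize(image_repr.mean(dim=1), dim=-1)   # pool patch features
    txt = F.normalize(text_repr.mean(dim=1), dim=-1)    # pool token features
    align_loss = 1.0 - (img * txt).sum(dim=-1).mean()   # 1 - cosine similarity
    return lm_loss + alpha * align_loss
```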
https://arxiv.org/abs/2404.17912
Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.
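One way such a temporal-focused objective could be realized, shown only as an assumption about the general idea: each audio clip is contrasted against its correct caption and a temporally reordered caption used as a hard negative:

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(audio_emb, pos_text_emb, neg_text_emb,
                              temperature: float = 0.07):
    """Audio is pulled towards its true caption and pushed away from a
    temporally reordered caption (e.g. 'a dog barks then a car passes'
    vs. 'a car passes then a dog barks') serving as a hard negative."""
    a = F.normalize(audio_emb, dim=-1)
    pos = F.normalize(pos_text_emb, dim=-1)
    neg = F.normalize(neg_text_emb, dim=-1)
    logits = torch.stack([(a * pos).sum(-1), (a * neg).sum(-1)], dim=1) / temperature
    targets = torch.zeros(len(a), dtype=torch.long, device=a.device)  # index 0 = positive
    return F.cross_entropy(logits, targets)
```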
https://arxiv.org/abs/2404.17806
Vision-language models have become increasingly powerful for tasks that require an understanding of both visual and linguistic elements, bridging the gap between these modalities. In the context of multimodal clinical AI, there is a growing need for models that possess domain-specific knowledge, as existing models often lack the expertise required for medical applications. In this paper, we take brain abnormalities as an example to demonstrate how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset from case reports and published journals and subsequently constructing a high-performance vision-language model tailored to specific medical tasks. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain. We evaluated the resulting model with quantitative and qualitative intrinsic evaluations. The resulting dataset and our code can be found at this https URL
https://arxiv.org/abs/2404.17779