We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
我们研究了视觉语言模型(VLMs)的内部表示,以解决在模型大小和训练方面取得进展但仍然存在的一种普遍挑战:幻觉。我们将VLMs的内部图像表示投影到其语言词汇中,并观察到真实物体上的输出概率比幻觉物体上的输出概率更自信。此外,我们还使用这些输出概率将真实物体进行空间局部化。在此基础上,我们引入了一种知识消逝算法,通过将图像特征与幻觉物体特征之间进行线性正交操作来消除幻觉。我们在COCO2014数据集上展示了针对模型 latent 表示的定向修改可以减少幻觉,同时保持性能。我们的研究结果表明,对VLMs latent 表示的更深入了解可以提高可靠性并实现诸如零 shot分割等新颖功能。
https://arxiv.org/abs/2410.02762
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
近年来在多模态模型的进步突出了重构式字幕的价值,然而关键挑战仍然存在。例如,虽然合成字幕通常提供更好的质量和图像文本对齐,但并不清楚它们是否可以完全取代AltTexts:合成字幕及其与原始爬取的AltText之间的相互作用仍然不太清楚。此外,不同的多模态基础模型可能对特定的字幕格式有独特的偏好,但努力确定每个模型的最佳重构式仍然有限。在这项工作中,我们提出了一个新颖、可控制和可扩展的 captioning 管道,旨在为各种多模态模型生成定制化的字幕格式。通过将 Short Synthetic Captions(SSC)与Dense Synthetic Captions(DSC+)作为案例研究,我们系统地探讨了它们对不同模型(如CLIP、多模态LLM和扩散模型)与AltText之间的影响。我们的研究结果表明,将人造字幕和原始文本相结合的半监督方法可以优于仅使用人造字幕,提高两者的对齐度和性能,每个模型都表现出对特定字幕格式的偏好。这种全面的分析为优化字幕策略提供了宝贵的洞见,从而推动了多模态基础模型的预训练。
https://arxiv.org/abs/2410.02740
There is a scarcity of multilingual vision-language models that properly account for the perceptual differences that are reflected in image captions across languages and cultures. In this work, through a multimodal, multilingual retrieval case study, we quantify the existing lack of model flexibility. We empirically show performance gaps between training on captions that come from native German perception and captions that have been either machine-translated or human-translated from English into German. To address these gaps, we further propose and evaluate caption augmentation strategies. While we achieve mean recall improvements (+1.3), gaps still remain, indicating an open area of future work for the community.
目前的跨语言视觉语言模型屈指可数,这些模型没有充分考虑到不同语言和文化下图像注释中反映的感知差异。在这项工作中,通过多模态、多语言的检索案例研究,我们定量了现有模型灵活性的缺乏。我们通过实证研究展示了来自德语母语的注释训练和从英语翻译成德语的注释训练之间的性能差距。为了解决这些差距,我们进一步提出了并评估了标题增强策略。虽然我们实现了平均召回率的提高(+1.3),但差距仍然存在,表明社区未来还需要在这个领域进行更多的工作。
https://arxiv.org/abs/2410.02027
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate their robustness to variations in specific visual factors. Second, we assess two critical safety objectives--confidence uncertainty and out-of-distribution detection--beyond mere classification accuracy. Third, we evaluate the finesse with which CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between vision and language encoders within modern large multimodal models (LMMs) that utilize CLIP as the visual backbone, focusing on how this interaction impacts classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder in CLIP plays a significant role in their robustness against 3D corruption. CLIP models tend to exhibit a bias towards shape when making predictions. Moreover, this bias tends to diminish after fine-tuning on ImageNet. Vision-language models like LLaVA, leveraging the CLIP vision encoder, could exhibit benefits in classification performance for challenging categories over CLIP alone. Our findings are poised to offer valuable guidance for enhancing the robustness and reliability of CLIP models.
对比语言图像预训练(CLIP)模型在多样分布变化下的零散拍摄分类方面的显著潜力。在现有整体分类鲁棒性评估的基础上,本文旨在通过引入几个新的观点,对CLIP进行更全面的评估。首先,我们研究它们对特定视觉因素变化的鲁棒性。其次,我们评估两个关键的安全目标——信心不确定性和离群检测——超越了简单的分类准确性。第三,我们评估CLIP模型在图像和文本模态之间桥梁的精致程度。第四,我们将我们的研究扩展到CLIP模型在现代大型多模态模型(LMM)上的视觉先验,重点关注这种交互如何影响分类鲁棒性。在每一个方面,我们考虑六个因素对CLIP模型的影响:模型架构、训练分布、训练集大小、微调、对比损失和测试时间提示。我们的研究揭示了CLIP模型中视觉编码器架构对对抗3D损坏的鲁棒性的重要作用。CLIP模型在预测时倾向于展示形状偏见。此外,这种偏见在经过ImageNet的微调后通常会减小。利用CLIP视觉编码器的视觉-语言模型(如LLaVA)在具有挑战性类别的分类表现上可能优于仅使用CLIP的模型。我们的研究有望为提高CLIP模型的稳健性和可靠性提供宝贵的指导。
https://arxiv.org/abs/2410.01534
The emergence of Vision-Language Models (VLMs) represents a significant advancement in integrating computer vision with Large Language Models (LLMs) to generate detailed text descriptions from visual inputs. Despite their growing importance, the security of VLMs, particularly against backdoor attacks, is under explored. Moreover, prior works often assume attackers have access to the original training data, which is often unrealistic. In this paper, we address a more practical and challenging scenario where attackers must rely solely on Out-Of-Distribution (OOD) data. We introduce VLOOD (Backdooring Vision-Language Models with Out-of-Distribution Data), a novel approach with two key contributions: (1) demonstrating backdoor attacks on VLMs in complex image-to-text tasks while minimizing degradation of the original semantics under poisoned inputs, and (2) proposing innovative techniques for backdoor injection without requiring any access to the original training data. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of VLOOD, revealing a critical security vulnerability in VLMs and laying the foundation for future research on securing multimodal models against sophisticated threats.
视觉语言模型(VLMs)的出现代表了对将计算机视觉与大型语言模型(LLMs)集成以生成详细文本描述视觉输入的显著进步。尽管它们的重要性不断增加,但VLMs的安全性,尤其是对后门攻击的防御,仍缺乏深入研究。此外,之前的 works 通常假设攻击者具有访问原始训练数据的能力,这在实际情况下并不现实。在本文中,我们讨论了一个更加实际和具有挑战性的场景,即攻击者只能依赖外部数据(OD)。我们引入了 VLOOD(在分布式数据上进行后门攻击的视觉语言模型),一种具有两个关键贡献的创新方法:(1)在复杂图像到文本任务中证明对 VLMs 的后门攻击,同时最小化在毒化输入下的原始语义贬损;(2)提出了一种不需要访问原始训练数据的创新后门注入技术。我们对图像标题和视觉问答(VQA)任务的评估证实了 VLOOD的有效性,揭示了 VLMs 中的关键安全漏洞,为未来研究在复杂威胁面前保护多模态模型奠定了基础。
https://arxiv.org/abs/2410.01264
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.
我们提出了MM1.5,一种新型的多模态大型语言模型(MLLM),旨在提高在文本丰富的图像理解、视觉参考和接地以及多图像推理方面的能力。在MM1架构的基础上,MM1.5采用了一种以数据为中心的训练方法,系统地探索了不同数据混合在整个模型训练生命周期中的影响。这包括高质量的OCR数据和合成字幕,用于持续预训练,以及用于监督微调的优化视觉指令数据混合。我们的模型从1B到30B个参数,涵盖密度和专家混合(MoE)变体,并证明了精细的数据选择和训练策略在小规模上也可以产生强大的性能(1B和3B)。此外,我们引入了两个专用变体:MM1.5-Video,用于视频理解,MM1.5-UI,专为移动UI理解而设计。通过广泛的实证研究和抽象,我们提供了对训练过程和决策的详细洞察,为未来MLLM发展的研究提供了宝贵的指导。
https://arxiv.org/abs/2409.20566
The human visual system is capable of processing continuous streams of visual information, but how the brain encodes and retrieves recent visual memories during continuous visual processing remains unexplored. This study investigates the capacity of working memory to retain past information under continuous visual stimuli. And then we propose a new task Memory Disentangling, which aims to extract and decode past information from fMRI signals. To address the issue of interference from past memory information, we design a disentangled contrastive learning method inspired by the phenomenon of proactive interference. This method separates the information between adjacent fMRI signals into current and past components and decodes them into image descriptions. Experimental results demonstrate that this method effectively disentangles the information within fMRI signals. This research could advance brain-computer interfaces and mitigate the problem of low temporal resolution in fMRI.
人类视觉系统能够处理连续的视觉信息,但大脑在连续视觉处理过程中如何编码和检索最近的视觉记忆仍然是一个未探索的问题。这项研究调查了在连续视觉刺激下工作记忆保留过去信息的能力。然后我们提出了一个新的任务 Memory Disentangling,旨在提取和解码来自fMRI信号的过去信息。为了应对过去记忆信息的影响,我们设计了一种基于前馈干扰现象的解离卷积学习方法。这种方法将相邻的fMRI信号之间的信息分离成当前和过去组件,并将其解码成图像描述。实验结果表明,这种方法有效地解离了fMRI信号中的信息。这项研究可能推动脑机接口的发展,减轻fMRI中时间分辨率低的问题。
https://arxiv.org/abs/2409.20428
Zero-shot inference, where pre-trained models perform tasks without specific training data, is an exciting emergent ability of large models like CLIP. Although there has been considerable exploration into enhancing zero-shot abilities in image captioning (IC) for popular datasets such as MSCOCO and Flickr8k, these approaches fall short with fine-grained datasets like CUB, FLO, UCM-Captions, and Sydney-Captions. These datasets require captions to discern between visually and semantically similar classes, focusing on detailed object parts and their attributes. To overcome this challenge, we introduce TRaining-Free Object-Part Enhancement (TROPE). TROPE enriches a base caption with additional object-part details using object detector proposals and Natural Language Processing techniques. It complements rather than alters the base caption, allowing seamless integration with other captioning methods and offering users enhanced flexibility. Our evaluations show that TROPE consistently boosts performance across all tested zero-shot IC approaches and achieves state-of-the-art results on fine-grained IC datasets.
零 shot推理,其中预训练模型在没有特定训练数据的情况下执行任务,是大型模型如CLIP令人兴奋的新兴能力。尽管已经对增强在图像标题(IC)流行数据集(如MSCOCO和Flickr8k)中的零 shot能力进行了大量探索,但这些问题在细粒度数据集(如CUB、FLO、UCM-Captions和Sydney-Captions)上仍然不足。这些数据集需要捕获视觉和语义上相似的类别的 caption,重点关注详细的对象部分及其属性。为了克服这一挑战,我们引入了TRaining-Free Object-Part Enhancement(TROPE)。TROPE通过使用目标检测提议和自然语言处理技术为基线图像标题增加强劲的物体部分细节。它补充了基线标题,而不是改变它,允许与其他 captioning 方法无缝集成,并为用户提供了更大的灵活性。我们的评估显示,TROPE在所有测试的零 shot IC方法中始终提高了性能,并在细粒度 IC数据集上实现了与最先进方法相当的结果。
https://arxiv.org/abs/2409.19960
Visual-language models have advanced the development of universal models, yet their application in medical imaging remains constrained by specific functional requirements and the limited data. Current general-purpose models are typically designed with task-specific branches and heads, which restricts the shared feature space and the flexibility of model. To address these challenges, we have developed a decomposed-composed universal medical imaging paradigm (UniMed) that supports tasks at all levels. To this end, we first propose a decomposed decoder that can predict two types of outputs -- pixel and semantic, based on a defined input queue. Additionally, we introduce a composed decoder that unifies the input and output spaces and standardizes task annotations across different levels into a discrete token format. The coupled design of these two components enables the model to flexibly combine tasks and mutual benefits. Moreover, our joint representation learning strategy skilfully leverages large amounts of unlabeled data and unsupervised loss, achieving efficient one-stage pretraining for more robust performance. Experimental results show that UniMed achieves state-of-the-art performance on eight datasets across all three tasks and exhibits strong zero-shot and 100-shot transferability. We will release the code and trained models upon the paper's acceptance.
视觉语言模型已经推动了通用模型的进步,但在医学影像应用中,它们的应用仍然受到具体功能要求和有限数据的限制。当前的通用模型通常设计为任务特定的分支和头,这限制了共享特征空间和模型的灵活性。为了应对这些挑战,我们开发了一种分层的组合统一医疗影像范式(UniMed),支持所有级别的任务。为此,我们首先提出了一个分层的解码器,可以根据定义的输入队列预测两种类型的输出——像素和语义。此外,我们还引入了一个组合解码器,统一输入和输出空间,并将不同级别的任务注释标准化为离散标记格式。这两组件的协同设计使模型能够灵活地组合任务并实现相互促进。此外,我们的联合表示学习策略巧妙地利用了大量的未标记数据和无监督损失,实现了更高效的单阶段预训练,从而实现稳健的性能。实验结果表明,UniMed在所有三个任务上的所有数据集都取得了最先进的性能,并表现出强大的零 shot 和 100-shot 转移能力。在论文接受后,我们将发布代码和训练好的模型。
https://arxiv.org/abs/2409.19890
In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which often introduces significant latency, computational overhead, and errors. Current advanced image-to-text approaches, which bypass OCR, typically yield plain text outputs without corresponding vision grounding. In this paper, we introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. Distinctively, STNet utilizes a unique <see> token to observe pertinent image areas, aided by a decoder that interprets physical coordinates linked to this token. Positioned at the outset of the answer text, the <see> token allows the model to first see--observing the regions of the image related to the input question--and then tell--providing articulated textual responses. To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets. Leveraging the advanced text processing prowess of GPT-4, we develop the TVG (TableQA with Vision Grounding) dataset, which not only provides text-based Question Answering (QA) pairs but also incorporates precise vision grounding for these pairs. Our approach demonstrates substantial advancements in KIE performance, achieving state-of-the-art results on publicly available datasets such as CORD, SROIE, and DocVQA. The code will also be made publicly available.
在数字时代,理解和视觉丰富文档的能力至关重要。传统的关键词提取(KIE)方法主要依赖于光学字符识别(OCR),这通常引入显著的延迟、计算开销和错误。当前的先进图像到文本方法通常没有相应的视觉支撑产生纯文本输出。在本文中,我们介绍了STNet(See then Tell Net),一种新颖的端到端模型,旨在提供相关视觉支撑的准确答案。与传统方法不同,STNet利用一个独特的<see>标记来观察相关图像区域,并通过一个解码器解释与该标记相关的物理坐标。定位到答案文本的开头,<see>标记允许模型首先观察输入问题相关的图像区域,然后告诉--提供明确的文本回答。为了提高模型的观察能力,我们收集了广泛的结构化表格识别数据集。利用GPT-4的先进文本处理能力,我们开发了TVG(表QA with Vision Grounding)数据集,它不仅提供了基于文本的问答对,还包含了这些对的精确视觉支撑。我们的方法在关键词提取(KIE)性能上取得了显著的进步,在诸如CORD、SROIE和DocVQA等公开可用数据集上实现了最先进的结果。代码也将公开发布。
https://arxiv.org/abs/2409.19573
Multimodal image-text contrastive learning has shown that joint representations can be learned across modalities. Here, we show how leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent. We propose ContRastive Image-remote Sensing Pre-training (CRISP)$\unicode{x2014}$a new pre-training task for ground-level and aerial image representation learning of the natural world$\unicode{x2014}$and introduce Nature Multi-View (NMV), a dataset of natural world imagery including $>3$ million ground-level and aerial image pairs for over 6,000 plant taxa across the ecologically diverse state of California. The NMV dataset and accompanying material are available at this http URL.
多模态图像-文本对比学习表明,可以在模态之间学习联合表示。在这里,我们证明了利用对比学习跨模态学习可以提高物种识别下游细粒度分类性能,即使其中一个视图是缺失的。我们提出了ContRastive Image-remote Sensing Pre-training (CRISP)$\unicode{x2014}$a新的预训练任务,用于自然世界地表和航空影像表示学习,并引入了Nature Multi-View(NMV)数据集,这是一个由加州生态多样状态下超过6000个植物类群的第1和第2张地面和航空图像对组成的自然世界图像数据集,可用于此链接。
https://arxiv.org/abs/2409.19439
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications due to their aligned latent space. However, this practice has left powerful unimodal encoders for both vision and language underutilized in multimodal applications which raises a key question: Is there a plausible way to connect unimodal backbones for zero-shot vision-language tasks? To this end, we propose a novel approach that aligns vision and language modalities using only projection layers on pretrained, frozen unimodal encoders. Our method exploits the high semantic similarity between embedding spaces of well-trained vision and language models. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, utilizing DINOv2 and All-Roberta-Large text encoder, achieves 76\(\%\) accuracy on ImageNet with a 20-fold reduction in data and 65 fold reduction in compute requirements. The proposed framework enhances the accessibility of model development while enabling flexible adaptation across diverse scenarios, offering an efficient approach to building multimodal models by utilizing existing unimodal architectures. Code and datasets will be released soon.
近年来,像CLIP这样的对比式多模态视觉语言模型已经展示了在开放世界中具有稳健的语义理解,成为视觉语言应用的标准图像骨干。然而,这种做法使视觉和语言的双重编码器在多模态应用中都被充分利用,这引发了一个关键问题:在零散 shot 视觉语言任务中,是否有一种合理的方法来连接零散视觉和语言编码器?为此,我们提出了一个新颖的方法,它通过仅在预训练和冻存的无模态编码器的投影层上进行视觉和语言模式对齐来对齐嵌入空间。我们的方法利用了预训练视觉和语言模型之间高语义相似性。它涉及选择语义相似的编码器在嵌入空间中,策划概念丰富的图像-描述对数据集,并训练简单的MLP投影器。我们在12个零散 shot分类数据集和2个图像文本检索数据集上评估了我们的方法。我们的最佳模型,使用DINOv2和All-Roberta-Large文本编码器,在ImageNet上的准确率为76\(\%\),数据减少了20倍,计算需求减少了65倍。所提出的框架通过利用现有的无模态架构增强了模型开发的可访问性,同时允许在不同场景下进行灵活的适应,将一种有效的构建多模态模型的方法。代码和数据集将很快发布。
https://arxiv.org/abs/2409.19425
The potential for exploitation of AI models has increased due to the rapid advancement of Artificial Intelligence (AI) and the widespread use of platforms like Model Zoo for sharing AI models. Attackers can embed malware within AI models through steganographic techniques, taking advantage of the substantial size of these models to conceal malicious data and use it for nefarious purposes, e.g. Remote Code Execution. Ensuring the security of AI models is a burgeoning area of research essential for safeguarding the multitude of organizations and users relying on AI technologies. This study leverages well-studied image few-shot learning techniques by transferring the AI models to the image field using a novel image representation. Applying few-shot learning in this field enables us to create practical models, a feat that previous works lack. Our method addresses critical limitations in state-of-the-art detection techniques that hinder their practicality. This approach reduces the required training dataset size from 40000 models to just 6. Furthermore, our methods consistently detect delicate attacks of up to 25% embedding rate and even up to 6% in some cases, while previous works were only shown to be effective for a 100%-50% embedding rate. We employ a strict evaluation strategy to ensure the trained models are generic concerning various factors. In addition, we show that our trained models successfully detect novel spread-spectrum steganography attacks, demonstrating the models' impressive robustness just by learning one type of attack. We open-source our code to support reproducibility and enhance the research in this new field.
由于人工智能(AI)的快速发展以及像Model Zoo这样的平台广泛用于共享AI模型,利用 steganographic 技术在AI 模型中嵌入恶意软件的可能性已经增加。攻击者可以通过 steganographic 技术在 AI 模型中嵌入恶意软件,利用这些模型的庞大尺寸来隐藏恶意数据,并用于邪恶目的,例如远程代码执行。确保 AI 模型的安全性是一个正在迅速发展的研究领域,对于保护依赖 AI 技术的组织和个人至关重要。 本研究利用将 AI 模型转移到图像领域的新型图像表示来利用经过良好研究的小样本学习技术。应用小样本学习技术在这个领域使我们能够创建实用的模型,而以前的工作则缺乏这一功能。我们的方法解决了现有检测技术中关键的局限性,阻碍了其实用性的提高。 此外,我们的方法能够检测到高达 25% 的嵌入率的精细攻击,甚至有些情况下能够检测到 6% 的攻击。而以前的工作只是在 100% 至 50% 的嵌入率范围内证明有效的。我们采用严格的评估策略来确保训练好的模型在各种因素上都是通用的。 此外,我们还证明了我们的训练好的模型能够成功检测到新颖的扩散式 steganography 攻击,这表明仅通过学习一种类型的攻击,模型就能够展示出令人印象深刻的鲁棒性。 我们开源我们的代码,以支持可重复性并促进该领域的研究。
https://arxiv.org/abs/2409.19310
In this work, we address the challenge of developing automatic evaluation metrics for image captioning, with a particular focus on robustness against hallucinations. Existing metrics are often inadequate for handling hallucinations, primarily due to their limited ability to compare candidate captions with multifaceted reference captions. To address this shortcoming, we propose DENEB, a novel supervised automatic evaluation metric specifically robust against hallucinations. DENEB incorporates the Sim-Vec Transformer, a mechanism that processes multiple references simultaneously, thereby efficiently capturing the similarity between an image, a candidate caption, and reference captions. To train DENEB, we construct the diverse and balanced Nebula dataset comprising 32,978 images, paired with human judgments provided by 805 annotators. We demonstrated that DENEB achieves state-of-the-art performance among existing LLM-free metrics on the FOIL, Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and PASCAL-50S datasets, validating its effectiveness and robustness against hallucinations.
在这项工作中,我们关注图像摘要自动评估任务的挑战,特别是在对抗性方面的能力。现有的指标通常不足以处理幻觉,主要原因是它们在比较候选摘要与多维参考摘要方面的能力有限。为了应对这一缺陷,我们提出了DENEB,一种专门针对幻觉的鲁棒自动评估指标。DENEB包括Sim-Vec Transformer,这是一种同时处理多个参考的机制,从而有效地捕捉图像、候选摘要和参考摘要之间的相似性。为了训练DENEB,我们构建了包含32,978个图像的多样且平衡的Nebula数据集,这些图像由805个注释者提供的人类判断进行标注。我们证明了DENEB在现有LLM-免费指标在FOIL、Composite、Flicker8K-Expert、Flicker8K-CF、Nebula和PASCAL-50S数据集上的表现已经达到了最先进水平,验证了其对抗幻觉的有效性和鲁棒性。
https://arxiv.org/abs/2409.19255
The emergence of Vision Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to produce detailed text descriptions based on visual inputs, yet it introduces new security vulnerabilities. Unlike prior work that centered on single modalities or classification tasks, this study introduces TrojVLM, the first exploration of backdoor attacks aimed at VLMs engaged in complex image-to-text generation. Specifically, TrojVLM inserts predetermined target text into output text when encountering poisoned images. Moreover, a novel semantic preserving loss is proposed to ensure the semantic integrity of the original image content. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of TrojVLM in maintaining original semantic content while triggering specific target text outputs. This study not only uncovers a critical security risk in VLMs and image-to-text generation but also sets a foundation for future research on securing multimodal models against such sophisticated threats.
Vision Language Models (VLMs) 的出现是一个显著的计算机视觉与大型语言模型(LLMs)整合的进步,以根据视觉输入产生详细的文本描述。然而,它也引入了新的安全漏洞。与之前关注单一模态或分类任务的工作不同,本研究引入了 TrojVLM,这是对从事复杂图像到文本生成的 VLMs 进行首次后门攻击的探索。具体来说,TrojVLM 在遇到有毒图像时,将预先确定的目标文本插入输出文本中。此外,还提出了一个新颖的语义保留损失,以确保原始图像内容的语义完整性。我们对图像标题和视觉问答(VQA)任务的评估证实了 TrojVLM 在保持原始语义内容的同时触发特定目标文本输出的有效性。这项研究不仅揭示了 VLMs 和图像到文本生成领域的一个关键安全风险,还为未来研究在对这种复杂威胁的防护上奠定了基础。
https://arxiv.org/abs/2409.19232
We present PhysGen, a novel image-to-video generation method that converts a single image and an input condition (e.g., force and torque applied to an object in the image) to produce a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that effectively captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that utilizes rigid-body physics and inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are even precisely controllable, showcasing superior results over existing data-driven image-to-video generation works through quantitative comparison and comprehensive user study. PhysGen's resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics. Project page: this https URL
我们提出了PhysGen,一种新颖的图像转视频生成方法,可以将单个图像和输入条件(例如作用于图像中物体的力和扭矩)转换为生产具有真实感、物理可信度和时间一致性的视频。我们的关键洞见是将基于模型的物理仿真与数据驱动的视频生成过程相结合,实现合理的图像空间动态。 我们系统的核心组件包括:(i)一个图像理解模块,有效捕捉图像的几何、材料和物理参数;(ii)一个图像空间动态仿真模型,利用刚体物理和推断参数模拟真实行为;(iii)一个基于生成视频扩散的图像基于渲染和精度的模块,利用生成的视频来制作具有模拟运动的视频 footage。 通过定量比较和全面用户研究,PhysGen生成的视频在物理和外观方面都是真实的,甚至可以精确控制,展示了通过定量比较和全面用户研究超过现有数据驱动图像到视频生成工作的优越性。PhysGen生成的视频可以用于各种下游应用,例如将图像转换为真实的动画,或者允许用户与图像交互并创建各种动态。项目页面:此链接
https://arxiv.org/abs/2409.18964
The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance in visual understanding tasks, leveraging their inherent capability to comprehend and generate human-like text for visual reasoning. Given the diverse nature of visual data, MultiModal Large Language Models (MM-LLMs) exhibit variations in model designing and training for understanding images, short videos, and long videos. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding. Unlike static images, short videos encompass sequential frames with both spatial and within-event temporal information, while long videos consist of multiple events with between-event and long-term temporal information. In this survey, we aim to trace and summarize the advancements of MM-LLMs from image understanding to long video understanding. We review the differences among various visual understanding tasks and highlight the challenges in long video understanding, including more fine-grained spatiotemporal details, dynamic events, and long-term dependencies. We then provide a detailed summary of the advancements in MM-LLMs in terms of model design and training methodologies for understanding long videos. Finally, we compare the performance of existing MM-LLMs on video understanding benchmarks of various lengths and discuss potential future directions for MM-LLMs in long video understanding.
大规模语言模型(LLMs)与视觉编码器的集成在视觉理解任务中最近表现出良好的性能,利用其固有的理解并生成类似于人类文本的能力来理解视觉推理。考虑到视觉数据的多样性,多模态大规模语言模型(MM-LLMs)在理解图像、短视频和长视频时,模型设计和训练存在差异和独特挑战。我们的论文重点关注长视频理解与静态图像和短视频理解之间的重大差异和独特挑战。与静态图像不同,短视频包含具有空间和事件时间信息的连续帧,而长视频由多个事件组成,包括事件之间和长期时间信息。在这次调查中,我们旨在追踪和总结MM-LLMs在图像理解到长视频理解方面的进步。我们回顾了各种视觉理解任务之间的差异,并重点关注长视频理解中的挑战,包括更细粒度的空间和事件时间信息、动态事件和长期依赖关系。然后,我们详细总结了MM-LLMs在理解长视频方面的模型设计和训练方法。最后,我们比较了各种长度视频理解基准测试中现有MM-LLM的表现,并讨论了MM-LLM在长视频理解方面的潜在未来发展方向。
https://arxiv.org/abs/2409.18938
Recently, there has been a growing interest in Multimodal Large Language Models (MLLMs) due to their remarkable potential in various tasks integrating different modalities, such as image and text, as well as applications such as image captioning and visual question answering. However, such models still face challenges in accurately captioning and interpreting specific visual concepts and classes, particularly in domain-specific applications. We argue that integrating domain knowledge in the form of an ontology can significantly address these issues. In this work, as a proof of concept, we propose a new framework that combines ontology with MLLMs to classify images of plant diseases. Our method uses concepts about plant diseases from an existing disease ontology to query MLLMs and extract relevant visual concepts from images. Then, we use the reasoning capabilities of the ontology to classify the disease according to the identified concepts. Ensuring that the model accurately uses the concepts describing the disease is crucial in domain-specific applications. By employing an ontology, we can assist in verifying this alignment. Additionally, using the ontology's inference capabilities increases transparency, explainability, and trust in the decision-making process while serving as a judge by checking if the annotations of the concepts by MLLMs are aligned with those in the ontology and displaying the rationales behind their errors. Our framework offers a new direction for synergizing ontologies and MLLMs, supported by an empirical study using different well-known MLLMs.
近年来,随着Multimodal Large Language Models(MLLMs)在各种任务中整合不同模态(如图像和文本)以及应用于图像描述和视觉问答等领域的显著潜力,人们对MLLMs产生了浓厚的兴趣。然而,此类模型在准确捕捉和解释特定视觉概念和类方面仍然面临挑战,特别是在领域特定应用中。我们认为,在形式上整合领域知识,以建立一个本体,可以显著解决这些问题。在这项工作中,作为概念证明,我们提出了一种结合本体和MLLMs的新框架,用于分类植物病害的图像。我们的方法使用现有病本体中关于植物病的概念来查询MLLMs,并从图像中提取相关视觉概念。然后,利用本体的推理能力对疾病进行分类。确保模型准确使用描述疾病的概念至关重要,在领域特定应用中。通过采用本体,我们可以协助验证这一对齐关系。此外,利用本体的推理能力提高了透明度、可解释性和决策过程的可信度,同时作为评委检查MLLMs对概念的注释是否与本体中的相同进行对齐,并显示其错误背后的理由。我们的框架为将本体和MLLMs相结合提供了一个新的方向,得到了不同著名MLLMs的实证研究支持。
https://arxiv.org/abs/2409.18753
Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people's lives, making it a vital research area for practical applications such as dietary management. Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, not only in their vast knowledge but also in their ability to handle languages naturally. While English is predominantly used, they can also support multiple languages including Japanese. This suggests that MLLMs are expected to significantly improve performance in food image understanding tasks. We fine-tuned open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of generated recipes, including ingredients and cooking procedures, using 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation demonstrates that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation. Our model achieved F1 score of 0.531, surpassing GPT-4o's F1 score of 0.481, indicating a higher level of accuracy. Furthermore, our model exhibited comparable performance to GPT-4o in generating cooking procedure text.
使用食谱数据进行食物图像理解的 research,由于数据的多样性和复杂性,一直是一个长期关注点。此外,食物与人们的联系紧密,使得其在实际应用中(如饮食管理)具有重要的研究价值。近年来,在多模态大型语言模型(MLLMs)的进步中,已经展示了惊人的能力,不仅在其广泛的知识,而且在其处理自然语言的能力。虽然英语是主要的,但它们还可以支持其他语言,包括日语。这表明,MLLMs有望在食品图像理解任务中显著提高性能。我们对日本的食谱数据集进行了微调,并将其与关闭模型GPT-4o进行了比较。然后,我们使用包含5,000个评估样本的多样的评估样本评估了生成的食谱的内容,包括食材和烹饪程序。我们的评估结果表明,基于食谱数据的开放模型优于GPT-4o,这是当前最先进的模型。我们的模型的F1得分达到了0.531,超过了GPT-4o的F1得分0.481,表明准确度更高。此外,我们的模型在生成烹饪程序文本方面与GPT-4o的表现相当。
https://arxiv.org/abs/2409.18459
Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.
近年来,在大型多模态模型(LMMs)中取得了显著的进步,大大提高了其在2D视觉理解任务中的能力,使它们能够有效处理和理解图像和视频。然而,开发具有3D意识的大规模3D场景理解LMM遇到了缺乏大型3D视觉语言数据集和强大的3D编码器的困难。在本文中,我们介绍了一个简单而有效的框架LLaVA-3D。利用LLaVA的强大的2D理解 prior,我们的LLaVA-3D有效地将LLaVA应用于3D场景理解,同时不牺牲2D理解能力。为了实现这一目标,我们采用了一个简单的而有效的表示,3D Patch,它将2D CLIP补丁特征与它们在3D空间中的相应位置连接起来。通过将3D补丁集成到2D LMM中,并使用联合2D和3D视觉语言指令调整,我们建立了一个统一架构,用于2D图像理解和3D场景理解。实验结果表明,在用3D视觉语言数据集训练后,LLaVA-3D比现有3D LMM快3.5倍。此外,LLaVA-3D不仅在各种3D任务中实现了最先进的性能,而且还保持了与LLaVA相当的2D图像理解和视觉语言交流能力。
https://arxiv.org/abs/2409.18125