In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and covers diverse scenarios with detailed target annotations. The dataset not only supports key tasks such as visual understanding and object detection, but also offers a distinctive contribution: the study develops a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation and providing a paradigmatic framework for constructing multimodal datasets across remote sensing vertical domains. Experiments on 16 mainstream VLMs verify the effectiveness of the dataset and establish the first multi-task dialogue benchmark in the SAR field. The project will be released at this https URL, aiming to promote the in-depth development and wide application of SAR vision-language models.
https://arxiv.org/abs/2502.08168
Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains such as autonomous vehicles, smart surveillance, and healthcare, their deployment on resource-constrained edge devices remains challenging due to processing power, memory, and energy limitations. This survey explores recent advancements in optimizing VLMs for edge environments, focusing on model compression techniques, including pruning, quantization, knowledge distillation, and specialized hardware solutions that enhance efficiency. We provide a detailed discussion of efficient training and fine-tuning methods, edge deployment challenges, and privacy considerations. Additionally, we discuss the diverse applications of lightweight VLMs across healthcare, environmental monitoring, and autonomous systems, illustrating their growing impact. By highlighting key design strategies, current challenges, and offering recommendations for future directions, this survey aims to inspire further research into the practical deployment of VLMs, ultimately making advanced AI accessible in resource-limited settings.
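Of the compression techniques this survey covers, post-training dynamic quantization is the quickest to try in practice. Below is a minimal, hedged PyTorch sketch on a toy stand-in for a VLM's language layers; the module and shapes are placeholders, not taken from any surveyed system:

```python
import torch
import torch.nn as nn

# Toy stand-in for a VLM's language layers: in a real model the nn.Linear
# projections of the transformer dominate memory and latency on edge devices.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

# Post-training dynamic quantization: weights of the listed module types are
# stored in int8 and dequantized on the fly; activations remain floating point.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)  # torch.Size([1, 768]); same interface, smaller weights
print(quantized)           # the Linear layers are now dynamically quantized modules
```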
https://arxiv.org/abs/2502.07855
Vision-Language Models (VLMs), such as GPT-4V and Llama 3.2 Vision, have garnered significant research attention for their ability to leverage Large Language Models (LLMs) in multimodal tasks. However, their potential is constrained by inherent challenges, including proprietary restrictions, substantial computational demands, and limited accessibility. Smaller models, such as GIT and BLIP, exhibit marked limitations, often failing to generate coherent and consistent text beyond a few tokens, even with extensive training. This underscores a pivotal inquiry: how small can a VLM be and still produce fluent and consistent text? Drawing inspiration from the exceptional learning process of 3-4-year-old children, who rely heavily on visual cues for understanding and communication, we introduce two novel datasets: ShortDesc (featuring concise image descriptions) and LongDesc (containing more detailed image descriptions). These datasets consist of image-text pairs where the text is restricted to the simple vocabulary and syntax typically used by young children, generated with a scaled-down model, GPT-4o. Using these datasets, we demonstrate that it is possible to train VLMs that are significantly smaller, up to 10 times smaller than state-of-the-art (SOTA) small VLMs, while maintaining architectural simplicity. To evaluate the outputs, we leverage GPT-4o to grade the text, as if it were stories written by students, on creativity, meaningfulness, and consistency, assigning scores out of 10. This method addresses limitations of standard benchmarks by accommodating unstructured outputs and providing a multidimensional evaluation of model capabilities. Our findings contribute to the development of lightweight, accessible multimodal models for resource-constrained environments.
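The GPT-4o grading protocol described above (scores out of 10 for creativity, meaningfulness, and consistency) could be reproduced with a rubric prompt along the following lines; this is a sketch, and the rubric wording and JSON schema are assumptions rather than the authors' prompt:

```python
import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Illustrative rubric; the paper's exact grading prompt is not reproduced here.
RUBRIC = (
    "You are grading a short image description written by a small model, "
    "as if it were a story written by a student. Score creativity, "
    "meaningfulness, and consistency, each out of 10. "
    'Reply with JSON: {"creativity": int, "meaningfulness": int, "consistency": int}.'
)

def grade(caption: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": caption},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(grade("A small dog runs on the green grass. The dog is happy."))
```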
https://arxiv.org/abs/2502.07838
Accurate prediction with multimodal data, encompassing tabular, textual, and visual inputs or outputs, is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a novel framework that leverages multimodal synthetic data generation, such as conditional diffusion models, to enhance predictive performance across structured and unstructured modalities. GDP is model-agnostic, compatible with any high-fidelity generative model, and supports transfer learning for domain adaptation. We establish a rigorous theoretical foundation for GDP, providing statistical guarantees on its predictive accuracy when using diffusion models as the generative backbone. By estimating the data-generating distribution and adapting to various loss functions for risk minimization, GDP enables accurate point predictions across multimodal settings. We empirically validate GDP on four supervised learning tasks (tabular data prediction, question answering, image captioning, and adaptive quantile regression), demonstrating its versatility and effectiveness across diverse domains.
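As a rough illustration of the GDP mechanism, the sketch below stands a toy Gaussian sampler in for the generative backbone (the paper uses conditional diffusion models) and shows how the loss-aware point prediction falls out of the generated samples: the mean for squared loss, the median for absolute loss, and the tau-quantile for pinball loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_samples(x, n=2000):
    """Toy stand-in for a conditional generative backbone (e.g. a diffusion
    model): draws samples from an assumed p(y | x)."""
    return rng.normal(loc=2.0 * x + 1.0, scale=1.0 + 0.5 * abs(x), size=n)

def gdp_point_prediction(x, loss="squared", tau=0.9):
    """Loss-aware point prediction over generated samples:
    squared loss -> mean, absolute loss -> median, pinball loss -> tau-quantile."""
    samples = generate_samples(x)
    if loss == "squared":
        return samples.mean()
    if loss == "absolute":
        return np.median(samples)
    if loss == "pinball":
        return np.quantile(samples, tau)
    raise ValueError(f"unknown loss: {loss}")

x = 3.0
print(gdp_point_prediction(x, "squared"))           # approximately 7.0 for this toy p(y|x)
print(gdp_point_prediction(x, "pinball", tau=0.9))  # upper quantile, as in quantile regression
```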
https://arxiv.org/abs/2502.07090
The evaluation of image captions, covering both linguistic fluency and semantic correspondence to visual content, has received significant attention. Still, despite advancements such as the CLIPScore metric, multilingual captioning evaluation remains relatively unexplored. This work presents several strategies, and extensive experiments, for evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality-aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages, and additional tests with natively multilingual and multicultural data further attest to the quality of the assessments.
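For context, reference-free CLIPScore is a rescaled, clipped cosine similarity, CLIPScore(c, v) = 2.5 * max(cos(c, v), 0). A hedged sketch of a multilingual variant using sentence-transformers checkpoints follows; the checkpoint names come from that library's model cards and are not the models evaluated in this paper:

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Image embeddings from the CLIP vision tower; text embeddings from a multilingual
# text encoder distilled into the same space (checkpoints per the library's docs).
img_model = SentenceTransformer("clip-ViT-B-32")
txt_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

def clipscore(image_path: str, caption: str) -> float:
    v = img_model.encode(Image.open(image_path))
    c = txt_model.encode(caption)
    cos = float(np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c)))
    return 2.5 * max(cos, 0.0)  # reference-free CLIPScore rescaling (Hessel et al., 2021)

print(clipscore("photo.jpg", "Un chien court sur la plage."))
```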
https://arxiv.org/abs/2502.06600
Medical time series are often irregular and face significant missingness, posing challenges for data analysis and clinical decision-making. Existing methods typically adopt a single modeling perspective, either treating series data as sequences or transforming them into image representations for further classification. In this paper, we propose a joint learning framework that incorporates both sequence and image representations. We also design three self-supervised learning strategies to facilitate the fusion of sequence and image representations, capturing a more generalizable joint representation. The results indicate that our approach outperforms seven other state-of-the-art models in three representative real-world clinical datasets. We further validate our approach by simulating two major types of real-world missingness through leave-sensors-out and leave-samples-out techniques. The results demonstrate that our approach is more robust and significantly surpasses other baselines in terms of classification performance.
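One plausible reading of the fusion idea is sketched below: encode each series once as a sequence and once as an image-like rendering, then align the two views of the same sample with a contrastive (NT-Xent) objective. The encoders are reduced to placeholders and the loss is illustrative, not the paper's exact self-supervised strategy:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqEncoder(nn.Module):
    """Placeholder sequence branch (e.g. a GRU over the clinical series)."""
    def __init__(self, n_features, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_features, dim, batch_first=True)
    def forward(self, x):                 # x: (B, T, F)
        _, h = self.rnn(x)
        return h[-1]                      # (B, dim)

class ImgEncoder(nn.Module):
    """Placeholder image branch (series rendered as a 1-channel (F, T) pseudo-image)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                 nn.Linear(16 * 16, dim))
    def forward(self, img):               # img: (B, 1, F, T)
        return self.net(img)

def nt_xent(z1, z2, temperature=0.1):
    """Contrastive alignment of the two views of the same sample."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature    # (B, B); diagonal entries are positives
    labels = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

B, T, n_feat = 8, 48, 12
x = torch.randn(B, T, n_feat)
img = x.transpose(1, 2).unsqueeze(1)      # render each series as a 1-channel image
loss = nt_xent(SeqEncoder(n_feat)(x), ImgEncoder()(img))
print(loss.item())
```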
https://arxiv.org/abs/2502.06134
There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and hence, unfortunately, suffers from inefficiency issues from both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), a qualified approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve the training convergence. Empirically, in text-to-image generation, Show-o Turbo displays a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming that of the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at this https URL.
https://arxiv.org/abs/2502.05415
We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
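A simplified sketch of how the two objectives can be combined with dynamic balancing is shown below; the balancing heuristic (weighting each term by the inverse of its running magnitude) is an assumption for illustration, not QLIP's exact rule:

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between image and text embeddings."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

class DynamicBalancer:
    """Tracks running magnitudes of each loss and weights them so neither term
    dominates (an illustrative heuristic, not the paper's exact scheme)."""
    def __init__(self, momentum=0.99):
        self.m = momentum
        self.avg = {"rec": 1.0, "align": 1.0}
    def __call__(self, rec, align):
        for k, v in (("rec", rec), ("align", align)):
            self.avg[k] = self.m * self.avg[k] + (1 - self.m) * float(v.detach())
        return rec / self.avg["rec"] + align / self.avg["align"]

balancer = DynamicBalancer()
x, x_hat = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
rec = F.mse_loss(x_hat, x)                     # reconstruction objective
align = clip_alignment_loss(img_emb, txt_emb)  # language-image alignment objective
total = balancer(rec, align)
print(total.item())
```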
https://arxiv.org/abs/2502.05178
Establishing the long-context capability of large vision-language models is crucial for video understanding, high-resolution image understanding, multi-modal agents, and reasoning. We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing image, video, and text modalities over 4K frames or 1M tokens while delivering strong performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallel distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and text during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples drawn only from public datasets and demonstrates state-of-the-art performance on various multi-modal benchmarks compared against recent cutting-edge models trained with internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
https://arxiv.org/abs/2502.05177
Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, it is challenging for existing image representations to specifically encode the dynamic urban environment (such as pedestrians, vehicles, and vegetation), the built environment (including buildings, roads, and urban infrastructure), and the environmental ambiance (such as the cultural and socioeconomic atmosphere) depicted in street view imagery to address downstream tasks related to the city. In this work, we propose an innovative self-supervised learning framework that leverages temporal and spatial attributes of street view imagery to learn image representations of the dynamic urban environment for diverse downstream tasks. By employing street view images captured at the same location over time and spatially nearby views at the same time, we construct contrastive learning tasks designed to learn the temporal-invariant characteristics of the built environment and the spatial-invariant neighborhood ambiance. Our approach significantly outperforms traditional supervised and unsupervised methods in tasks such as visual place recognition, socioeconomic estimation, and human-environment perception. Moreover, we demonstrate the varying behaviors of image representations learned through different contrastive learning objectives across various downstream tasks. This study systematically discusses representation learning strategies for urban studies based on street view images, providing a benchmark that enhances the applicability of visual data in urban science. The code is available at this https URL.
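The two pretext tasks can be written as standard InfoNCE losses whose positives are (i) the same location photographed at different times and (ii) nearby views captured at the same time; a minimal sketch with placeholder embeddings:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """InfoNCE where row i of `positive` is the positive for row i of `anchor`;
    the other rows in the batch act as negatives."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

# Embeddings from any image backbone (placeholder tensors here).
z_t0     = torch.randn(16, 256)  # location i, year t
z_t1     = torch.randn(16, 256)  # same location i, year t+1   -> temporal positive
z_nearby = torch.randn(16, 256)  # view a short distance away, same time -> spatial positive

loss_temporal = info_nce(z_t0, z_t1)      # learn what persists over time (built environment)
loss_spatial  = info_nce(z_t0, z_nearby)  # learn what a neighborhood shares (ambiance)
loss = loss_temporal + loss_spatial
print(loss.item())
```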
https://arxiv.org/abs/2502.04638
Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure -- including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages -- as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.
https://arxiv.org/abs/2502.04223
Vision-language models (VLMs) excel in tasks such as visual question answering and image captioning. However, VLMs are often limited by their use of pretrained image encoders, like CLIP, leading to image understanding errors that hinder overall performance. On top of that, real-world applications often require the model to be continuously adapted as new and often limited data continuously arrive. To address this, we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25x compared to full VLM updates, without sacrificing performance. Experimental results on VQA tasks in the few-shot continual learning setting, validate LoRSU's scalability, efficiency, and effectiveness, making it a compelling solution for image encoder adaptation in resource-constrained environments.
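A rough sketch of the selective-update idea follows: score parameters by gradient mass on a probe batch of new data, keep only the top-ranked ones trainable, and update them while the rest of the encoder stays frozen. The scoring rule and granularity here are illustrative assumptions, not LoRSU's exact theoretical criterion:

```python
import torch
import torch.nn as nn

def select_critical_params(model, loss_fn, batch, k=2):
    """Rank named parameters by squared-gradient mass on a probe batch and
    return the top-k names (a stand-in for LoRSU's selection criterion)."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    scores = {n: p.grad.pow(2).sum().item()
              for n, p in model.named_parameters() if p.grad is not None}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def freeze_all_but(model, names):
    """Leave only the selected parameters trainable."""
    for n, p in model.named_parameters():
        p.requires_grad = n in names

# Toy "image encoder" and probe batch standing in for the VLM's vision tower.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
batch = (torch.randn(8, 32), torch.randint(0, 10, (8,)))
loss_fn = lambda m, b: nn.functional.cross_entropy(m(b[0]), b[1])

critical = select_critical_params(encoder, loss_fn, batch)
freeze_all_but(encoder, critical)
print("updating only:", critical)

opt = torch.optim.AdamW((p for p in encoder.parameters() if p.requires_grad), lr=1e-4)
opt.zero_grad()
loss_fn(encoder, batch).backward()
opt.step()
```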
https://arxiv.org/abs/2502.04098
Groundbreaking advancements in text-to-image generation have recently been achieved with the emergence of diffusion models. These models exhibit a remarkable ability to generate highly artistic and intricately detailed images based on textual prompts. However, obtaining the desired generation outcomes often requires repeated trial-and-error manipulation of text prompts, much like casting spells on a magic mirror, and the reason is the limited semantic understanding capability inherent in current image generation models. Specifically, existing diffusion models encode the text prompt input with a pre-trained encoder structure, which is usually trained on a limited number of image-caption pairs. State-of-the-art large language models (LLMs) based on the decoder-only structure have shown powerful semantic understanding capability, as their architectures are more suitable for training on very large-scale unlabeled data. In this work, we propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models, and devise a simple yet effective adapter that makes the diffusion models compatible with the decoder-only structure. Meanwhile, we also provide a supporting theoretical analysis across architectures (e.g., encoder-only, encoder-decoder, and decoder-only), and conduct extensive empirical evaluations to verify its effectiveness. The experimental results show that the enhanced models with our adapter module are superior to the state-of-the-art models in terms of text-to-image generation quality and reliability.
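The adapter can be pictured as a small projection that maps the decoder-only LLM's last hidden states into the conditioning space a diffusion U-Net's cross-attention expects, in place of the usual CLIP/T5 text-encoder output. Dimensions and layer choices below are illustrative assumptions, not the paper's exact module:

```python
import torch
import torch.nn as nn

class LLMToDiffusionAdapter(nn.Module):
    """Maps decoder-only LLM hidden states (B, T, d_llm) to the conditioning
    sequence (B, T, d_cond) consumed by a diffusion U-Net's cross-attention
    layers. A sketch, not the paper's exact architecture."""
    def __init__(self, d_llm=4096, d_cond=768, n_layers=2):
        super().__init__()
        blocks = []
        for i in range(n_layers):
            blocks += [nn.Linear(d_llm if i == 0 else d_cond, d_cond), nn.GELU()]
        self.proj = nn.Sequential(*blocks[:-1])  # drop the trailing activation
        self.norm = nn.LayerNorm(d_cond)

    def forward(self, llm_hidden_states):        # (B, T, d_llm)
        return self.norm(self.proj(llm_hidden_states))

# llm_hidden_states would come from e.g. llm(..., output_hidden_states=True).hidden_states[-1]
llm_hidden_states = torch.randn(2, 77, 4096)
adapter = LLMToDiffusionAdapter()
cond = adapter(llm_hidden_states)                # pass as encoder_hidden_states to the U-Net
print(cond.shape)                                # torch.Size([2, 77, 768])
```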
https://arxiv.org/abs/2502.04412
Today's open vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Most existing methods adopt a two-stage pipeline: weakly supervised pre-training with image captions and supervised fine-tuning (SFT) on fully annotated scene graphs. Nonetheless, they omit explicit modeling of interacting objects and treat all objects equally, resulting in mismatched relation pairs. To this end, we propose an interaction-aware OVSGG framework INOVA. During pre-training, INOVA employs an interaction-aware target generation strategy to distinguish interacting objects from non-interacting ones. In SFT, INOVA devises an interaction-guided query selection tactic to prioritize interacting objects during bipartite graph matching. Besides, INOVA is equipped with an interaction-consistent knowledge distillation to enhance the robustness by pushing interacting object pairs away from the background. Extensive experiments on two benchmarks (VG and GQA) show that INOVA achieves state-of-the-art performance, demonstrating the potential of interaction-aware mechanisms for real-world applications.
https://arxiv.org/abs/2502.03856
Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and extracting the structures and relationships depicted in diagrams continues to pose significant challenges. This study addresses these challenges by proposing a text-driven approach that bypasses reliance on VLMs' visual recognition capabilities. Instead, it utilizes the editable source files (such as xlsx, pptx, or docx), where diagram elements (e.g., shapes, lines, annotations) are preserved as textual metadata. In our proof-of-concept, we extracted diagram information from xlsx-based system design documents and transformed the extracted shape data into textual input for Large Language Models (LLMs). This approach allowed the LLM to analyze relationships and generate responses to business-oriented questions without the bottleneck of image-based processing. Experimental comparisons with a VLM-based method demonstrated that the proposed text-driven framework yielded more accurate answers for questions requiring detailed comprehension of diagram structures. The results obtained in this study are not limited to the tested .xlsx files but can also be extended to diagrams in other documents with source files, such as Office pptx and docx formats. These findings highlight the feasibility of circumventing VLM constraints through direct textual extraction from original source files. By enabling robust diagram understanding through LLMs, our method offers a promising path toward enhanced workflow efficiency and information analysis in real-world business scenarios.
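Because xlsx files are zip archives of XML, the shape metadata this approach relies on can be pulled out with the standard library alone; a minimal sketch that lists the text carried by drawn shapes (it assumes the shapes live under xl/drawings/, which is where Excel stores its DrawingML parts):

```python
import zipfile
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

def extract_shape_texts(xlsx_path: str) -> list[str]:
    """Collect the text runs of every drawn shape in an .xlsx workbook."""
    texts = []
    with zipfile.ZipFile(xlsx_path) as zf:
        for name in zf.namelist():
            if name.startswith("xl/drawings/") and name.endswith(".xml"):
                root = ET.fromstring(zf.read(name))
                for t in root.iter(f"{{{A_NS}}}t"):  # DrawingML text runs
                    if t.text and t.text.strip():
                        texts.append(t.text.strip())
    return texts

# The extracted labels (plus connector/anchor metadata, omitted here) can then be
# serialized as plain text and handed to an LLM, in the spirit of the paper's pipeline.
print(extract_shape_texts("system_design.xlsx"))
```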
https://arxiv.org/abs/2502.04389
Efforts to connect LiDAR data with text, such as LidarCLIP, have primarily focused on embedding 3D point clouds into CLIP text-image space. However, these approaches rely on 3D point clouds, which present challenges in encoding efficiency and neural network processing. With the advent of advanced LiDAR sensors like Ouster OS1, which, in addition to 3D point clouds, produce fixed resolution depth, signal, and ambient panoramic 2D images, new opportunities emerge for LiDAR based tasks. In this work, we propose an alternative approach to connect LiDAR data with text by leveraging 2D imagery generated by the OS1 sensor instead of 3D point clouds. Using the Florence 2 large model in a zero-shot setting, we perform image captioning and object detection. Our experiments demonstrate that Florence 2 generates more informative captions and achieves superior performance in object detection tasks compared to existing methods like CLIP. By combining advanced LiDAR sensor data with a large pre-trained model, our approach provides a robust and accurate solution for challenging detection scenarios, including real-time applications requiring high accuracy and robustness.
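Zero-shot captioning and detection on the OS1's 2D panoramas can follow the standard Florence-2 usage published on its Hugging Face model card; the sketch below is based on that documented interface (the checkpoint name and task prompts come from the model card, not from this paper):

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

def run_task(image: Image.Image, task: str):
    """task is a Florence-2 prompt such as '<DETAILED_CAPTION>' or '<OD>'."""
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(input_ids=inputs["input_ids"],
                                   pixel_values=inputs["pixel_values"],
                                   max_new_tokens=1024, num_beams=3)
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task,
                                             image_size=(image.width, image.height))

panorama = Image.open("os1_signal_panorama.png").convert("RGB")  # OS1 2D signal image
print(run_task(panorama, "<DETAILED_CAPTION>"))  # free-form caption
print(run_task(panorama, "<OD>"))                # dict with 'bboxes' and 'labels'
```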
https://arxiv.org/abs/2502.04385
Spatial reasoning is an important component of human cognition and is an area in which the latest vision-language models (VLMs) show signs of difficulty. Current analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning in VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
https://arxiv.org/abs/2502.04359
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
https://arxiv.org/abs/2502.02589
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at this https URL.
https://arxiv.org/abs/2502.01576
Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
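The recalibration step can be illustrated with plain tensor arithmetic: compare the current decoding step's attention over visual tokens against a running average of earlier steps, treat tokens whose attention jumps as critical, boost them, and renormalize. The thresholding rule and boost factor below are illustrative assumptions, not SPARC's exact settings:

```python
import torch

def recalibrate_visual_attention(attn, prev_avg, visual_idx, boost=1.5, momentum=0.9):
    """attn: (num_tokens,) attention of the current decoding step over the context;
    prev_avg: running average of earlier steps' attention; visual_idx: positions of
    visual tokens. Returns (recalibrated attention, updated running average)."""
    attn = attn.clone()
    delta = attn[visual_idx] - prev_avg[visual_idx]        # change vs. earlier steps
    critical = visual_idx[delta > delta.mean()]            # visual tokens gaining attention
    attn[critical] = attn[critical] * boost                # selectively amplify them
    attn = attn / attn.sum()                               # renormalize to a distribution
    new_avg = momentum * prev_avg + (1 - momentum) * attn  # update the running average
    return attn, new_avg

num_ctx = 600                                              # e.g. 576 visual tokens + text tokens
visual_idx = torch.arange(0, 576)
attn = torch.softmax(torch.randn(num_ctx), dim=0)
prev_avg = torch.full((num_ctx,), 1.0 / num_ctx)
attn, prev_avg = recalibrate_visual_attention(attn, prev_avg, visual_idx)
print(attn.sum().item(), attn[visual_idx].sum().item())    # still sums to 1; visual share boosted
```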
https://arxiv.org/abs/2502.01419