Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures to explore cross-modal correlations. However, the strongly discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck to improving model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). The VEA mines fine-grained image features via a Differential Attention (DA) mechanism, while the TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptively weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on the RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics over state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.
https://arxiv.org/abs/2505.16756
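A minimal PyTorch sketch of how an asymmetric adapter pair of this kind could look. The layer sizes, the learnable balance term in the differential-attention step, and the simple token gate used as a stand-in for the hierarchical text attention are assumptions for illustration, not the authors' released implementation.

```python
# Sketch of an asymmetric adapter pair in the spirit of CMAA (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualEnhancementAdapter(nn.Module):
    """Bottleneck adapter whose mixing step uses a differential attention map:
    softmax(Q1 K1^T) - lambda * softmax(Q2 K2^T), damping common-mode attention
    noise so that fine-grained patch interactions stand out."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.q1, self.k1 = nn.Linear(bottleneck, bottleneck), nn.Linear(bottleneck, bottleneck)
        self.q2, self.k2 = nn.Linear(bottleneck, bottleneck), nn.Linear(bottleneck, bottleneck)
        self.v = nn.Linear(bottleneck, bottleneck)
        self.lam = nn.Parameter(torch.tensor(0.5))   # learnable balance (assumed init)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                             # x: (B, N_patches, dim)
        h = self.down(x)
        scale = h.shape[-1] ** -0.5
        a1 = F.softmax(self.q1(h) @ self.k1(h).transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(self.q2(h) @ self.k2(h).transpose(-2, -1) * scale, dim=-1)
        h = (a1 - self.lam * a2) @ self.v(h)          # differential attention mixing
        return x + self.up(h)                         # residual adapter output

class TextSemanticAdapter(nn.Module):
    """Lighter text-side adapter: a token-level gate (a simple stand-in for
    hierarchical attention) followed by a bottleneck projection."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                             # x: (B, N_tokens, dim)
        return x + self.up(F.gelu(self.down(self.gate(x) * x)))

if __name__ == "__main__":
    print(VisualEnhancementAdapter(512)(torch.randn(2, 50, 512)).shape)  # (2, 50, 512)
    print(TextSemanticAdapter(512)(torch.randn(2, 32, 512)).shape)       # (2, 32, 512)
```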
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at this https URL.
https://arxiv.org/abs/2505.16647
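A hedged sketch of the described recipe: wrap a Qwen2.5-VL-7B-Instruct checkpoint with LoRA adapters via peft and cast a counting/pointing annotation into an instruction-style prompt. The model class name assumes a recent transformers release that ships Qwen2.5-VL; the LoRA hyperparameters, target modules, prompt wording, and JSON output schema are placeholders, not the paper's templates.

```python
# Assumed LoRA setup and instruction-prompt formatting (not the released code).
import json
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

def build_lora_model():
    # Downloads the 7B checkpoint; dtype chosen automatically.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,                 # assumed hyperparameters
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora)

def counting_pointing_prompt(image_path, finding="polyp"):
    """Reformulate a detection/counting annotation as a chat-style instruction."""
    text = (f"Count every {finding} in the image and return JSON of the form "
            '{"count": <int>, "points": [[x, y], ...]}.')
    return [{"role": "user",
             "content": [{"type": "image", "image": image_path},
                         {"type": "text", "text": text}]}]

if __name__ == "__main__":
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = build_lora_model()
    model.print_trainable_parameters()            # only the LoRA matrices are trainable
    print(json.dumps(counting_pointing_prompt("frame_0001.png"), indent=2))
```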
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Our findings reveal that advanced proprietary LVMs show superior performance over open-sourced alternatives, and that they gain moderate advantages from multimodal inputs over text-only inputs, while open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at this https URL.
https://arxiv.org/abs/2505.16470
Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at this https URL.
https://arxiv.org/abs/2505.16411
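A minimal sketch of the suppression rule described above: for every text query position, heads are ranked by the attention mass they place on image tokens, the top-K heads are kept intact, and the rest are damped before renormalization. The damping factor, K, and the post-hoc application shown here (the paper intervenes inside the LVLM's attention layers) are assumptions.

```python
# Attention-guided head suppression, sketched on a standalone attention tensor.
import torch

def suppress_low_image_attention_heads(attn, image_mask, top_k=8, damp=0.1):
    """
    attn:       (batch, heads, q_len, k_len) softmaxed attention weights
    image_mask: (k_len,) bool, True where the key position is an image token
    Returns attention with low image-attention heads scaled by `damp`
    independently for every query position.
    """
    img_mass = attn[..., image_mask].sum(dim=-1)              # (B, H, q_len)
    topk_idx = img_mass.topk(top_k, dim=1).indices            # heads to keep per query
    keep = torch.zeros_like(img_mass, dtype=torch.bool)
    keep.scatter_(1, topk_idx, True)
    scale = torch.full_like(img_mass, damp)
    scale[keep] = 1.0                                          # top-K heads untouched
    out = attn * scale.unsqueeze(-1)
    return out / out.sum(dim=-1, keepdim=True)                 # renormalize rows

if __name__ == "__main__":
    B, H, Q, K = 1, 32, 4, 600
    attn = torch.softmax(torch.randn(B, H, Q, K), dim=-1)
    image_mask = torch.zeros(K, dtype=torch.bool)
    image_mask[:576] = True                                    # e.g. 24x24 image patches
    print(suppress_low_image_attention_heads(attn, image_mask).shape)
```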
Evaluating image captions requires a cohesive assessment of both visual semantics and language pragmatics, which most existing metrics fail to capture fully. We introduce Redemption Score, a novel hybrid framework that ranks image captions by triangulating three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) BERTScore for contextual text similarity against human references. A calibrated fusion of these signals allows Redemption Score to offer a more holistic assessment. On the Flickr8k benchmark, Redemption Score achieves a Kendall-$\tau$ of 56.43, outperforming twelve prior methods and demonstrating superior correlation with human judgments without requiring task-specific training. Our framework provides a more robust and nuanced evaluation by effectively redeeming both image semantics and linguistic interpretability, as indicated by strong knowledge transfer on the Conceptual Captions and MS COCO datasets.
https://arxiv.org/abs/2505.16180
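An illustrative sketch of the fusion step, assuming each candidate caption already has its three signals computed: the signals are z-normalized across candidates and combined with fixed weights. The weights and the normalization scheme here are placeholders, not the calibrated values from the paper.

```python
# Weighted fusion of per-caption signals (illustrative weights).
import numpy as np

def redemption_style_score(mid, dino_sim, bertscore, weights=(0.4, 0.3, 0.3)):
    """mid, dino_sim, bertscore: arrays of per-caption signal values."""
    signals = np.stack([mid, dino_sim, bertscore]).astype(float)        # (3, n_captions)
    z = (signals - signals.mean(axis=1, keepdims=True)) / (signals.std(axis=1, keepdims=True) + 1e-8)
    return np.asarray(weights) @ z                                       # (n_captions,)

if __name__ == "__main__":
    mid = np.array([1.2, 0.4, 0.9])
    dino = np.array([0.81, 0.55, 0.77])
    bert = np.array([0.91, 0.88, 0.90])
    scores = redemption_style_score(mid, dino, bert)
    print(scores.argsort()[::-1])   # caption ranking, best first
```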
Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucinations or actuality, realizing more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identified can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.
https://arxiv.org/abs/2505.16146
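A minimal sketch of training-free steering along a latent direction: a forward hook adds a scaled "faithful" direction to the hidden states of one decoder block. The layer choice, the scale, and how the direction is read off the SAE dictionary are assumptions; the demo uses a toy module so the snippet runs without an LVLM.

```python
# Steering hidden states along an SAE-derived direction via a forward hook.
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float = 4.0):
    direction = direction / direction.norm()              # unit-norm steering vector
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def add_steering(layer_module, faithful_direction, alpha=4.0):
    """Register the hook on one transformer block; returns a handle to remove it."""
    return layer_module.register_forward_hook(make_steering_hook(faithful_direction, alpha))

if __name__ == "__main__":
    block = torch.nn.Linear(16, 16)                        # toy stand-in for a decoder block
    handle = add_steering(block, torch.randn(16), alpha=2.0)
    print(block(torch.randn(2, 16)).shape)                 # steered hidden states
    handle.remove()
```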
Multimodal pathological image understanding has garnered widespread interest due to its potential to improve diagnostic accuracy and enable personalized treatment through integrated visual and textual data. However, existing methods exhibit limited reasoning capabilities, which hamper their ability to handle complex diagnostic scenarios. Additionally, the enormous size of pathological images leads to severe computational burdens, further restricting their practical deployment. To address these limitations, we introduce a novel bilateral reinforcement learning framework comprising two synergistic branches. One reinforcement branch enhances reasoning capability by enabling the model to learn task-specific decision processes, i.e., pathology rationales, directly from labels without explicit reasoning supervision, while the other branch dynamically allocates a tailored number of tokens to different images based on both their visual content and task context, thereby optimizing computational efficiency. We apply our method to various pathological tasks such as visual question answering, cancer subtyping, and lesion detection. Extensive experiments show an average +41.7 absolute performance improvement and 70.3% lower inference costs compared to the base models, achieving both reasoning accuracy and computational efficiency.
https://arxiv.org/abs/2505.15687
While eXplainable AI (XAI) has advanced significantly, few methods address interpretability in embedded vector spaces where dimensions represent complex abstractions. We introduce Distance Explainer, a novel method for generating local, post-hoc explanations of embedded spaces in machine learning models. Our approach adapts saliency-based techniques from RISE to explain the distance between two embedded data points by assigning attribution values through selective masking and distance-ranked mask filtering. We evaluate Distance Explainer on cross-modal embeddings (image-image and image-caption pairs) using established XAI metrics including Faithfulness, Sensitivity/Robustness, and Randomization. Experiments with ImageNet and CLIP models demonstrate that our method effectively identifies features contributing to similarity or dissimilarity between embedded data points while maintaining high robustness and consistency. We also explore how parameter tuning, particularly mask quantity and selection strategy, affects explanation quality. This work addresses a critical gap in XAI research and enhances transparency and trustworthiness in deep learning applications utilizing embedded spaces.
https://arxiv.org/abs/2505.15516
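A sketch of the RISE-style attribution idea under simplified assumptions: coarse random masks are applied to one image, the embedding distance to the other data point is recomputed per mask, and a saliency map is accumulated from the distance-ranked masks. The embedding function, mask statistics, and filtering rule are stand-ins, not the method's exact choices.

```python
# Saliency for the distance between an image and a reference embedding.
import numpy as np

def distance_saliency(image, ref_embedding, embed_fn, n_masks=500, p_keep=0.5,
                      cell=8, keep_fraction=0.25, rng=None):
    rng = rng or np.random.default_rng(0)
    H, W = image.shape[:2]
    masks, dists = [], []
    for _ in range(n_masks):
        grid = rng.random((cell, cell)) < p_keep                  # coarse random grid
        mask = np.kron(grid, np.ones((H // cell, W // cell)))     # upsample to image size
        masks.append(mask)
        dists.append(np.linalg.norm(embed_fn(image * mask[..., None]) - ref_embedding))
    masks, dists = np.stack(masks), np.asarray(dists)
    # distance-ranked filtering: keep the masks that bring the pair closest
    order = dists.argsort()[: int(keep_fraction * n_masks)]
    saliency = (masks[order] * (dists.max() - dists[order])[:, None, None]).sum(0)
    return saliency / (masks[order].sum(0) + 1e-8)                # per-pixel attribution

if __name__ == "__main__":
    fake_embed = lambda img: img.mean(axis=(0, 1))                # toy embedding stand-in
    img = np.random.rand(64, 64, 3)
    ref = np.random.rand(3)
    print(distance_saliency(img, ref, fake_embed, n_masks=50).shape)  # (64, 64)
```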
The real-world impact of misinformation stems from the underlying misleading narratives that creators seek to convey. As such, interpreting misleading creator intent is essential for multimodal misinformation detection (MMD) systems aimed at effective information governance. In this paper, we introduce an automated framework that simulates real-world multimodal news creation by explicitly modeling creator intent through two components: the desired influence and the execution plan. Using this framework, we construct DeceptionDecoded, a large-scale benchmark comprising 12,000 image-caption pairs aligned with trustworthy reference articles. The dataset captures both misleading and non-misleading intents and spans manipulations across visual and textual modalities. We conduct a comprehensive evaluation of 14 state-of-the-art vision-language models (VLMs) on three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. Despite recent advances, we observe that current VLMs fall short in recognizing misleading intent, often relying on spurious cues such as superficial cross-modal consistency, stylistic signals, and heuristic authenticity hints. Our findings highlight the pressing need for intent-aware modeling in MMD and open new directions for developing systems capable of deeper reasoning about multimodal misinformation.
https://arxiv.org/abs/2505.15489
Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground-truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability. To address these issues, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework that eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground-truth measure of scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.
https://arxiv.org/abs/2505.15867
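A small sketch of using Graph Edit Distance as a deterministic ground-truth similarity between scene graphs, here via networkx. The label-matching functions, the similarity mapping, and the toy graphs are illustrative; exact GED is expensive, so a timeout or an approximation is typically needed for large graphs.

```python
# Scene-graph similarity from Graph Edit Distance (illustrative setup).
import networkx as nx

def scene_graph(triples):
    """triples: iterable of (subject, predicate, object) strings."""
    g = nx.DiGraph()
    for s, p, o in triples:
        g.add_node(s, label=s)
        g.add_node(o, label=o)
        g.add_edge(s, o, label=p)
    return g

def ged_similarity(g1, g2, timeout=5.0):
    same = lambda a, b: a["label"] == b["label"]
    ged = nx.graph_edit_distance(g1, g2, node_match=same, edge_match=same, timeout=timeout)
    return 1.0 / (1.0 + ged)        # map distance to a (0, 1] similarity

if __name__ == "__main__":
    g_a = scene_graph([("man", "riding", "horse"), ("horse", "on", "beach")])
    g_b = scene_graph([("person", "riding", "horse"), ("horse", "on", "sand")])
    print(ged_similarity(g_a, g_b))
```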
Adversarial attacks aim to generate malicious inputs that mislead deep models, but beyond causing model failure, they cannot provide certain interpretable information such as ``\textit{What content in inputs makes models more likely to fail?}'' However, this information is crucial for researchers to specifically improve model robustness. Recent research suggests that models may be particularly sensitive to certain semantics in visual inputs (such as ``wet,'' ``foggy''), making them prone to errors. Inspired by this, in this paper we conducted the first such exploration of large vision-language models (LVLMs) and found that LVLMs indeed are susceptible to hallucinations and various errors when facing specific semantic concepts in images. To efficiently search for these sensitive concepts, we integrated large language models (LLMs) and text-to-image (T2I) models to propose a novel semantic evolution framework. Randomly initialized semantic concepts undergo LLM-based crossover and mutation operations to form image descriptions, which are then converted by T2I models into visual inputs for LVLMs. The task-specific performance of LVLMs on each input is quantified as a fitness score for the involved semantics and serves as a reward signal to further guide the LLMs in exploring concepts that induce LVLM errors. Extensive experiments on seven mainstream LVLMs and two multimodal tasks demonstrate the effectiveness of our method. Additionally, we provide interesting findings about the sensitive semantics of LVLMs, aiming to inspire further in-depth research.
https://arxiv.org/abs/2505.15265
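A high-level sketch of such an evolution loop, with the LLM, T2I model, and LVLM error scorer abstracted as callables. The selection heuristic and fitness definition are simplified assumptions, intended only to show the crossover-render-score-select cycle.

```python
# Semantic evolution loop with placeholder model calls.
import random

def evolve_sensitive_concepts(seed_concepts, llm_combine, t2i_render, lvlm_error,
                              generations=5, population=8, rng=None):
    rng = rng or random.Random(0)
    pool = list(seed_concepts)
    for _ in range(generations):
        offspring = []
        for _ in range(population):
            parents = rng.sample(pool, 2)                 # crossover parents
            description = llm_combine(parents)            # LLM crossover + mutation
            image = t2i_render(description)               # T2I rendering
            fitness = lvlm_error(image)                   # reward: induced error rate
            offspring.append((fitness, parents, description))
        offspring.sort(reverse=True, key=lambda x: x[0])
        # keep the concepts behind the most failure-inducing descriptions
        pool = [c for _, parents, _ in offspring[: population // 2] for c in parents]
    return offspring[: population // 2]                   # top (fitness, parents, description)

if __name__ == "__main__":
    demo = evolve_sensitive_concepts(
        ["wet", "foggy", "reflective", "cluttered"],
        llm_combine=lambda ps: f"a street scene that is {ps[0]} and {ps[1]}",
        t2i_render=lambda desc: desc,                     # stand-in "image"
        lvlm_error=lambda img: random.random(),           # stand-in error score
    )
    print(demo[0])
```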
We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions for future research.
https://arxiv.org/abs/2505.14682
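A compact sketch of the Best-of-N test-time strategy with a verifier in the loop: N candidates are generated for a prompt, each is scored for prompt-image alignment, and the top-scoring candidate is returned. The generator and verifier are placeholder callables; in the paper the same model plays both roles, with verification carried out step by step in a CoT manner.

```python
# Best-of-N selection with a verifier score (placeholder callables).
def best_of_n(prompt, generate_image, verify_alignment, n=4):
    candidates = [generate_image(prompt) for _ in range(n)]
    scored = [(verify_alignment(prompt, img), img) for img in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0]                       # (best score, best image)

if __name__ == "__main__":
    import random
    random.seed(0)
    score, image = best_of_n(
        "a red cube on a blue sphere",
        generate_image=lambda p: f"render<{p}|seed={random.randint(0, 999)}>",
        verify_alignment=lambda p, img: random.random(),   # stand-in verifier score
        n=4,
    )
    print(score, image)
```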
Brain-to-image decoding has been recently propelled by the progress in generative AI models and the availability of large ultra-high field functional Magnetic Resonance Imaging (fMRI). However, current approaches depend on complicated multi-stage pipelines and preprocessing steps that typically collapse the temporal dimension of brain recordings, thereby limiting time-resolved brain decoders. Here, we introduce Dynadiff (Dynamic Neural Activity Diffusion for Image Reconstruction), a new single-stage diffusion model designed for reconstructing images from dynamically evolving fMRI recordings. Our approach offers three main contributions. First, Dynadiff simplifies training as compared to existing approaches. Second, our model outperforms state-of-the-art models on time-resolved fMRI signals, especially on high-level semantic image reconstruction metrics, while remaining competitive on preprocessed fMRI data that collapse time. Third, this approach allows a precise characterization of the evolution of image representations in brain activity. Overall, this work lays the foundation for time-resolved brain-to-image decoding.
https://arxiv.org/abs/2505.14556
As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.
https://arxiv.org/abs/2505.14462
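A minimal sketch of culture-aware retrieval augmentation for cVQA: documents and the query are embedded, the top-k documents by cosine similarity are retrieved, and their text is prepended to the question before it is passed to a VLM. The embedding vectors and prompt template here are assumptions, not the benchmark's trained retrievers.

```python
# Retrieval-augmented prompt assembly for a culture-focused VQA query.
import numpy as np

def top_k_docs(query_vec, doc_vecs, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]                     # indices of best documents

def build_cvqa_prompt(question, docs, idx):
    context = "\n".join(f"[{i}] {docs[i]}" for i in idx)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer briefly."

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = ["Wiki: Songkran festival ...", "Wiki: Hanbok ...", "Wiki: Diwali ..."]
    doc_vecs = rng.normal(size=(3, 8))
    query_vec = doc_vecs[2] + 0.1 * rng.normal(size=8)     # query close to doc 2
    idx = top_k_docs(query_vec, doc_vecs, k=2)
    print(build_cvqa_prompt("Which festival is shown in the image?", docs, idx))
```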
Due to the unidirectional masking mechanism, Decoder-Only models propagate information from left to right. LVLMs (Large Vision-Language Models) follow the same architecture, with visual information gradually integrated into semantic representations during forward propagation. Through systematic analysis, we observe that over 80\% of the visual information is absorbed into the semantic representations. However, the model's attention still predominantly focuses on the visual representations. This misalignment between the attention distribution and the actual information flow undermines the model's visual understanding ability and contributes to hallucinations. To address this issue, we enhance the model's visual understanding by leveraging the core information embedded in semantic representations. Specifically, we identify attention heads that focus on core semantic representations based on their attention distributions. Then, through a two-stage optimization paradigm, we propagate the advantages of these attention heads across the entire model, aligning the attention distribution with the actual information flow. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations. Further experiments reveal a trade-off between reduced hallucinations and richer details. Notably, our method allows for manual adjustment of the model's conservativeness, enabling flexible control to meet diverse real-world requirements. Code will be released once accepted.
https://arxiv.org/abs/2505.14257
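A sketch of the head-selection step, assuming the positions of the "core semantic" tokens are already known: heads are scored per layer by the attention mass they place on those positions, and the top-scoring heads are selected. How those positions are found and how the two-stage optimization propagates the selected heads across the model are not shown; the threshold is illustrative.

```python
# Selecting attention heads that focus on core semantic token positions.
import torch

def select_semantic_heads(attn_per_layer, core_positions, top_k=4):
    """
    attn_per_layer: list of tensors (heads, q_len, k_len), one per layer
    core_positions: 1-D LongTensor of key indices holding core semantic tokens
    Returns {layer_index: LongTensor of selected head indices}
    """
    selected = {}
    for layer, attn in enumerate(attn_per_layer):
        mass = attn[:, :, core_positions].sum(dim=(1, 2))   # attention mass per head
        selected[layer] = mass.topk(top_k).indices
    return selected

if __name__ == "__main__":
    layers = [torch.softmax(torch.randn(16, 10, 80), dim=-1) for _ in range(4)]
    core = torch.tensor([5, 17, 42])
    print(select_semantic_heads(layers, core))
```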
Medical image captioning is a challenging task that requires generating clinically accurate and semantically meaningful descriptions of radiology images. While recent vision-language models (VLMs) such as BLIP, BLIP2, Gemini and ViT-GPT2 show strong performance on natural image datasets, they often produce generic or imprecise captions when applied to specialized medical domains. In this project, we explore the effectiveness of fine-tuning the BLIP model on the ROCO dataset for improved radiology captioning. We compare the fine-tuned BLIP against its zero-shot version, BLIP-2 base, BLIP-2 Instruct and a ViT-GPT2 transformer baseline. Our results demonstrate that domain-specific fine-tuning on BLIP significantly improves performance across both quantitative and qualitative evaluation metrics. We also visualize decoder cross-attention maps to assess interpretability and conduct an ablation study to evaluate the contributions of encoder-only and decoder-only fine-tuning. Our findings highlight the importance of targeted adaptation for medical applications and suggest that decoder-only fine-tuning (encoder-frozen) offers a strong performance baseline with 5% lower training time than full fine-tuning, while full model fine-tuning still yields the best results overall.
https://arxiv.org/abs/2505.14726
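A hedged sketch of the encoder-frozen ("decoder-only") setup: load a Hugging Face BLIP captioning checkpoint, freeze the vision encoder parameters, and leave only the text decoder trainable. Module names follow the Hugging Face BLIP implementation; the ROCO data loading, hyperparameters, and training loop are omitted.

```python
# Decoder-only fine-tuning: freeze the BLIP vision encoder, train the text decoder.
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"

def build_decoder_only_blip():
    model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)
    for p in model.vision_model.parameters():       # freeze the ViT encoder
        p.requires_grad = False
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable / total:.1%} of {total:,}")
    return model

if __name__ == "__main__":
    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = build_decoder_only_blip()
    # Schematic training step on a ROCO batch (not shown):
    # loss = model(pixel_values=..., input_ids=..., labels=...).loss
```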
We present Sat2Sound, a multimodal representation learning framework for soundscape mapping, designed to predict the distribution of sounds at any location on Earth. Existing methods for this task rely on satellite imagery and paired geotagged audio samples, which often fail to capture the diversity of sound sources at a given location. To address this limitation, we enhance existing datasets by leveraging a Vision-Language Model (VLM) to generate semantically rich soundscape descriptions for locations depicted in satellite images. Our approach incorporates contrastive learning across audio, audio captions, satellite images, and satellite image captions. We hypothesize that there is a fixed set of soundscape concepts shared across modalities. To this end, we learn a shared codebook of soundscape concepts and represent each sample as a weighted average of these concepts. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on two datasets: GeoSound and SoundingEarth. Additionally, building on Sat2Sound's ability to retrieve detailed soundscape captions, we introduce a novel application: location-based soundscape synthesis, which enables immersive acoustic experiences. Our code and models will be publicly available.
https://arxiv.org/abs/2505.13777
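A minimal sketch of a shared concept codebook: embeddings from any modality attend over a learned set of concept vectors and are re-expressed as a weighted average of them, so audio, image, and caption samples live in the same concept space. The codebook size, temperature, and the omitted contrastive objective are assumptions.

```python
# Shared soundscape-concept codebook applied to embeddings from any modality.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptCodebook(nn.Module):
    def __init__(self, dim=256, n_concepts=64, temperature=0.07):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_concepts, dim))
        self.temperature = temperature

    def forward(self, embeddings):                          # (batch, dim), any modality
        e = F.normalize(embeddings, dim=-1)
        c = F.normalize(self.codebook, dim=-1)
        weights = F.softmax(e @ c.t() / self.temperature, dim=-1)   # concept weights
        return weights @ self.codebook, weights              # weighted average of concepts

if __name__ == "__main__":
    book = ConceptCodebook()
    sat_repr, _ = book(torch.randn(4, 256))                  # satellite-image embeddings
    audio_repr, w = book(torch.randn(4, 256))                # audio embeddings
    # a contrastive loss would then pull matching sat/audio concept mixtures together
    print(sat_repr.shape, w.shape)                           # (4, 256) (4, 64)
```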
Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image caption datasets are very easy to create without requiring multilingual expertise, so this offers a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align text representations between languages, that languages unseen by the encoder during pretraining can be incorporated into this alignment post-hoc, and that these aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.
https://arxiv.org/abs/2505.13628
Evaluating the open-ended outputs of large language models (LLMs) has become a bottleneck as model capabilities, task diversity, and modality coverage rapidly expand. Existing "LLM-as-a-Judge" evaluators are typically narrow, covering only a few tasks, aspects, or modalities, and easily suffer from low consistency. In this paper, we argue that explicit, fine-grained aspect specification is the key to both generalizability and objectivity in automated evaluation. To this end, we introduce a hierarchical aspect taxonomy spanning 112 aspects that unifies evaluation across four representative settings - Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. Building on this taxonomy, we create FRAbench, a benchmark comprising 60.4k pairwise samples with 325k aspect-level labels obtained from a combination of human and LLM annotations. FRAbench provides the first large-scale, multi-modal resource for training and meta-evaluating fine-grained LMM judges. Leveraging FRAbench, we develop GenEval, a fine-grained evaluator generalizable across tasks and modalities. Experiments show that GenEval (i) attains high agreement with GPT-4o and expert annotators, (ii) transfers robustly to unseen tasks and modalities, and (iii) reveals systematic weaknesses of current LMMs in evaluation.
https://arxiv.org/abs/2505.12795
Despite achieving significant progress in 2D image understanding, large multimodal models (LMMs) struggle in the physical world due to the lack of spatial representation. Typically, existing 3D LMMs mainly embed 3D positions as fixed spatial prompts within visual features to represent the scene. However, these methods are limited to understanding the static background and fail to capture temporally varying dynamic objects. In this paper, we propose LLaVA-4D, a general LMM framework with a novel spatiotemporal prompt for visual representation in 4D scene understanding. The spatiotemporal prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Moreover, we demonstrate that spatial and temporal components disentangled from visual features are more effective in distinguishing the background from objects. This motivates embedding the 4D spatiotemporal prompt into these features to enhance the dynamic scene representation. By aligning visual spatiotemporal embeddings with language embeddings, LMMs gain the ability to understand both spatial and temporal characteristics of static background and dynamic objects in the physical world. Additionally, we construct a 4D vision-language dataset with spatiotemporal coordinate annotations for instruction fine-tuning LMMs. Extensive experiments have been conducted to demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
https://arxiv.org/abs/2505.12253
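An illustrative sketch of one way a 4D spatiotemporal prompt could be realized: each visual token's (x, y, z) position and timestamp t are encoded with sinusoidal features, projected to the visual feature width, and added to the token features. The frequency count and the projection are assumptions about the design, not the paper's exact embedding.

```python
# Dynamic-aware 4D coordinate embedding added to visual token features.
import torch
import torch.nn as nn

class SpatioTemporalPrompt(nn.Module):
    def __init__(self, feat_dim=1024, n_freq=16):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freq) * torch.pi)
        self.proj = nn.Linear(4 * n_freq * 2, feat_dim)    # 4 coords x n_freq x (sin, cos)

    def forward(self, visual_feats, coords):
        """visual_feats: (B, N, feat_dim); coords: (B, N, 4) holding (x, y, z, t)."""
        angles = coords.unsqueeze(-1) * self.freqs          # (B, N, 4, n_freq)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)
        return visual_feats + self.proj(enc)                # dynamic-aware visual tokens

if __name__ == "__main__":
    prompt = SpatioTemporalPrompt()
    feats = torch.randn(2, 576, 1024)
    coords = torch.rand(2, 576, 4)                          # normalized x, y, z, t
    print(prompt(feats, coords).shape)                      # (2, 576, 1024)
```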