Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both the textual caption and the chart image when verifying a claim. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification work focuses on reasoning over textual evidence only or ignores explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly performs evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph over claims and evidence, on which we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce a multi-modal Fusion-in-Decoder for explainability. Finally, since almost all existing datasets are in the general domain, we create a scientific dataset in the AI domain, AIChartClaim, to complement the claim verification community. Experiments demonstrate the strength of our model.
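The abstract does not spell out the fusion operators, so the following is only a minimal sketch of what token- and evidence-level fusion of claim and evidence embeddings could look like; the cross-attention, mean pooling, and three-way label head are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoLevelFusion(nn.Module):
    """Hypothetical token- and evidence-level fusion for multi-modal claim verification."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.token_fusion = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.evidence_scorer = nn.Linear(dim, 1)
        self.classifier = nn.Linear(2 * dim, 3)  # supported / refuted / not enough info

    def forward(self, claim_tokens, evidence_token_sets):
        # claim_tokens: (B, Lc, D); each element of evidence_token_sets: (B, Le, D),
        # coming from either a text encoder or an image (chart) encoder.
        summaries = []
        for evidence in evidence_token_sets:
            # Token-level fusion: claim tokens attend over one piece of evidence.
            fused, _ = self.token_fusion(claim_tokens, evidence, evidence)
            summaries.append(fused.mean(dim=1))                 # (B, D) per-evidence summary
        evidence_stack = torch.stack(summaries, dim=1)          # (B, N, D)
        # Evidence-level fusion: weight and pool the per-evidence summaries.
        weights = torch.softmax(self.evidence_scorer(evidence_stack), dim=1)
        pooled = (weights * evidence_stack).sum(dim=1)          # (B, D)
        claim_vec = claim_tokens.mean(dim=1)
        return self.classifier(torch.cat([claim_vec, pooled], dim=-1))
```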
https://arxiv.org/abs/2602.10023
Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data ($\rho = 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
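Assuming the reported ρ is a Spearman rank correlation, such a number can be reproduced from per-country sample counts and GDP figures in a few lines; the country statistics below are placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical per-country statistics: number of dataset samples and GDP (billions USD).
samples_per_country = {"USA": 310_000, "GBR": 95_000, "CAN": 70_000, "BRA": 9_000, "NGA": 4_000}
gdp_per_country     = {"USA": 27_000,  "GBR": 3_300,  "CAN": 2_100,  "BRA": 2_170, "NGA": 360}

countries = sorted(samples_per_country)
rho, p_value = spearmanr(
    [gdp_per_country[c] for c in countries],
    [samples_per_country[c] for c in countries],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```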
https://arxiv.org/abs/2602.09775
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgments. With task-specific prompts spanning best-candidate selection, summarization, image captioning, and dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
https://arxiv.org/abs/2602.09624
The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding, and an inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality, and diverse dataset for text-guided medical image editing. It categorizes editing tasks into the perspectives of Perception, Modification, and Transformation, covering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline that leverages both modality-specific expert models and rule-based data synthesis methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
https://arxiv.org/abs/2602.09587
Retrieving wrist radiographs with analogous fracture patterns is challenging because clinically important cues are subtle, highly localized and often obscured by overlapping anatomy or variable imaging views. Progress is further limited by the scarcity of large, well-annotated datasets for case-based medical image retrieval. We introduce WristMIR, a region-aware pediatric wrist radiograph retrieval framework that leverages dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without any manual image-level annotations. Using MedGemma-based structured report mining to generate both global and region-level captions, together with pre-processed wrist images and bone-specific crops of the distal radius, distal ulna, and ulnar styloid, WristMIR jointly trains global and local contrastive encoders and performs a two-stage retrieval process: (1) coarse global matching to identify candidate exams, followed by (2) region-conditioned reranking aligned to a predefined anatomical bone region. WristMIR improves retrieval performance over strong vision-language baselines, raising image-to-text Recall@5 from 0.82% to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean $F_1$ from 0.568 to 0.753, and radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35. These findings highlight the potential of anatomically guided retrieval to enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging. The source code is publicly available at this https URL.
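A minimal sketch of the two-stage retrieval loop described above (coarse global matching followed by region-conditioned reranking), using cosine similarity over precomputed embeddings; the encoder outputs, region names, and candidate-pool sizes are placeholders, not the paper's exact configuration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def two_stage_retrieval(query_global, query_region, bank, region, k_coarse=50, k_final=5):
    """Stage 1: coarse global matching to shortlist exams; Stage 2: rerank the shortlist
    by similarity of a region-conditioned embedding for the requested bone region.

    `bank` is a list of dicts with precomputed embeddings, e.g.
    {'exam_id': ..., 'global': np.ndarray, 'regions': {'distal_radius': np.ndarray, ...}}.
    """
    coarse = sorted(bank, key=lambda e: cosine(query_global, e["global"]), reverse=True)[:k_coarse]
    reranked = sorted(coarse, key=lambda e: cosine(query_region, e["regions"][region]), reverse=True)
    return [e["exam_id"] for e in reranked[:k_final]]
```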
https://arxiv.org/abs/2602.07872
Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph, encompassing over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and upon BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
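The exact distillation objective is not given in the abstract; below is a hedged sketch of a teacher-guided distillation term that pulls the student's caption embedding toward a frozen, knowledge-enhanced phenotype embedding of the same caption, added to the usual image-text contrastive loss with some weight. The cosine form and the weighting scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def teacher_distillation_loss(student_text_emb, teacher_phenotype_emb):
    """Hypothetical teacher-guided distillation term: push the student's caption embedding
    toward the frozen, knowledge-enhanced phenotype embedding of the same caption.
    In the full objective this would be combined with the image-text contrastive loss,
    e.g. total = contrastive + lam * distill."""
    return (1.0 - F.cosine_similarity(student_text_emb, teacher_phenotype_emb, dim=-1)).mean()

# Usage with dummy embeddings:
loss = teacher_distillation_loss(torch.randn(8, 512), torch.randn(8, 512))
```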
https://arxiv.org/abs/2602.06184
Contrastive Language-Image Pre-training (CLIP) has been widely applied to various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation costs, which prohibits its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of the pre-trained CLIP weights by selecting a subset of them as inherited weights for further retraining, via mask optimization or importance measurement. However, such selection-based weight inheritance often compromises feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights via Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization, which reduces the distribution-shift problem and enables efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains observed under high compression settings.
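A minimal sketch of the mapping idea under stated assumptions: the compressed weight is produced by applying a learnable Kronecker-factorized matrix to the pretrained weight, initialized so that the mapping starts close to a plain row-selection (identity-like) operator. The factor shapes, the compression direction, and the initialization details are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class KroneckerMapping(nn.Module):
    """Hypothetical mapping M = A kron B applied to a pretrained weight matrix W,
    producing a smaller weight M @ W with far fewer learnable parameters than a dense M."""

    def __init__(self, a_out, a_in, b_out, b_in):
        super().__init__()
        assert a_in >= a_out and b_in >= b_out, "compression: output dims not larger than input dims"
        A = torch.zeros(a_out, a_in)
        B = torch.zeros(b_out, b_in)
        # Diagonal-style initialization: A kron B starts close to a row-selection operator,
        # so optimization begins near plain (select-based) weight inheritance.
        A[:, :a_out] = torch.eye(a_out)
        B[:, :b_out] = torch.eye(b_out)
        self.A, self.B = nn.Parameter(A), nn.Parameter(B)

    def forward(self, pretrained_weight):            # pretrained_weight: (a_in * b_in, d)
        mapping = torch.kron(self.A, self.B)         # (a_out * b_out, a_in * b_in)
        return mapping @ pretrained_weight           # (a_out * b_out, d)

# Usage: map a 768-row pretrained projection down to 384 rows (a 2x width reduction).
mapper = KroneckerMapping(a_out=16, a_in=32, b_out=24, b_in=24)
small_w = mapper(torch.randn(32 * 24, 512))
print(small_w.shape)  # torch.Size([384, 512])
```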
https://arxiv.org/abs/2602.05909
Prompt-guided generative AI models have rapidly expanded across vision and language domains, producing realistic and diverse outputs from textual inputs. The growing variety of such models, trained with different data and architectures, calls for principled methods to identify which types of prompts lead to distinct model behaviors. In this work, we propose PromptSplit, a kernel-based framework for detecting and analyzing prompt-dependent disagreement between generative models. For each compared model pair, PromptSplit constructs a joint prompt-output representation by forming tensor-product embeddings of the prompt and image (or text) features, and then computes the corresponding kernel covariance matrix. We utilize the eigenspace of the weighted difference between these matrices to identify the main directions of behavioral difference across prompts. To ensure scalability, we employ a random-projection approximation that reduces computational complexity to $O(nr^2 + r^3)$ for projection dimension $r$. We further provide a theoretical analysis showing that this approximation yields an eigenstructure estimate whose expected deviation from the full-dimensional result is bounded by $O(1/r^2)$. Experiments across text-to-image, text-to-text, and image-captioning settings demonstrate that PromptSplit accurately detects ground-truth behavioral differences and isolates the prompts responsible, offering an interpretable tool for detecting where generative models disagree.
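A hedged sketch of the core computation: flatten the tensor (outer) product of each prompt embedding with its output embedding, form a per-model covariance matrix, and take the top eigenvectors of the difference between the two models' matrices. This uses raw embeddings (a linear-kernel simplification), an unweighted difference, and a plain Gaussian random projection, so it only illustrates the mechanics rather than reproducing PromptSplit's estimators.

```python
import numpy as np

def joint_features(prompt_emb, output_emb):
    """Tensor-product (outer product) embedding of each prompt with its model output,
    flattened to one feature vector per sample."""
    return np.einsum("ni,nj->nij", prompt_emb, output_emb).reshape(len(prompt_emb), -1)

def disagreement_directions(prompts, out_a, out_b, top_k=3, proj_dim=None, seed=0):
    """Eigen-directions of the difference between the two models' joint covariance matrices.
    `proj_dim`, if given, applies a random projection to keep the eigen-problem small."""
    za, zb = joint_features(prompts, out_a), joint_features(prompts, out_b)
    if proj_dim is not None:                         # random-projection approximation
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((za.shape[1], proj_dim)) / np.sqrt(proj_dim)
        za, zb = za @ R, zb @ R
    diff = np.cov(za, rowvar=False) - np.cov(zb, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(diff)
    order = np.argsort(np.abs(eigvals))[::-1][:top_k]
    return eigvals[order], eigvecs[:, order]

# Projecting individual samples onto the top direction highlights the prompts on which
# the two models' behavior differs most.
```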
https://arxiv.org/abs/2602.04009
Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.
https://arxiv.org/abs/2602.03454
Accurate decision making in medical imaging requires reasoning over subtle visual differences between confusable conditions, yet most existing approaches rely on nearest neighbor retrieval that returns redundant evidence and reinforces a single hypothesis. We introduce a contrastive, document-aware reference selection framework that constructs compact evidence sets optimized for discrimination rather than similarity by explicitly balancing visual relevance, embedding diversity, and source-level provenance using ROCO embeddings and metadata. While ROCO provides large-scale image-caption pairs, it does not specify how references should be selected for contrastive reasoning, and naive retrieval frequently yields near-duplicate figures from the same document. To address this gap, we release a reproducible reference selection protocol and curated reference bank that enable a systematic study of contrastive retrieval in medical image reasoning. Building on these contrastive evidence sets, we propose Counterfactual-Contrastive Inference, a confidence-aware reasoning framework that performs structured pairwise visual comparisons and aggregates evidence using margin-based decision rules with faithful abstention. On the MediConfusion benchmark, our approach achieves state-of-the-art performance, improving set-level accuracy by nearly 15% relative to prior methods while reducing confusion and improving individual accuracy.
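The abstract names three selection criteria (visual relevance, embedding diversity, source-level provenance) and a margin-based decision rule with abstention. Below is a minimal greedy sketch under assumed weights `alpha`/`beta`, a per-document cap, and a fixed abstention margin; the paper's exact scoring and aggregation are not reproduced here.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_references(query_emb, candidates, k=4, alpha=1.0, beta=0.5, max_per_doc=1):
    """Greedy evidence-set construction: maximize relevance minus redundancy,
    with a per-document cap that enforces source-level diversity.
    `candidates` is a list of dicts: {'id': ..., 'doc_id': ..., 'emb': np.ndarray}."""
    selected, chosen_ids, per_doc = [], set(), {}
    while len(selected) < k:
        best, best_score = None, -np.inf
        for c in candidates:
            if c["id"] in chosen_ids or per_doc.get(c["doc_id"], 0) >= max_per_doc:
                continue
            relevance = cosine(query_emb, c["emb"])
            redundancy = max((cosine(c["emb"], s["emb"]) for s in selected), default=0.0)
            score = alpha * relevance - beta * redundancy
            if score > best_score:
                best, best_score = c, score
        if best is None:
            break
        selected.append(best)
        chosen_ids.add(best["id"])
        per_doc[best["doc_id"]] = per_doc.get(best["doc_id"], 0) + 1
    return selected

def decide_with_abstention(option_scores, margin=0.15):
    """Margin-based decision with abstention: answer only if the best option
    beats the runner-up by at least `margin`; otherwise return None (abstain)."""
    ranked = sorted(option_scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None
    return ranked[0][0]
```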
https://arxiv.org/abs/2602.02894
The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by applying Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS recycle bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds compared to greedy strategies, enhancing efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV cache memory by 41% with minimal accuracy loss (a 0.3% drop) on image understanding tasks and accelerates story-generation inference by 1.5x while maintaining output quality on the Phi3.5-Vision-Instruct model.
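The basic mechanic behind most KV cache eviction schemes is to keep the cached positions that receive the most attention. The sketch below shows only that primitive; HAE's dual-attention pruning criteria, layer-wise budgets, and decoding-time "recycle bin" logic are not reproduced here.

```python
import torch

def evict_kv_cache(keys, values, attn_weights, keep: int):
    """Keep only the `keep` cached positions with the highest accumulated attention mass.

    keys, values: (batch, heads, seq, head_dim); attn_weights: (batch, heads, q_len, seq).
    Returns pruned keys/values and the kept indices per batch element."""
    # Accumulate attention received by each cached position over heads and query steps.
    scores = attn_weights.sum(dim=(1, 2))                           # (batch, seq)
    kept = scores.topk(keep, dim=-1).indices.sort(dim=-1).values    # keep original order
    idx = kept[:, None, :, None].expand(-1, keys.size(1), -1, keys.size(-1))
    return keys.gather(2, idx), values.gather(2, idx), kept
```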
https://arxiv.org/abs/2602.02197
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
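Of the three ReAlign steps, centroid alignment is the simplest to illustrate: shift each text embedding by the difference between image and text centroids estimated from large unpaired corpora, then renormalize. The anchor and trace steps, and the anisotropic residual modeling, are not reproduced in this sketch.

```python
import numpy as np

def centroid_align(text_emb, text_centroid, image_centroid):
    """Shift text embeddings so their distribution centroid matches the image centroid.
    Centroids are estimated once from large unpaired text and image corpora."""
    shifted = text_emb - text_centroid + image_centroid
    return shifted / np.linalg.norm(shifted, axis=-1, keepdims=True)

# Hypothetical usage with precomputed statistics:
# mu_txt = unpaired_text_embeddings.mean(axis=0); mu_img = unpaired_image_embeddings.mean(axis=0)
# aligned = centroid_align(text_encoder(captions), mu_txt, mu_img)
```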
https://arxiv.org/abs/2602.07026
Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear distinction. The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
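A minimal sketch of scaling the hidden states at image-delimiter positions via a forward hook. The hook mechanism, the layer it is attached to, the scale factor, and the `IMG_START_ID`/`IMG_END_ID` names are assumptions; the paper's exact intervention points are not specified here.

```python
import torch

def make_delimiter_scaling_hook(delimiter_positions, scale: float = 1.5):
    """Forward hook that multiplies hidden states at image start/end delimiter positions
    by `scale`, strengthening their role as boundaries between images."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, dim)
        hidden = hidden.clone()
        hidden[:, delimiter_positions, :] *= scale
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a HuggingFace-style decoder layer:
# positions = [i for i, tok in enumerate(input_ids[0].tolist()) if tok in {IMG_START_ID, IMG_END_ID}]
# handle = model.language_model.layers[0].register_forward_hook(make_delimiter_scaling_hook(positions))
```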
https://arxiv.org/abs/2602.01984
Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bezier fitting and iterative internal region subdivision and meshing. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing. In addition, we introduce a locality-adaptive feature indexing mechanism to ensure spatial texture coherence, which further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.
https://arxiv.org/abs/2602.01881
Recent advances in large vision-language models (VLMs) have demonstrated generalizable open-vocabulary perception and reasoning, yet their real-robot manipulation capability remains unclear for long-horizon, closed-loop execution in unstructured, in-the-wild environments. Prior VLM-based manipulation pipelines are difficult to compare across different research groups' setups, and many evaluations rely on simulation, privileged state, or specially designed setups. We present AgenticLab, a model-agnostic robot agent platform and benchmark for open-world manipulation. AgenticLab provides a closed-loop agent pipeline for perception, task decomposition, online verification, and replanning. Using AgenticLab, we benchmark state-of-the-art VLM-based agents on real-robot tasks in unstructured environments. Our benchmark reveals several failure modes that offline vision-language tests (e.g., VQA and static image understanding) fail to capture, including breakdowns in multi-step grounding consistency, object grounding under occlusion and scene changes, and insufficient spatial reasoning for reliable manipulation. We will release the full hardware and software stack to support reproducible evaluation and accelerate research on general-purpose robot agents.
https://arxiv.org/abs/2602.01662
Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework, with the visual tokenizer acting as the sole interface that maps visual inputs into tokens for downstream tasks. However, existing shared-token designs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Therefore, we introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner, so the token budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance via mutual-information regularization. We integrate InfoTok into three representative unified MLLMs without introducing any additional training data. Experiments show consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space in unified MLLMs.
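A hedged illustration of the IB trade-off the abstract describes: a downstream task loss plus a β-weighted compression term. InfoTok's actual tokens and mutual-information estimators are not specified here; this sketch uses the standard Gaussian variational bound on I(X; T) purely as a continuous-relaxation example.

```python
import torch
import torch.nn.functional as F

def ib_tokenizer_loss(task_loss, token_mu, token_logvar, beta=1e-3):
    """Information-bottleneck-style objective: keep tokens useful for the downstream
    task while penalizing how much image information they carry.

    KL of a diagonal Gaussian token posterior q(t|x) = N(mu, diag(exp(logvar))) against a
    standard normal prior is a standard variational upper bound on I(X; T)."""
    kl = 0.5 * (token_mu.pow(2) + token_logvar.exp() - token_logvar - 1.0).sum(dim=-1).mean()
    return task_loss + beta * kl
```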
https://arxiv.org/abs/2602.01554
Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited amount of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks.
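The contrastive association between observations and action patterns can be sketched as a symmetric InfoNCE over a batch: observation i should match action chunk i. The encoders, the multi-view re-rendering, and the batch construction are left out as assumptions.

```python
import torch
import torch.nn.functional as F

def obs_action_infonce(obs_emb, act_emb, temperature=0.1):
    """Symmetric InfoNCE between observation and action embeddings of the same timestep.
    obs_emb, act_emb: (batch, dim) outputs of the observation and action encoders."""
    obs = F.normalize(obs_emb, dim=-1)
    act = F.normalize(act_emb, dim=-1)
    logits = obs @ act.t() / temperature
    targets = torch.arange(obs.size(0), device=obs.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```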
https://arxiv.org/abs/2602.00937
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
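After distillation, the retrieval step itself reduces to cosine ranking: the fine-tuned audio encoder's embedding is compared directly against image embeddings from the image-text model, since both now live in (approximately) the same text-anchored space. The encoder calls and embedding shapes below are placeholders.

```python
import numpy as np

def retrieve_images(audio_emb, image_embs, image_ids, top_k=5):
    """Rank reference images by cosine similarity to an audio embedding.
    audio_emb: (d,); image_embs: (n, d); image_ids: list of n identifiers."""
    a = audio_emb / np.linalg.norm(audio_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ a
    order = np.argsort(-sims)[:top_k]
    return [(image_ids[i], float(sims[i])) for i in order]
```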
https://arxiv.org/abs/2602.00681
Multi-agent debate can improve reasoning quality and reduce hallucinations, but it incurs rapidly growing context as debate rounds and agent count increase. Retaining full textual histories leads to token usage that can exceed context limits and often requires repeated summarization, adding overhead and compounding information loss. We introduce DebateOCR, a cross-modal compression framework that replaces long textual debate traces with compact image representations, which are then consumed through a dedicated vision encoder to condition subsequent rounds. This design compresses histories that commonly span tens to hundreds of thousands of tokens, cutting input tokens by more than 92% and yielding substantially lower compute cost and faster inference across multiple benchmarks. We further provide a theoretical perspective showing that diversity across agents supports recovery of omitted information: although any single compressed history may discard details, aggregating multiple agents' compressed views allows the collective representation to approach the information bottleneck with exponentially high probability.
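The compression step amounts to rendering a long textual debate history as an image for a vision encoder to consume instead of feeding the raw tokens back into the language-model context. A minimal sketch follows; the font, layout, and resolution are arbitrary choices, not the paper's rendering pipeline.

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_history_to_image(history: str, width=896, line_chars=110, line_height=14):
    """Render a debate transcript as a single RGB image."""
    font = ImageFont.load_default()          # default bitmap font; line_height only sets spacing
    lines = []
    for paragraph in history.splitlines():
        lines.extend(textwrap.wrap(paragraph, line_chars) or [""])
    height = line_height * (len(lines) + 2)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((8, 8 + i * line_height), line, fill="black", font=font)
    return img

# img = render_history_to_image(debate_round_text); img.save("round_1.png")
```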
https://arxiv.org/abs/2602.00454
Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprising captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available at: this https URL.
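For reference, a minimal sketch of a CLIPScore-style computation with the HuggingFace `transformers` CLIP classes, following the common 2.5 * max(cos, 0) formulation; the checkpoint shown is the standard English CLIP and is only a stand-in, since the exact model used for the Portuguese evaluation is not stated in the abstract.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Reference-free image-caption alignment: 2.5 * max(cosine(image, caption), 0)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return 2.5 * max(cos, 0.0)
```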
https://arxiv.org/abs/2602.00393