CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the effect of concept combinations in the training data on compositional generalization is largely unclear -- for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impact CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability they would co-occur if the words appeared independently. Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and a 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce the effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at this https URL.
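For concreteness, the PMI statistic above can be computed from simple corpus counts; here is a minimal sketch (the counting pipeline and the absence of smoothing are illustrative assumptions, not the paper's exact procedure):

```python
import math

def pmi(pair_count, count_a, count_b, total_pairs, total_words):
    """Pointwise mutual information of two words:
    PMI(a, b) = log p(a, b) / (p(a) * p(b)),
    i.e. the joint co-occurrence probability normalized by the probability
    expected if the two words appeared independently."""
    p_joint = pair_count / total_pairs
    p_a = count_a / total_words
    p_b = count_b / total_words
    return math.log(p_joint / (p_a * p_b))

# Toy numbers: "dog" and "frisbee" co-occur far more often than chance would predict.
print(pmi(pair_count=5_000, count_a=200_000, count_b=20_000,
          total_pairs=50_000_000, total_words=1_000_000_000))  # ~10.1 (high PMI)
```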
https://arxiv.org/abs/2507.08000
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, much like humans "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning that tests object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B and employ eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none reaches 60% accuracy; OpenAI-o3, for example, scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm that supervises localization and reasoning jointly with reinforcement learning, enabling accurate localization and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), showing that traceability is key to advancing vision-grounded reasoning. The code is available at this https URL.
https://arxiv.org/abs/2507.07999
LLMs are increasingly deployed as agents: systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches remain largely limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools but to invent them, advancing toward more agentic visual reasoning.
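As a rough illustration of the generate-execute-refine loop described above, here is a hedged sketch; the `mllm_generate_code` callable, the message format, and the `FINAL_ANSWER` stop convention are placeholders for illustration, not PyVision's actual interface:

```python
import contextlib
import io

def tool_loop(mllm_generate_code, task, image, max_turns=4):
    """Multi-turn loop: the model writes a Python tool, the tool is executed,
    and its printed output (or the error) is fed back so the model can refine it."""
    history = [{"role": "user", "content": f"Task: {task}"}]
    feedback = ""
    for _ in range(max_turns):
        code = mllm_generate_code(history, image)      # placeholder MLLM call
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, {"image": image})           # real systems would sandbox this
            feedback = buffer.getvalue()
        except Exception as err:                       # failures become refinement signals
            feedback = f"Execution failed: {err!r}"
        history.append({"role": "tool", "content": feedback})
        if "FINAL_ANSWER" in feedback:                 # hypothetical stop convention
            break
    return feedback
```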
https://arxiv.org/abs/2507.07998
Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality; however, a large gap still exists between VQ-VAEs and VAEs. To narrow this gap, we propose \NickName, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to rigorously evaluate the reconstruction performance of existing methods. \NickName~achieves \textbf{state-of-the-art performance on both ImageNet and $8$ zero-shot benchmarks} among all VQ-VAEs. Notably, compared with SD-VAE, we outperform it significantly on ImageNet, with an rFID of $\textbf{0.49}$ vs. $\textbf{0.91}$, and achieve superior PSNR on all zero-shot benchmarks. These results highlight the superiority of \NickName~in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at this https URL.
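To illustrate the idea of quantizing a full-dimensional latent with a set of sub-codebooks, here is a NumPy sketch; the chunking scheme, codebook sizes, and dimensions are illustrative assumptions rather than the paper's exact parameterization:

```python
import numpy as np

def quantize_with_subcodebooks(z, codebooks):
    """Quantize latent vectors z of shape (N, D) by splitting each vector into
    len(codebooks) contiguous chunks and snapping each chunk to its nearest
    code in the corresponding sub-codebook of shape (K, D_chunk)."""
    chunks = np.split(z, len(codebooks), axis=-1)
    quantized, indices = [], []
    for chunk, book in zip(chunks, codebooks):
        # Squared distances between every chunk and every code word.
        d = ((chunk[:, None, :] - book[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        quantized.append(book[idx])
        indices.append(idx)
    return np.concatenate(quantized, axis=-1), np.stack(indices, axis=-1)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                          # 8 latents, full dimension 16
books = [rng.normal(size=(32, 4)) for _ in range(4)]  # 4 sub-codebooks of 32 codes each
z_q, codes = quantize_with_subcodebooks(z, books)
print(z_q.shape, codes.shape)                         # (8, 16) (8, 4)
```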
https://arxiv.org/abs/2507.07997
According to Algorithmic Information Theory (AIT), intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require a test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL's training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization, and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity, revealing alignment with human intuition.
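A minimal sketch of single-pass halting, assuming the encoder emits one halting score per token position conditioned on a target reconstruction quality (the scoring head itself, and this thresholding rule, are assumptions for illustration):

```python
import numpy as np

def token_budget(halt_logits, max_tokens=256, threshold=0.5):
    """Keep tokens up to the first position whose predicted halting probability
    exceeds the threshold; the resulting count acts as a proxy for the image's
    minimum description length (approximate Kolmogorov Complexity)."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(halt_logits, dtype=float)))
    halted = np.flatnonzero(probs > threshold)
    return int(halted[0]) + 1 if halted.size else max_tokens

print(token_budget([-3.0, -2.0, -1.5, 0.2, 1.0]))  # -> 4: halts at the 4th token
```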
https://arxiv.org/abs/2507.07995
Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. We address this gap by leveraging sketches, a popular form of human expression, as a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup combined with a grid-based locator and prototypical domain adaptation. Extensive experiments also demonstrate successful few-shot convergence across novel keypoints and classes.
https://arxiv.org/abs/2507.07994
Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods.
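As a toy illustration of segmentation-based scoring at several granularities, assuming decoded and ground-truth masks at each level have already been put into correspondence (the matching step is part of the benchmark and omitted here):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def multigranular_scores(matched_masks):
    """matched_masks: dict mapping a granularity level ('foreground', 'semantic',
    'instance', 'component') to a list of (decoded_mask, ground_truth_mask) pairs."""
    return {level: float(np.mean([mask_iou(p, g) for p, g in pairs]))
            for level, pairs in matched_masks.items()}

# Toy example: one 4x4 foreground mask pair.
pred = np.zeros((4, 4)); pred[:2, :2] = 1
gt = np.zeros((4, 4)); gt[:2, :3] = 1
print(multigranular_scores({"foreground": [(pred, gt)]}))   # {'foreground': 0.666...}
```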
https://arxiv.org/abs/2507.07993
Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data, which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at this https URL.
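A rough sketch of the coarse-to-fine quadtree allocation, using the spread of features within a region as a stand-in split criterion (the actual similarity test, and the subsequent directed temporal merging, are not detailed here and this criterion is an assumption):

```python
import numpy as np

def quadtree_tokens(feats, threshold, min_size=2):
    """Coarse-to-fine pass over an (H, W, C) grid of patch features (H, W powers of two).
    A region is represented by a single coarse token (its mean feature) when its
    features are similar enough; otherwise it is split into four quadrants."""
    tokens = []

    def visit(y0, x0, h, w):
        region = feats[y0:y0 + h, x0:x0 + w]
        mean = region.reshape(-1, region.shape[-1]).mean(axis=0)
        spread = np.linalg.norm(region - mean, axis=-1).max()
        if spread < threshold or h <= min_size or w <= min_size:
            tokens.append(mean)                        # one token covers the whole region
        else:
            hh, wh = h // 2, w // 2
            for dy, dx in ((0, 0), (0, wh), (hh, 0), (hh, wh)):
                visit(y0 + dy, x0 + dx, hh, wh)

    visit(0, 0, feats.shape[0], feats.shape[1])
    return np.stack(tokens)

feats = np.random.default_rng(0).normal(size=(16, 16, 8))
print(quadtree_tokens(feats, threshold=3.0).shape)     # (num_tokens, 8)
```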
https://arxiv.org/abs/2507.07990
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing strategies for evaluating LLMs' medical reasoning either provide unsatisfactory assessments or scale poorly, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.
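A hedged sketch of how an LLM-as-a-Judge pass over fine-grained reference rationales might be organized; the prompt wording, the `judge_llm` callable, and the coverage-fraction scoring convention are assumptions for illustration, not the LLM-w-Ref specification:

```python
def score_reasoning(judge_llm, question, model_rationale, expert_steps):
    """Ask a judge model whether each expert-crafted reference step is covered by
    the evaluated model's rationale, and return the fraction of covered steps."""
    covered = 0
    for step in expert_steps:
        prompt = (
            f"Question: {question}\n"
            f"Model reasoning: {model_rationale}\n"
            f"Reference step: {step}\n"
            "Does the model reasoning correctly cover this reference step? Answer yes or no."
        )
        if judge_llm(prompt).strip().lower().startswith("yes"):
            covered += 1
    return covered / len(expert_steps) if expert_steps else 0.0
```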
https://arxiv.org/abs/2507.07988
We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike the simpler Gaussian policies commonly used in online RL, expressive policies such as diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead constructing an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher-value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value-maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to a 2-3x improvement in sample efficiency on average over prior methods, both when fine-tuning a pretrained policy given offline data and when leveraging offline data to train online.
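A schematic of the on-the-fly action selection described above, under the assumptions that the base and edit policies are provided as callables and that the edit is an additive correction (training of both policies and the TD backup are omitted):

```python
import numpy as np

def on_the_fly_action(state, base_policy, edit_policy, q_fn, n_samples=4):
    """Sample candidate actions from the expressive base policy, perturb each with
    the lightweight Gaussian edit policy, and return whichever candidate
    (base or edited) the learned Q-function scores highest."""
    candidates = []
    for _ in range(n_samples):
        a = base_policy(state)               # e.g. a diffusion / flow-matching sample
        a_edit = a + edit_policy(state, a)   # small correction toward higher value
        candidates.extend([a, a_edit])
    q_values = [q_fn(state, a) for a in candidates]
    return candidates[int(np.argmax(q_values))]
```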
https://arxiv.org/abs/2507.07986
Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as vision encoders for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image is of "a yellow submarine and a blue bus" or "a blue submarine and a yellow bus". Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights needed to solve the binding problem for CLIP are hidden in what is arguably the most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP's ability to learn binding using a synthetic dataset. We find that common properties of natural data, such as low attribute density, incomplete captions, and the saliency bias (a tendency of human captioners to describe the object that is "most salient" to them), have a detrimental effect on binding performance. In contrast to common belief, we find that neither scaling the batch size, i.e., implicitly adding more hard negatives, nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified data properties does CLIP learn almost perfect binding.
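To make the three data properties concrete, here is a toy caption generator over (attribute, object, saliency) scenes; the parameter names and the exact generative process are illustrative assumptions, not the paper's synthetic pipeline:

```python
import random

def make_caption(objects, attribute_density=1.0, caption_completeness=1.0,
                 saliency_bias=0.0, seed=0):
    """objects: list of (attribute, noun, saliency) triples describing the image.
    Lower attribute_density drops attribute words, lower caption_completeness drops
    whole objects, and higher saliency_bias collapses the caption to the single
    most salient object; these mirror the three properties discussed above."""
    rng = random.Random(seed)
    kept = list(objects)
    if kept and rng.random() < saliency_bias:
        kept = [max(kept, key=lambda o: o[2])]       # describe only the most salient object
    kept = [o for o in kept if rng.random() < caption_completeness]
    phrases = [f"a {attr} {noun}" if rng.random() < attribute_density else f"a {noun}"
               for attr, noun, _ in kept]
    return " and ".join(phrases) if phrases else "a scene"

scene = [("yellow", "submarine", 0.9), ("blue", "bus", 0.4)]
print(make_caption(scene, attribute_density=0.5, caption_completeness=0.8, saliency_bias=0.7))
```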
https://arxiv.org/abs/2507.07985
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models in offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly degrade model performance, along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are available. Our project page is: this https URL
https://arxiv.org/abs/2507.07984
Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.
https://arxiv.org/abs/2507.07983
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from the normalized diffusion representation. We evaluate Geometry Forcing on both camera-view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: this https URL.
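As a sketch of the two alignment terms, in the general shape described above (the projection heads, normalization details, and loss weighting are assumptions, not the paper's exact formulation):

```python
import numpy as np

def geometry_forcing_losses(diff_feats, geo_feats, proj, scale_head, eps=1e-8):
    """diff_feats: (N, D) intermediate features from the video diffusion model.
    geo_feats:   (N, G) features from a frozen geometric foundation model.
    proj:        (D, G) map from diffusion space into geometry space.
    scale_head:  (G, G) regression head used by the scale term."""
    pred = diff_feats @ proj

    # Angular Alignment: directional consistency via cosine similarity.
    cos = (pred * geo_feats).sum(-1) / (
        np.linalg.norm(pred, axis=-1) * np.linalg.norm(geo_feats, axis=-1) + eps)
    angular = (1.0 - cos).mean()

    # Scale Alignment: regress the *unnormalized* geometric features from the
    # *normalized* diffusion-side prediction, preserving scale information.
    pred_unit = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    scale = ((pred_unit @ scale_head - geo_feats) ** 2).mean()

    return angular, scale

rng = np.random.default_rng(0)
d, g = rng.normal(size=(32, 64)), rng.normal(size=(32, 16))
print(geometry_forcing_losses(d, g, rng.normal(size=(64, 16)), rng.normal(size=(16, 16))))
```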
https://arxiv.org/abs/2507.07982
Robots can better interact with humans and unstructured environments through touch sensing. However, most commercial robots are not equipped with tactile skins, making it challenging to achieve even basic touch-sensing functions, such as contact localization. We present UniTac, a data-driven whole-body touch-sensing approach that uses only proprioceptive joint sensors and does not require the installation of additional sensors. Our approach enables a robot equipped solely with joint sensors to localize contacts. Our goal is to democratize touch sensing and provide an off-the-shelf tool for HRI researchers to provide their robots with touch-sensing capabilities. We validate our approach on two platforms: the Franka robot arm and the Spot quadruped. On Franka, we can localize contact to within 8.0 centimeters, and on Spot, we can localize to within 7.2 centimeters at around 2,000 Hz on an RTX 3090 GPU without adding any additional sensors to the robot. Project website: this https URL.
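A toy stand-in for the data-driven mapping from proprioceptive joint signals to a contact location; UniTac's actual model is learned from data and is presumably more expressive, so this least-squares regressor only illustrates the input/output relationship, and the feature choices are assumptions:

```python
import numpy as np

class JointContactRegressor:
    """Fit a linear map from joint-signal features (e.g. torque and velocity
    residuals over a short window) to 3D contact coordinates on the robot body."""
    def fit(self, joint_features, contact_xyz):
        x = np.hstack([joint_features, np.ones((len(joint_features), 1))])  # add bias
        self.w, *_ = np.linalg.lstsq(x, contact_xyz, rcond=None)
        return self
    def predict(self, joint_features):
        x = np.hstack([joint_features, np.ones((len(joint_features), 1))])
        return x @ self.w

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 14))                 # e.g. 7 joint torques + 7 velocities
true_w = rng.normal(size=(14, 3))
contacts = feats @ true_w + 0.01 * rng.normal(size=(200, 3))
model = JointContactRegressor().fit(feats, contacts)
print(np.abs(model.predict(feats) - contacts).mean())   # small residual on toy data
```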
https://arxiv.org/abs/2507.07980
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
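For reference, a minimal sketch of the two reward parameterizations, following the standard formulations (EX-RM: a linear head over the hidden representation; IM-RM: a DPO-style log-probability ratio against a reference model); the paper's exact training setup may differ:

```python
import numpy as np

def explicit_reward(last_hidden_state, w, b=0.0):
    """EX-RM: a dedicated linear head applied to the language model's hidden
    representation of the (prompt, response) pair."""
    return float(last_hidden_state @ w + b)

def implicit_reward(logp_policy_tokens, logp_reference_tokens, beta=1.0):
    """IM-RM: the reward implicitly defined by the language model itself, computed
    as the scaled log-probability ratio of the response under the policy versus a
    reference model, summed over response tokens; no extra head is needed."""
    return beta * (np.sum(logp_policy_tokens) - np.sum(logp_reference_tokens))

h = np.random.default_rng(0).normal(size=128)
print(explicit_reward(h, w=np.ones(128) / 128))
print(implicit_reward([-1.2, -0.3, -0.8], [-1.5, -0.9, -1.1]))   # positive: higher reward
```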
https://arxiv.org/abs/2507.07981
Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) a data curation pipeline, Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images sourced from NASA's Planetary Data System (PDS) and renders high-fidelity multiview 3D video sequences; and 2) a Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
https://arxiv.org/abs/2507.07978
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
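A sketch of the unbiased n-step backup over one executed action chunk (the chunked policy and Q-function are passed in as callables; the actor and critic updates themselves are not shown):

```python
def chunked_td_target(chunk_rewards, next_state, policy, q_fn, gamma=0.99):
    """Unbiased n-step target over a chunk of length h that was actually executed:
    target = r_t + gamma*r_{t+1} + ... + gamma^(h-1)*r_{t+h-1} + gamma^h * Q(s_{t+h}, a_{t+h}),
    where a_{t+h} is the next action chunk proposed by the (chunked) policy."""
    h = len(chunk_rewards)
    n_step_return = sum((gamma ** i) * r for i, r in enumerate(chunk_rewards))
    next_chunk = policy(next_state)
    return n_step_return + (gamma ** h) * q_fn(next_state, next_chunk)

# Toy usage with stand-in callables.
print(chunked_td_target([0.0, 0.0, 1.0], next_state=None,
                        policy=lambda s: "chunk", q_fn=lambda s, a: 0.5))
```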
https://arxiv.org/abs/2507.07969
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long-video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to a 2.1x speedup in long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (the VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
https://arxiv.org/abs/2507.07966
Although the memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, which requires deep contextual understanding and to which no existing memory system can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
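To make the six memory types tangible, here is a toy container with a trivial routing and retrieval rule; the type names mirror the list above, but the routing and matching logic is a stand-in, not MIRIX's multi-agent controller:

```python
from dataclasses import dataclass, field

MEMORY_TYPES = ("core", "episodic", "semantic", "procedural", "resource", "knowledge_vault")

@dataclass
class MemoryStore:
    """Minimal container mirroring MIRIX's six memory types."""
    stores: dict = field(default_factory=lambda: {name: [] for name in MEMORY_TYPES})

    def write(self, item: dict) -> None:
        kind = item.get("type", "episodic")            # default: log as an episodic event
        if kind not in self.stores:
            raise ValueError(f"unknown memory type: {kind}")
        self.stores[kind].append(item)

    def retrieve(self, kind: str, query: str) -> list:
        # Naive substring match; a real system would use embeddings and ranking.
        return [m for m in self.stores[kind] if query.lower() in str(m).lower()]

mem = MemoryStore()
mem.write({"type": "semantic", "fact": "User prefers dark mode"})
print(mem.retrieve("semantic", "dark mode"))
```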
https://arxiv.org/abs/2507.07957