Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel at basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations; (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Using the lightweight GPT-4o-mini model, our framework achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a substantial improvement on the Chinese benchmark, performing comparably to GPT-4o on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
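A minimal sketch of the perception-search-reasoning flow described above. The helper names (`caption_image`, `search_knowledge`, `llm`) are hypothetical stand-ins; the paper's actual prompts, retrieval backend, and model endpoints are not reproduced here.

```python
# Sketch of a three-stage pipeline in the spirit of LAD; all helpers are stubs.

def caption_image(image_path: str) -> dict:
    """Stage 1 (Perception): stand-in for a vision model producing multi-level
    textual descriptions (objects, scene, salient text, style)."""
    return {"objects": "...", "scene": "...", "text_in_image": "...", "style": "..."}

def search_knowledge(query: str) -> str:
    """Stage 2 (Search): stand-in for an external knowledge/search call used to
    resolve cultural or contextual references found in the description."""
    return f"[background knowledge retrieved for: {query}]"

def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call (e.g., a lightweight GPT-4o-mini endpoint)."""
    return "[model output]"

def understand_implication(image_path: str) -> str:
    perception = caption_image(image_path)                      # visual -> text
    ambiguous = llm(f"List ambiguous symbols or references in: {perception}")
    knowledge = [search_knowledge(item) for item in ambiguous.split("\n") if item]
    return llm(                                                  # Stage 3 (Reasoning)
        "Given the description and background knowledge, explain the image's "
        f"implication step by step.\nDescription: {perception}\nKnowledge: {knowledge}"
    )

print(understand_implication("example.jpg"))
```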
https://arxiv.org/abs/2505.17019
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, improving the lightweight QueST model by 21.2% and bringing the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally and data efficient: with only one demonstration, RIPT-VLA enables an otherwise unworkable SFT model (4% success rate) to reach a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models with minimal supervision.
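A minimal numpy sketch of leave-one-out advantage estimation over a group of rollouts with sparse binary rewards; the group size and rewards below are illustrative, and the rest of the RIPT-VLA pipeline (dynamic rollout sampling, the VLA policy update) is not shown.

```python
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantage: each rollout is compared against the mean reward
    of the *other* rollouts in its group, which keeps the baseline unbiased even
    with sparse binary (0/1) success rewards."""
    k = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (k - 1)   # mean of the other K-1 rollouts
    return rewards - baseline

# Example: 5 rollouts of the same task, two of which succeed.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
print(leave_one_out_advantages(rewards))             # successes get positive advantage
```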
https://arxiv.org/abs/2505.17016
Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
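A minimal PyTorch sketch of a soft prefix as described above: learnable real-valued vectors, outside the token alphabet, prepended to the embedded input while the backbone stays frozen. The prefix length and dimensions are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class SoftPrefix(nn.Module):
    """Learnable 'soft prefix': real-valued vectors prepended to the embedded
    sequence (prefix-tuning in its simplest form)."""
    def __init__(self, prefix_len: int, d_model: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        batch = token_embeddings.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)

# Usage: only the prefix parameters are trained; the pretrained network is frozen.
embed = nn.Embedding(1000, 64)
soft_prefix = SoftPrefix(prefix_len=10, d_model=64)
tokens = torch.randint(0, 1000, (2, 16))
x = soft_prefix(embed(tokens))   # (2, 26, 64), fed to the frozen model
print(x.shape)
```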
https://arxiv.org/abs/2505.17010
Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields hierarchical segmentation that arises natively in the feature extraction process, giving rise to our Native Segmentation Vision Transformer. We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is this https URL.
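An illustrative PyTorch sketch of the general idea of a content-aware grouping step: a reduced set of group tokens pools the input with a content-dependent soft assignment instead of a fixed grid. This is a generic construction for illustration only, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class GroupingLayer(nn.Module):
    """Assigns N input tokens to M << N group tokens via a content-dependent
    soft assignment, replacing uniform (grid) downsampling. Illustrative only."""
    def __init__(self, dim: int, num_groups: int):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, n_tokens, dim)
        q = self.to_q(self.group_tokens)                   # (m, dim)
        k = self.to_k(x)                                   # (b, n, dim)
        logits = torch.einsum("md,bnd->bmn", q, k) / k.shape[-1] ** 0.5
        assign = logits.softmax(dim=1)                     # each token spread over groups
        grouped = torch.einsum("bmn,bnd->bmd", assign, x)  # content-weighted pooling
        grouped = grouped / assign.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return grouped, assign                             # assignments double as masks

layer = GroupingLayer(dim=64, num_groups=8)
tokens = torch.randn(2, 196, 64)
grouped, assign = layer(tokens)
print(grouped.shape, assign.shape)                         # (2, 8, 64), (2, 8, 196)
```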
https://arxiv.org/abs/2505.16993
Single-agent LLMs hit hard limits--finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators--no ever-larger monoliths required.
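A toy sketch of the "typed subtask plus controller-side check" pattern on the Knapsack example. The agent call is a greedy stub standing in for an LLM worker, and the exact dynamic-programming check is an assumed verification step, not necessarily the paper's.

```python
from dataclasses import dataclass

@dataclass
class KnapsackTask:
    weights: list
    values: list
    capacity: int

def agent_propose(task: KnapsackTask) -> list:
    """Stub for an agent proposing item indices (zero-shot or lightly boosted);
    here it simply packs greedily by value/weight ratio."""
    order = sorted(range(len(task.weights)),
                   key=lambda i: task.values[i] / task.weights[i], reverse=True)
    chosen, load = [], 0
    for i in order:
        if load + task.weights[i] <= task.capacity:
            chosen.append(i)
            load += task.weights[i]
    return chosen

def exact_optimum(task: KnapsackTask) -> int:
    """Controller-side self-check: exact DP optimum used to verify the agent's output."""
    dp = [0] * (task.capacity + 1)
    for w, v in zip(task.weights, task.values):
        for c in range(task.capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[task.capacity]

task = KnapsackTask(weights=[2, 3, 4, 5], values=[3, 4, 5, 8], capacity=7)
proposal = agent_propose(task)
value = sum(task.values[i] for i in proposal)
print(proposal, value, "verified:", value == exact_optimum(task))
```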
https://arxiv.org/abs/2505.16979
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45\% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the \textit{hard} split, underscoring the value of its high-quality training data. Code is available here \href{this https URL}{this https URL}.
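A hedged sketch of how executable unit tests can be turned into an RL reward signal: run the instance's tests in its environment and score by the outcome. The repository layout, pytest invocation, and binary (rather than pass-fraction) scoring are assumptions for illustration.

```python
import subprocess

def unit_test_reward(repo_dir: str, test_path: str = "tests", timeout: int = 600) -> float:
    """Run an instance's developer-authored tests and convert the outcome into a
    scalar reward. Binary here (all tests pass -> 1.0); a pass fraction could be
    parsed from pytest's report instead."""
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", test_path, "-q"],
            cwd=repo_dir, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.0                      # hanging code counts as a failure
    return 1.0 if result.returncode == 0 else 0.0

# reward = unit_test_reward("/path/to/swe-dev-instance")  # used as the RL signal
```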
https://arxiv.org/abs/2505.16975
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30\%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
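A sketch of the filtering idea behind the intelligibility-based assessment: synthesize speech from text, transcribe it back with an ASR model, and keep only utterances whose word error rate falls below a threshold. The TTS/ASR calls are stubs and the threshold is an assumption, not the paper's value; only the WER computation is concrete.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Standard WER via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def synthesize(text: str):            # stub: would call the trained TTS model
    return b"...waveform bytes..."

def transcribe(audio) -> str:         # stub: would call an ASR model (e.g., Whisper)
    return "example transcript"

def keep_for_training(text: str, wer_threshold: float = 0.3) -> bool:
    """Intelligibility filter: accept synthetic speech only if ASR can recover the text."""
    audio = synthesize(text)
    return word_error_rate(text.lower(), transcribe(audio).lower()) <= wer_threshold

print(keep_for_training("this is a test sentence"))
```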
https://arxiv.org/abs/2505.16972
We introduce \texttt{CASS}, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA~$\leftrightarrow$~HIP) and assembly-level (Nvidia SASS~$\leftrightarrow$~AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the \texttt{CASS} family of domain-specific language models, achieving 95\% source translation accuracy and 37.5\% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85\% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce \texttt{CASS-Bench}, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on \href{this https URL}{\textcolor{blue}{HuggingFace}}, with code at \href{this https URL}{\textcolor{blue}{GitHub}}.
https://arxiv.org/abs/2505.16968
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
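A hedged sketch of the cascading idea: a cheap judge screens hard negatives first, and only candidates it does not confidently reject are escalated to a stronger judge. The `judge` call is a stub, and the prompts, labels, and model names are placeholders rather than the paper's exact cascade.

```python
def judge(model: str, query: str, passage: str) -> str:
    """Stub for an LLM relevance-judgment call; returns 'relevant', 'irrelevant',
    or 'unsure'. Prompts and model names are placeholders."""
    return "unsure"

def relabel_hard_negative(query: str, passage: str) -> str:
    """Cascade: cheap model first, stronger model only when needed."""
    first = judge("small-judge", query, passage)
    if first == "irrelevant":
        return "negative"                       # keep as a true hard negative
    second = judge("strong-judge", query, passage)
    return "positive" if second == "relevant" else "negative"

# Relabeled false negatives are then treated as additional positives during training.
print(relabel_hard_negative("what causes tides", "Tides are caused by the moon's gravity."))
```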
https://arxiv.org/abs/2505.16967
Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. We prove using IB theory that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming both vanilla Transformers with up to 3.5x more parameters, as well as heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework to manipulate Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.
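A hedged PyTorch sketch of the mechanism described: every `period` decoding steps, a module globally rewrites the cached keys and values. The single bidirectional attention layer used here is an illustrative choice, not the paper's module design.

```python
import torch
import torch.nn as nn

class KVRewriter(nn.Module):
    """Illustrative 'global rewrite' of the KV cache, applied periodically so the
    cache re-encodes features useful for predicting future tokens rather than
    memorising the input prefix verbatim. Architecture here is an assumption."""
    def __init__(self, d_model: int, period: int = 64):
        super().__init__()
        self.period = period
        self.mix = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.proj_k = nn.Linear(d_model, d_model)
        self.proj_v = nn.Linear(d_model, d_model)

    def maybe_rewrite(self, step: int, k_cache: torch.Tensor, v_cache: torch.Tensor):
        # k_cache, v_cache: (batch, cached_len, d_model)
        if step == 0 or step % self.period != 0:
            return k_cache, v_cache
        joint = k_cache + v_cache
        mixed, _ = self.mix(joint, joint, joint)   # global, bidirectional over the cache
        return self.proj_k(mixed), self.proj_v(mixed)

rewriter = KVRewriter(d_model=64, period=8)
k, v = torch.randn(1, 32, 64), torch.randn(1, 32, 64)
k2, v2 = rewriter.maybe_rewrite(step=8, k_cache=k, v_cache=v)
print(k2.shape, v2.shape)
```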
https://arxiv.org/abs/2505.16950
Foundation models hold significant promise in healthcare, given their capacity to extract meaningful representations independent of downstream tasks. This property has enabled state-of-the-art performance across several clinical applications trained on structured electronic health record (EHR) data, even in settings with limited labeled data, a prevalent challenge in healthcare. However, there is little consensus on these models' potential for clinical utility, owing to the lack of comprehensive, meaningful tasks and sufficiently diverse evaluations that characterize their benefit over conventional supervised learning. To address this gap, we propose a suite of clinically meaningful tasks spanning patient outcomes and early prediction of acute and chronic conditions, together with desiderata for robust evaluations. We evaluate state-of-the-art foundation models on EHR data consisting of 5 million patients from Columbia University Irving Medical Center (CUMC), a large urban academic medical center in New York City, across 14 clinically relevant tasks. We measure overall accuracy, calibration, and subpopulation performance to surface tradeoffs based on the choice of pre-training, tokenization, and data representation strategies. Our study aims to advance the empirical evaluation of structured EHR foundation models and guide the development of future healthcare foundation models.
https://arxiv.org/abs/2505.16941
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making it stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
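For reference, the classical fixed-coefficient Newton-Schulz iteration below shows the matmul-only structure that such polynomial methods share; Polar Express instead chooses the polynomial coefficients per iteration by solving a minimax problem, which is not reproduced here.

```python
import numpy as np

def polar_factor_newton_schulz(A: np.ndarray, steps: int = 40) -> np.ndarray:
    """Matmul-only iteration converging to the orthogonal polar factor U of A = U P.
    Classical Newton-Schulz with fixed coefficients (3/2, -1/2); note the slow
    initial convergence the abstract refers to when singular values start small."""
    X = A / np.linalg.norm(A)            # Frobenius norm bounds the spectral norm,
    for _ in range(steps):               # so all singular values start in (0, 1]
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
U = polar_factor_newton_schulz(A)
print(np.abs(U @ U.T - np.eye(8)).max())   # small => U is numerically orthogonal
```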
https://arxiv.org/abs/2505.16932
We introduce a new paradigm for active sound modification: Active Speech Enhancement (ASE). While Active Noise Cancellation (ANC) algorithms focus on suppressing external interference, ASE goes further by actively shaping the speech signal -- both attenuating unwanted noise components and amplifying speech-relevant frequencies -- to improve intelligibility and perceptual quality. To enable this, we propose a novel Transformer-Mamba-based architecture, along with a task-specific loss function designed to jointly optimize interference suppression and signal enrichment. Our method outperforms existing baselines across multiple speech processing tasks -- including denoising, dereverberation, and declipping -- demonstrating the effectiveness of active, targeted modulation in challenging acoustic environments.
https://arxiv.org/abs/2505.16911
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.
https://arxiv.org/abs/2505.16875
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: this https URL
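A hedged sketch of the progressive-resolution half of the pipeline: early denoising steps run on a coarse latent that is upsampled at scheduled steps. The stage boundaries, interpolation choice, and denoising stub are illustrative, and the attention-carving component is not shown.

```python
import torch
import torch.nn.functional as F

def resolution_schedule(num_steps: int, stages=((0.0, 0.25), (0.5, 0.5), (0.8, 1.0))):
    """Fraction of full resolution per denoising step; early steps run coarse.
    The stage boundaries here are illustrative, not the paper's schedule."""
    scales = []
    for t in range(num_steps):
        frac = t / max(num_steps - 1, 1)
        scales.append(next(s for start, s in reversed(stages) if frac >= start))
    return scales

def denoise_step(latent: torch.Tensor, step: int) -> torch.Tensor:
    """Stub for one DiT denoising step (the real model would also apply
    block-wise 'carved' attention here)."""
    return latent - 0.01 * torch.randn_like(latent)

full_hw, steps = 64, 50
latent = None
for step, scale in enumerate(resolution_schedule(steps)):
    hw = int(full_hw * scale)
    if latent is None:
        latent = torch.randn(1, 4, hw, hw)                 # start from a coarse latent
    elif latent.shape[-1] != hw:
        latent = F.interpolate(latent, size=(hw, hw), mode="bilinear", align_corners=False)
    latent = denoise_step(latent, step)
print(latent.shape)                                         # full resolution only at the end
```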
https://arxiv.org/abs/2505.16864
Improving the performance of pre-trained policies through online reinforcement learning (RL) is a critical yet challenging topic. Existing online RL fine-tuning methods require continued training with offline pretrained Q-functions for stability and performance. However, these offline pretrained Q-functions commonly underestimate state-action pairs beyond the offline dataset due to the conservatism in most offline RL methods, which hinders further exploration when transitioning from the offline to the online setting. Additionally, this requirement limits their applicability in scenarios where only pre-trained policies are available but pre-trained Q-functions are absent, such as in imitation learning (IL) pre-training. To address these challenges, we propose a method for efficient online RL fine-tuning using solely the offline pre-trained policy, eliminating reliance on pre-trained Q-functions. We introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase to avoid detrimental pessimism. Our method not only achieves competitive performance with advanced offline-to-online RL algorithms and online RL approaches that leverage data or policies prior, but also pioneers a new path for directly fine-tuning behavior cloning (BC) policies.
https://arxiv.org/abs/2505.16856
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where people skip reasoning for easy questions but think carefully when needed-we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across diverse vision-language tasks-covering a range of reasoning difficulties under both 3B and 7B models-consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at this https URL.
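A small sketch of the 'thought dropout' operation in the SFT stage: with some probability, a training example's reasoning trace is replaced by an empty thought while the answer is kept, yielding think-or-not style targets. The tag template and dropout probability are assumptions, not TON's exact settings.

```python
import random

EMPTY_THOUGHT = "<think>\n\n</think>"   # tag format is an assumption, not TON's exact template

def thought_dropout(example: dict, p_drop: float = 0.5, rng=random) -> dict:
    """With probability p_drop, drop the reasoning trace but keep the final answer."""
    if rng.random() < p_drop:
        target = f"{EMPTY_THOUGHT}\n{example['answer']}"
    else:
        target = f"<think>\n{example['reasoning']}\n</think>\n{example['answer']}"
    return {"prompt": example["prompt"], "target": target}

sample = {"prompt": "How many legs do 3 cats have?",
          "reasoning": "Each cat has 4 legs; 3 * 4 = 12.",
          "answer": "12"}
random.seed(0)
print(thought_dropout(sample)["target"])
```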
https://arxiv.org/abs/2505.16854
This paper introduces a method for detecting inappropriately targeting language in online conversations by integrating crowd and expert annotations with ChatGPT. We focus on English conversation threads from Reddit, examining comments that target individuals or groups. Our approach involves a comprehensive annotation framework that labels a diverse data set for various target categories and specific target words within the conversational context. We perform a comparative analysis of annotations from human experts, crowd annotators, and ChatGPT, revealing strengths and limitations of each method in recognizing both explicit hate speech and subtler discriminatory language. Our findings highlight the significant role of contextual factors in identifying hate speech and uncover new categories of targeting, such as social belief and body image. We also address the challenges and subjective judgments involved in annotation and the limitations of ChatGPT in grasping nuanced language. This study provides insights for improving automated content moderation strategies to enhance online safety and inclusivity.
https://arxiv.org/abs/2505.16847
Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
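A hedged sketch of one of the techniques named above, complementary masking: each example yields two masked views whose masks are complements, so every answer token is supervised in exactly one view. The mask-token id and mask ratio are placeholders for illustration.

```python
import torch

MASK_ID = 0   # placeholder mask-token id; a real tokenizer defines its own

def complementary_masking(answer_ids: torch.Tensor, mask_ratio: float = 0.5):
    """Produce two masked copies of the answer tokens whose masks are complements,
    so every token contributes to the diffusion loss in exactly one copy."""
    mask_a = torch.rand_like(answer_ids, dtype=torch.float) < mask_ratio
    mask_b = ~mask_a
    view_a = torch.where(mask_a, torch.full_like(answer_ids, MASK_ID), answer_ids)
    view_b = torch.where(mask_b, torch.full_like(answer_ids, MASK_ID), answer_ids)
    return (view_a, mask_a), (view_b, mask_b)   # loss is computed on the masked positions

answer = torch.tensor([[11, 12, 13, 14, 15, 16]])
(view_a, mask_a), (view_b, mask_b) = complementary_masking(answer)
print(view_a, view_b, bool((mask_a ^ mask_b).all()))   # the two masks are exact complements
```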
https://arxiv.org/abs/2505.16839
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at this https URL and this https URL.
https://arxiv.org/abs/2505.16832