We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at this https URL.
https://arxiv.org/abs/2502.03387
We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation: unlike its consecutive counterpart, where one waits for the end of the source utterance before starting to translate, simultaneous interpretation must adapt its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples as well as models and inference code.
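The per-word delay labeling admits a compact sketch. The following is a hypothetical reading of the idea, assuming a generic `logprob` scorer standing in for the off-the-shelf translator and an arbitrary `slack` threshold; the paper's exact criterion may differ.

```python
# Hypothetical sketch of the per-word delay idea: for each target word, find the
# shortest source prefix under which an off-the-shelf translator already assigns
# it (nearly) the same log-probability as under the full source. The scoring
# function, slack threshold, and tokenization are assumptions, not the paper's recipe.

from typing import Callable, List

def per_word_delays(
    source_words: List[str],
    target_words: List[str],
    logprob: Callable[[List[str], List[str], str], float],
    slack: float = 0.5,
) -> List[int]:
    """For each target word, return how many source words must be heard before
    the translator is "confident enough" to emit it.

    `logprob(source_prefix, target_prefix, next_word)` is any scorer returning
    log P(next_word | source_prefix, target_prefix); it stands in for the
    off-the-shelf text translation system.
    """
    delays = []
    for t, word in enumerate(target_words):
        full_score = logprob(source_words, target_words[:t], word)
        chosen = len(source_words)  # default: wait for the whole utterance
        for k in range(1, len(source_words) + 1):
            if logprob(source_words[:k], target_words[:t], word) >= full_score - slack:
                chosen = k
                break
        # Keep delays monotone so the synthetic alignment never "un-hears" audio.
        delays.append(max(chosen, delays[-1] if delays else 0))
    return delays
```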
https://arxiv.org/abs/2502.03382
We give a comprehensive analysis of transformers as time series foundation models, focusing on their approximation and generalization capabilities. First, we demonstrate that there exist transformers that fit an autoregressive model on input univariate time series via gradient descent. We then analyze MOIRAI, a multivariate time series foundation model capable of handling an arbitrary number of covariates. We prove that it is capable of automatically fitting autoregressive models with an arbitrary number of covariates, offering insights into its design and empirical success. For generalization, we establish bounds for pretraining when the data satisfies Dobrushin's condition. Experiments support our theoretical findings, highlighting the efficacy of transformers as time series foundation models.
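As a concrete reference point for the approximation result, the sketch below shows what fitting an AR(p) model to a univariate series by gradient descent amounts to; this is the target computation, not the transformer construction itself, and the learning rate, step count, and simulated process are arbitrary choices.

```python
# Fitting an AR(p) model to a univariate series by gradient descent on the mean
# squared one-step-ahead error -- the reference procedure the paper argues a
# transformer can emulate in-context.

import numpy as np

def fit_ar_by_gd(x: np.ndarray, p: int = 2, lr: float = 0.1, steps: int = 2000) -> np.ndarray:
    """Return AR(p) coefficients minimizing mean squared one-step-ahead error."""
    # Lagged design matrix: row t holds (x[t-1], ..., x[t-p]).
    X = np.stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)], axis=1)
    y = x[p:]
    w = np.zeros(p)
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
series = np.zeros(200)
for t in range(2, 200):  # simulate an AR(2) process with known coefficients
    series[t] = 0.6 * series[t - 1] - 0.2 * series[t - 2] + rng.standard_normal()
print(fit_ar_by_gd(series, p=2))  # should come out near [0.6, -0.2], up to sampling noise
```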
https://arxiv.org/abs/2502.03383
This paper reports on the results from a pilot study investigating the impact of automatic speech recognition (ASR) technology on interpreting quality in remote healthcare interpreting settings. Employing a within-subjects experiment design with four randomised conditions, this study utilises scripted medical consultations to simulate dialogue interpreting tasks. It involves four trainee interpreters with a language combination of Chinese and English. It also gathers participants' experience and perceptions of ASR support through cued retrospective reports and semi-structured interviews. Preliminary data suggest that the availability of ASR, specifically the access to full ASR transcripts and to ChatGPT-generated summaries based on ASR, effectively improved interpreting quality. Varying types of ASR output had different impacts on the distribution of interpreting error types. Participants reported similar interactive experiences with the technology, expressing their preference for full ASR transcripts. This pilot study shows encouraging results of applying ASR to dialogue-based healthcare interpreting and offers insights into the optimal ways to present ASR output to enhance interpreter experience and performance. However, it should be emphasised that the main purpose of this study was to validate the methodology and that further research with a larger sample size is necessary to confirm these findings.
https://arxiv.org/abs/2502.03381
This white paper underscores the critical importance of responsibly deploying Artificial Intelligence (AI) in military contexts, emphasizing a commitment to ethical and legal standards. The evolving role of AI in the military goes beyond mere technical applications, necessitating a framework grounded in ethical principles. The discussion within the paper delves into ethical AI principles, particularly focusing on the Fairness, Accountability, Transparency, and Ethics (FATE) guidelines. Noteworthy considerations encompass transparency, justice, non-maleficence, and responsibility. Importantly, the paper extends its examination to military-specific ethical considerations, drawing insights from the Just War theory and principles established by prominent entities. In addition to the identified principles, the paper introduces further ethical considerations specifically tailored for military AI applications. These include traceability, proportionality, governability, responsibility, and reliability. The application of these ethical principles is discussed on the basis of three use cases in the domains of sea, air, and land. Methods of automated sensor data analysis, eXplainable AI (XAI), and intuitive user experience are utilized to specify the use cases close to real-world scenarios. This comprehensive approach to ethical considerations in military AI reflects a commitment to aligning technological advancements with established ethical frameworks. It recognizes the need for a balance between leveraging AI's potential benefits in military operations while upholding moral and legal standards. The inclusion of these ethical principles serves as a foundation for responsible and accountable use of AI in the complex and dynamic landscape of military scenarios.
https://arxiv.org/abs/2502.03376
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: this https URL.
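On finding (2), a minimal illustrative shaped reward might combine the verifiable correctness signal with a length term. The sketch below is an assumption-laden toy (budget, bonus, and penalty constants are invented here), not the paper's shaping function.

```python
# Illustrative shaped reward for RL on long chains-of-thought: the verifiable
# correctness signal dominates, a length term gently encourages longer correct
# reasoning up to a budget, and run-away generations past the budget are penalized.

def shaped_reward(correct: bool, cot_length: int, budget: int = 4096,
                  length_bonus: float = 0.2, overflow_penalty: float = 0.5) -> float:
    base = 1.0 if correct else 0.0
    if cot_length <= budget:
        frac = cot_length / budget
        # Reward longer reasoning only when the answer is verified correct.
        shaping = length_bonus * frac if correct else 0.0
    else:
        # Discourage exceeding the context budget regardless of correctness.
        shaping = -overflow_penalty
    return base + shaping

print(shaped_reward(True, 1024))   # 1.05
print(shaped_reward(True, 8192))   # 0.5
print(shaped_reward(False, 8192))  # -0.5
```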
https://arxiv.org/abs/2502.03373
The potato is a widely grown crop in many regions of the world, and potato farming has expanded considerably in recent decades. Potatoes are susceptible to several diseases that stunt their development, and the plant is particularly prone to leaf diseases: Early Blight and Late Blight are two prevalent leaf diseases that affect potato plants. Early detection of these diseases would help enhance the yield of this crop, and image processing is an ideal way to identify and analyze these disorders. Here, we present an autonomous method based on image processing and machine learning to detect late blight disease affecting potato leaves. The proposed method comprises four phases: (1) histogram equalization is used to improve the quality of the input image; (2) feature extraction is performed using a deep CNN model, and the extracted features are concatenated; (3) feature selection is performed using wrapper-based feature selection; (4) classification is performed using an SVM classifier and its variants. The proposed method achieves its highest accuracy of 99% using SVM with 550 selected features.
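A compact sketch of the four-phase pipeline follows. The paper concatenates deep-CNN features and selects 550 of them; the sketch below instead uses a single ResNet-18 backbone (512-dimensional features), a smaller selection budget, and an RBF SVM purely as placeholders.

```python
# Four-phase pipeline sketch: histogram equalization -> CNN features ->
# wrapper-based feature selection -> SVM classification.

import numpy as np
import torch
from torchvision import models, transforms
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC
from skimage import exposure

# Phase 1: histogram equalization to normalize illumination of each leaf image.
def preprocess(image: np.ndarray) -> np.ndarray:
    return exposure.equalize_hist(image)  # image in [0, 1], shape (H, W, 3)

# Phase 2: deep-CNN feature extraction (penultimate-layer activations).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()
to_tensor = transforms.Compose([transforms.ToTensor(), transforms.Resize((224, 224))])

def extract_features(image: np.ndarray) -> np.ndarray:
    with torch.no_grad():
        x = to_tensor(preprocess(image).astype(np.float32)).unsqueeze(0)
        return backbone(x).squeeze(0).numpy()  # 512-dim feature vector

# Phases 3 and 4: wrapper-based feature selection, then SVM classification.
def train(features: np.ndarray, labels: np.ndarray, n_select: int = 50):
    svm = SVC(kernel="rbf")
    selector = SequentialFeatureSelector(svm, n_features_to_select=n_select, cv=3)
    selector.fit(features, labels)
    svm.fit(selector.transform(features), labels)
    return selector, svm
```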
https://arxiv.org/abs/2502.03370
Learning from active human involvement enables the human subject to actively intervene and demonstrate to the AI agent during training. The interaction and corrective feedback from humans bring safety and AI alignment to the learning process. In this work, we propose a new reward-free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents: state-action pairs in the human demonstration are labeled with high values, while agent actions that are intervened upon receive low values. Through the TD-learning framework, the labeled values of demonstrated state-action pairs are further propagated to other unlabeled data generated from the agents' exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human-in-the-loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method can learn to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: this https URL
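A schematic sketch of the proxy-value idea, layered on a generic Q-learning update; the ±1 label magnitudes and the way the proxy supervision is mixed with the reward-free TD term are illustrative assumptions rather than the paper's exact losses.

```python
# Proxy-value sketch: demonstrated state-action pairs get high proxy values,
# intervened agent actions get low proxy values, and a reward-free TD term
# propagates these values onto unlabeled exploration data.

import torch

def pvp_like_loss(q_net, target_net, batch, gamma=0.99, proxy_weight=1.0):
    """batch fields: `state`, `action`, `next_state`, plus boolean flags
    `human_action` (demonstrated pairs) and `intervened` (overridden agent actions)."""
    q = q_net(batch["state"]).gather(1, batch["action"].unsqueeze(1)).squeeze(1)

    # Reward-free TD target: propagate values through agent exploration data.
    with torch.no_grad():
        next_v = target_net(batch["next_state"]).max(dim=1).values
    td_loss = torch.nn.functional.mse_loss(q, gamma * next_v)

    # Proxy value labels: +1 for human-demonstrated actions, -1 for intervened ones.
    proxy_target = torch.where(batch["human_action"], torch.ones_like(q), -torch.ones_like(q))
    labeled = batch["human_action"] | batch["intervened"]
    proxy_loss = ((q - proxy_target) ** 2 * labeled.float()).mean()

    return td_loss + proxy_weight * proxy_loss
```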
https://arxiv.org/abs/2502.03369
Thanks to the advances in generative architectures and large language models, data scientists can now code pipelines of machine-learning operations to process large collections of unstructured data. Recent progress has seen the rise of declarative AI frameworks (e.g., Palimpzest, Lotus, and DocETL) to build optimized and increasingly complex pipelines, but these systems often remain accessible only to expert programmers. In this demonstration, we present PalimpChat, a chat-based interface to Palimpzest that bridges this gap by letting users create and run sophisticated AI pipelines through natural language alone. By integrating Archytas, a ReAct-based reasoning agent, and Palimpzest's suite of relational and LLM-based operators, PalimpChat provides a practical illustration of how a chat interface can make declarative AI frameworks truly accessible to non-experts. Our demo system is publicly available online. At SIGMOD'25, participants can explore three real-world scenarios--scientific discovery, legal discovery, and real estate search--or apply PalimpChat to their own datasets. In this paper, we focus on how PalimpChat, supported by the Palimpzest optimizer, simplifies complex AI workflows such as extracting and analyzing biomedical data.
https://arxiv.org/abs/2502.03368
Volumetric Modulated Arc Therapy (VMAT) revolutionizes cancer treatment by precisely delivering radiation while sparing healthy tissues. Fluence map generation, crucial in VMAT planning, traditionally involves complex, iterative, and thus time-consuming processes; these fluence maps are subsequently leveraged for leaf sequencing. The deep-learning approach presented in this article aims to expedite this by directly predicting fluence maps from patient data. We developed a 3D network trained in a supervised way, using a combination of L1 and L2 losses, on RT plans generated by Eclipse and taken from the REQUITE dataset, with the RT dose map as input and the fluence maps computed from the corresponding RT plans as target. Our network jointly predicts the 180 fluence maps corresponding to the 180 control points (CP) of single-arc VMAT plans. To help the network, we pre-process the input dose by computing the projections of the 3D dose map onto the beam's eye view (BEV) of the 180 CPs, in the same coordinate system as the fluence maps. We generated over 2000 VMAT plans using Eclipse to scale up the dataset size. Additionally, we evaluated various network architectures and analyzed the impact of increasing the dataset size. We measure performance in the 2D fluence-map domain using image metrics (PSNR, SSIM), as well as in the 3D dose domain using the dose-volume histogram (DVH), on a validation dataset. Network inference, excluding data loading and processing, takes less than 20 ms. Using our proposed 3D network architecture and increasing the dataset size with Eclipse improved fluence-map reconstruction performance by approximately 8 dB in PSNR compared to a U-Net architecture trained on the original REQUITE dataset. The resulting DVHs are very close to those of the input target dose.
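A minimal sketch of the supervision described above, assuming a generic `model` that maps the stacked BEV dose projections to the 180 fluence maps; the equal L1/L2 weighting and tensor layout are placeholder assumptions.

```python
# Combined L1 + L2 supervision for joint prediction of the 180 fluence maps.

import torch
import torch.nn.functional as F

def combined_l1_l2_loss(pred: torch.Tensor, target: torch.Tensor,
                        l1_weight: float = 1.0, l2_weight: float = 1.0) -> torch.Tensor:
    """pred, target: (batch, 180, H, W) fluence maps for the 180 control points."""
    return l1_weight * F.l1_loss(pred, target) + l2_weight * F.mse_loss(pred, target)

def train_step(model, optimizer, bev_dose_projections, fluence_targets):
    optimizer.zero_grad()
    pred = model(bev_dose_projections)      # (batch, 180, H, W)
    loss = combined_l1_l2_loss(pred, fluence_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```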
https://arxiv.org/abs/2502.03360
Evaluations of large-scale recognition methods typically focus on overall performance. While this approach is common, it often fails to provide insights into performance across individual classes, which can lead to fairness issues and misrepresentation. Addressing these gaps is crucial for accurately assessing how well methods handle novel or unseen classes and ensuring a fair evaluation. To address fairness in Open-Set Recognition (OSR), we demonstrate that per-class performance can vary dramatically. We introduce Gaussian Hypothesis Open Set Technique (GHOST), a novel hyperparameter-free algorithm that models deep features using class-wise multivariate Gaussian distributions with diagonal covariance matrices. We apply Z-score normalization to logits to mitigate the impact of feature magnitudes that deviate from the model's expectations, thereby reducing the likelihood of the network assigning a high score to an unknown sample. We evaluate GHOST across multiple ImageNet-1K pre-trained deep networks and test it with four different unknown datasets. Using standard metrics such as AUOSCR, AUROC and FPR95, we achieve statistically significant improvements, advancing the state-of-the-art in large-scale OSR. Source code is provided online.
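A rough sketch of the class-wise Gaussian modeling and z-score adjustment described above; how the z-scores are folded into the final rejection score is an assumption here, and class labels are assumed to index the logit columns directly.

```python
# GHOST-style scoring sketch: per-class diagonal Gaussians over deep features,
# with logits down-weighted when the feature deviates from the predicted
# class's Gaussian (low scores suggest unknown inputs).

import numpy as np

class GhostLikeScorer:
    def fit(self, features: np.ndarray, labels: np.ndarray):
        """features: (N, D) deep features; labels: (N,) class indices 0..C-1."""
        self.mu_ = np.stack([features[labels == c].mean(axis=0) for c in range(labels.max() + 1)])
        self.sigma_ = np.stack([features[labels == c].std(axis=0) + 1e-8 for c in range(labels.max() + 1)])
        return self

    def score(self, features: np.ndarray, logits: np.ndarray) -> np.ndarray:
        """Return an adjusted max-logit score; low values suggest unknown inputs."""
        pred = logits.argmax(axis=1)
        # Per-dimension z-scores of the feature w.r.t. the predicted class's Gaussian.
        z = np.abs(features - self.mu_[pred]) / self.sigma_[pred]
        deviation = z.mean(axis=1)              # how atypical the feature is
        return logits.max(axis=1) / (1.0 + deviation)
```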
https://arxiv.org/abs/2502.03359
How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights--failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, and performing basic operations when inputs are structured into distinct blocks, simulating real-world data. Additionally, we design composite tests to investigate the models' ability to maintain state while operating on memory. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.
https://arxiv.org/abs/2502.03358
Game-theoretic models are effective tools for modeling multi-agent interactions, especially when robots need to coordinate with humans. However, applying these models requires inferring their specifications from observed behaviors -- a challenging task known as the inverse game problem. Existing inverse game approaches often struggle to account for behavioral uncertainty and measurement noise, and to leverage both offline and online data. To address these limitations, we propose an inverse game method that integrates a generative trajectory model into a differentiable mixed-strategy game framework. By representing the mixed strategy with a conditional variational autoencoder (CVAE), our method can infer high-dimensional, multi-modal behavior distributions from noisy measurements while adapting in real-time to new observations. We extensively evaluate our method in a simulated navigation benchmark, where the observations are generated by an unknown game model. Despite the model mismatch, our method can infer Nash-optimal actions comparable to those of the ground-truth model and the oracle inverse game baseline, even in the presence of uncertain agent objectives and noisy measurements.
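A compact CVAE sketch for the mixed-strategy representation; layer sizes, the flattened trajectory encoding, and the sampling routine are placeholder assumptions rather than the paper's architecture.

```python
# Conditional VAE over trajectories: the decoder maps a latent code plus the
# observed context to a trajectory, giving a multi-modal behavior distribution.

import torch
import torch.nn as nn

class TrajectoryCVAE(nn.Module):
    def __init__(self, obs_dim=16, traj_dim=40, latent_dim=4, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim + traj_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, traj_dim))
        self.latent_dim = latent_dim

    def forward(self, obs, traj):
        mu, logvar = self.encoder(torch.cat([obs, traj], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.decoder(torch.cat([obs, z], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return recon, kl

    def sample(self, obs, n=8):
        """Draw n candidate trajectories per observation (the mixed strategy)."""
        z = torch.randn(n, obs.shape[0], self.latent_dim)
        return self.decoder(torch.cat([obs.expand(n, -1, -1), z], dim=-1))
```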
https://arxiv.org/abs/2502.03356
Self-play has powered breakthroughs in two-player and multi-player games. Here we show that self-play is a surprisingly effective strategy in another domain. We show that robust and naturalistic driving emerges entirely from self-play in simulation at unprecedented scale -- 1.6 billion km of driving. This is enabled by Gigaflow, a batched simulator that can synthesize and train on 42 years of subjective driving experience per hour on a single 8-GPU node. The resulting policy achieves state-of-the-art performance on three independent autonomous driving benchmarks. The policy outperforms the prior state of the art when tested on recorded real-world scenarios, amidst human drivers, without ever seeing human data during training. The policy is realistic when assessed against human references and achieves unprecedented robustness, averaging 17.5 years of continuous driving between incidents in simulation.
https://arxiv.org/abs/2502.03349
We focus on human-robot collaborative transport, in which a robot and a user collaboratively move an object to a goal pose. In the absence of explicit communication, this problem is challenging because it demands tight implicit coordination between two heterogeneous agents, who have very different sensing, actuation, and reasoning capabilities. Our key insight is that the two agents can coordinate fluently by encoding subtle, communicative signals into actions that affect the state of the transported object. To this end, we design an inference mechanism that probabilistically maps observations of joint actions executed by the two agents to a set of joint strategies of workspace traversal. Based on this mechanism, we define a cost representing the human's uncertainty over the unfolding traversal strategy and introduce it into a model predictive controller that balances between uncertainty minimization and efficiency maximization. We deploy our framework on a mobile manipulator (Hello Robot Stretch) and evaluate it in a within-subjects lab study (N=24). We show that our framework enables greater team performance and empowers the robot to be perceived as a significantly more fluent and competent partner compared to baselines lacking a communicative mechanism.
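One way to picture the controller's trade-off, assuming the human's uncertainty is summarized by the entropy of a posterior over candidate traversal strategies; the weighting and the entropy form are illustrative assumptions, not the paper's exact cost.

```python
# MPC cost sketch: efficiency (plan cost) traded off against the human's
# inferred uncertainty over which traversal strategy is unfolding.

import numpy as np

def strategy_entropy(posterior: np.ndarray) -> float:
    """posterior: probabilities over candidate workspace-traversal strategies."""
    p = np.clip(posterior, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def mpc_cost(candidate_plan_cost: float, strategy_posterior: np.ndarray,
             uncertainty_weight: float = 2.0) -> float:
    # Lower is better: efficient plans that also make the intended strategy
    # legible (low posterior entropy) are preferred.
    return candidate_plan_cost + uncertainty_weight * strategy_entropy(strategy_posterior)
```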
https://arxiv.org/abs/2502.03346
Variational inference in probabilistic graphical models aims to approximate fundamental quantities such as marginal distributions and the partition function. Popular approaches are the Bethe approximation, tree-reweighted, and other types of convex free energies. These approximations are efficient but can fail if the model is complex and highly interactive. In this work, we analyze two classes of approximations that include the above methods as special cases: first, if the model parameters are changed; and second, if the entropy approximation is changed. We discuss benefits and drawbacks of either approach, and deduce from this analysis how a free energy approximation should ideally be constructed. Based on our observations, we propose approximations that automatically adapt to a given model and demonstrate their effectiveness for a range of difficult problems.
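For reference, the generic region-based form that these approximations instantiate is shown below (textbook notation, not necessarily the paper's): changing the entropy approximation corresponds to changing the counting numbers c_r, while the first class of approximations changes the parameters theta entering the energy term.

```latex
% Generic region-based free energy; the Bethe and tree-reweighted (TRW)
% approximations correspond to particular choices of counting numbers c_r.
\begin{aligned}
  F(b;\theta) \;=\; -\sum_{r}\sum_{x_r} b_r(x_r)\,\theta_r(x_r)
      \;-\; \sum_{r} c_r\, H(b_r),
  \qquad H(b_r) = -\sum_{x_r} b_r(x_r)\log b_r(x_r).\\[2pt]
  \text{Bethe:}\quad c_{ij}=1 \text{ (edges)},\;\; c_i = 1-\deg(i) \text{ (nodes)};
  \qquad
  \text{TRW:}\quad c_{ij}=\rho_{ij},\;\; c_i = 1-\sum_{j\in N(i)}\rho_{ij}.
\end{aligned}
```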
https://arxiv.org/abs/2502.03341
The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
https://arxiv.org/abs/2502.03333
During the early stages of interface design, designers need to produce multiple sketches to explore a design space. Design tools often fail to support this critical stage, because they insist on specifying more details than necessary. Although recent advances in generative AI have raised hopes of solving this issue, in practice they fail because expressing loose ideas in a prompt is impractical. In this paper, we propose a diffusion-based approach to the low-effort generation of interface sketches. It breaks new ground by allowing flexible control of the generation process via three types of inputs: A) prompts, B) wireframes, and C) visual flows. The designer can provide any combination of these as input at any level of detail, and will get a diverse gallery of low-fidelity solutions in response. The unique benefit is that large design spaces can be explored rapidly with very little effort in input-specification. We present qualitative results for various combinations of input specifications. Additionally, we demonstrate that our model aligns more accurately with these specifications than other models.
https://arxiv.org/abs/2502.03330
Recent advancements in large language models (LLMs) have led to significant successes across various applications, most notably a series of emergent capabilities, particularly in the areas of In-Context Learning (ICL) and Chain-of-Thought (CoT). To better understand and control model performance, many studies have begun investigating the underlying causes of these phenomena and their impact on task outcomes. However, existing explanatory frameworks predominantly focus on isolating and explaining ICL and CoT independently, leading to an incomplete understanding of their combined influence on model performance. To address this gap, we propose the Electronic Circuit Model (ECM), which provides a foundation for developing scalable, learnable policies and improving the management of AI-generated content. Specifically, ECM conceptualizes model behavior as an electronic circuit: ICL is represented as a semantic magnetic field that provides an additional voltage, following Faraday's Law, while CoT is modeled as series resistors that constrain the model's output performance, following Ohm's Law. Experimental results demonstrate that the ECM effectively predicts and explains LLM performance across a variety of prompting strategies. Furthermore, we apply ECM to advanced reasoning strategy optimization on a series of tasks, such as the International Olympiad in Informatics (IOI) and the International Mathematical Olympiad (IMO), achieving competitive performance that surpasses nearly 80% of top human competitors.
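For readers unfamiliar with the physics, the two laws the analogy invokes are shown below in textbook form, together with the mapping stated in the abstract; the composite expression for the output is a hedged illustration of that mapping, not the paper's calibrated performance model.

```latex
% Textbook forms of the two laws, with the abstract's mapping:
% ICL -> additional induced voltage, CoT steps -> series resistors.
\begin{aligned}
  \text{Faraday's Law:}\quad & \mathcal{E} = -\frac{d\Phi}{dt}
      && \text{(ICL acts as a semantic field inducing an extra voltage } V_{\mathrm{ICL}})\\
  \text{Ohm's Law:}\quad & I = \frac{V}{R}
      && \text{(each CoT step contributes a series resistance } R_k)\\
  \text{Combined reading:}\quad & I \;=\; \frac{V_{\mathrm{task}} + V_{\mathrm{ICL}}}{\sum_k R_k}
\end{aligned}
```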
https://arxiv.org/abs/2502.03325
Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a perfect zero in some cases) while maintaining high accuracy on in-distribution tasks, outperforming baseline methods by a significant margin.
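A hedged sketch of the overall recipe, assuming a generic `generate_text` LLM call and a max-softmax-style `score_fn`; the prompt wording and the threshold rule are invented here for illustration, not taken from the paper.

```python
# Synthetic OOD proxies via an LLM, used to calibrate a rejection threshold
# without any external OOD data.

from typing import Callable, List
import numpy as np

def build_ood_proxies(generate_text: Callable[[str], str],
                      ind_description: str, n: int = 200) -> List[str]:
    prompt = (
        f"The in-distribution task is: {ind_description}. "
        "Write one short text that is clearly OUTSIDE this distribution."
    )
    return [generate_text(prompt) for _ in range(n)]

def calibrate_threshold(score_fn: Callable[[str], float],
                        ind_texts: List[str], ood_proxies: List[str]) -> float:
    """Pick the threshold separating InD scores from synthetic-OOD scores
    (here: maximize balanced accuracy over a simple grid; higher score = more InD)."""
    ind = np.array([score_fn(t) for t in ind_texts])
    ood = np.array([score_fn(t) for t in ood_proxies])
    grid = np.linspace(min(ood.min(), ind.min()), max(ood.max(), ind.max()), 200)
    balanced = [((ind >= th).mean() + (ood < th).mean()) / 2 for th in grid]
    return float(grid[int(np.argmax(balanced))])
```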
https://arxiv.org/abs/2502.03323