Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to DeepSeek-R1. Our code is available at this https URL
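A minimal Python sketch of the iterative, consistency-based verification idea described above. The verifier call, stopping window, and fallback vote are illustrative assumptions rather than the authors' exact procedure; `verify_once` stands in for an LLM verifier that re-assesses a solution conditioned on its previous verdict.

```python
import random
from typing import Callable, List, Optional

def temporally_consistent_verdict(
    verify_once: Callable[[str, Optional[int]], int],
    solution: str,
    max_rounds: int = 10,
    window: int = 3,
) -> int:
    """Iteratively re-verify a solution, accepting the verdict only once it is
    stable across the last `window` self-reflection rounds.

    `verify_once(solution, previous_verdict)` is a hypothetical verifier call
    returning the index of the first erroneous step (-1 if no error is found),
    optionally conditioning on its own previous assessment.
    """
    history: List[int] = []
    previous: Optional[int] = None
    for _ in range(max_rounds):
        verdict = verify_once(solution, previous)
        history.append(verdict)
        previous = verdict
        # Accept once the verdict is identical for `window` consecutive rounds.
        if len(history) >= window and len(set(history[-window:])) == 1:
            return verdict
    # Fall back to the most frequent verdict if no stable agreement emerges.
    return max(set(history), key=history.count)

# Toy usage with a noisy mock verifier (purely illustrative).
def mock_verifier(solution: str, previous: Optional[int]) -> int:
    return 2 if random.random() < 0.8 else -1

print(temporally_consistent_verdict(mock_verifier, "step1 ... step4"))
```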
https://arxiv.org/abs/2503.14495
To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine-tuning because they require constructing examples that reflect the agent's specific capabilities. We therefore propose a radically new approach to teaching agents what they know: collaborative self-play. We construct multi-agent collaborations in which the group is rewarded for collectively arriving at correct answers. The desired meta-knowledge emerges from the incentives built into the structure of the interaction. We focus on small societies of agents that have access to heterogeneous tools (corpus-specific retrieval), and therefore must collaborate to maximize their success while minimizing their effort. Experiments show that group-level rewards for multi-agent communities can induce policies that transfer to improve tool use and selective prediction in settings where individual agents are deployed in isolation.
https://arxiv.org/abs/2503.14481
LLMs often adopt an assertive language style even when making false claims. Such "overconfident hallucinations" mislead users and erode trust. The ability to express in language the actual degree of uncertainty around a claim is therefore of great importance. We find that "verbal uncertainty" is governed by a single linear feature in the representation space of LLMs, and show that this has only moderate correlation with the actual "semantic uncertainty" of the model. We apply this insight and show that (1) the mismatch between semantic and verbal uncertainty is a better predictor of hallucinations than semantic uncertainty alone, and (2) we can intervene on verbal uncertainty at inference time and reduce hallucinations on short-form answers, achieving an average relative reduction of 32%.
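A toy numpy sketch of what an inference-time intervention on a single linear feature could look like: shift hidden states along a unit direction assumed to encode verbal uncertainty. The direction, layer, and scaling factor here are placeholders, not the feature extracted in the paper.

```python
import numpy as np

def steer_verbal_uncertainty(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift hidden states along a (unit-normalized) verbal-uncertainty direction.

    hidden:    (seq_len, d_model) residual-stream activations at some layer.
    direction: (d_model,) linear feature assumed to encode verbal uncertainty.
    alpha:     signed strength; positive values push toward more hedged wording.
    """
    u = direction / np.linalg.norm(direction)
    return hidden + alpha * u  # broadcast over the sequence dimension

# Illustrative call with random activations.
rng = np.random.default_rng(0)
h = rng.normal(size=(16, 768))
u = rng.normal(size=768)
print(steer_verbal_uncertainty(h, u, alpha=4.0).shape)
```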
https://arxiv.org/abs/2503.14477
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
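A hedged numpy sketch of the two techniques named in the acronym, following the abstract's description and the standard PPO-style surrogate: a decoupled (asymmetric) clip range and a dynamic-sampling filter that drops prompt groups whose sampled rewards are all identical and therefore carry no gradient signal. Hyperparameter values and the remaining techniques are illustrative, not the released implementation.

```python
import numpy as np

def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Token-level PPO-style surrogate with decoupled lower/upper clip ranges.

    ratio:     (num_tokens,) pi_theta / pi_old for each sampled token.
    advantage: (num_tokens,) advantage estimate broadcast to tokens.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))

def dynamic_sampling_filter(group_rewards):
    """Keep only prompt groups whose sampled responses are not all-correct or
    all-wrong, so every retained group contributes a non-zero learning signal.

    group_rewards: list of 1-D arrays, one array of rewards per prompt group.
    """
    return [r for r in group_rewards if r.min() != r.max()]

# Illustrative usage on toy data.
rng = np.random.default_rng(0)
ratio = rng.uniform(0.7, 1.4, size=64)
adv = rng.normal(size=64)
print(decoupled_clip_objective(ratio, adv))
print(len(dynamic_sampling_filter([np.array([1, 1, 1]), np.array([0, 1, 1])])))
```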
https://arxiv.org/abs/2503.14476
Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. We present SPLINTER, a pre-processing step which rearranges text into a linear form that better represents such nonconcatenative morphologies, enabling meaningful contiguous segments to be found by the tokenizer. We demonstrate SPLINTER's merit both with intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay, and on downstream tasks using BERT-architecture models trained for Hebrew.
https://arxiv.org/abs/2503.14433
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. These gestures significantly influence the addressee's engagement, recall, comprehension, and attitudes toward the speaker. Similarly, they impact interactions between humans and embodied virtual agents. The process of selecting and animating meaningful gestures has thus become a key focus in the design of these agents. However, automating this gesture selection process poses a significant challenge. Prior gesture generation techniques have ranged from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to more manual approaches that require specific gesture-crafting expertise, are time-consuming, and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures. We first describe how information on gestures is encoded into GPT-4. Then, we conduct a study to evaluate alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and to align them appropriately with the co-speech utterance. Finally, we detail and demonstrate how this approach has been implemented within a virtual agent system, automating the selection and subsequent animation of the selected gestures for enhanced human-agent interactions.
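A small sketch of what prompt-based gesture selection can look like; the gesture inventory, prompt wording, and JSON output format below are hypothetical stand-ins for the encoding the paper builds into GPT-4.

```python
import json

GESTURE_INVENTORY = {  # hypothetical gesture names and glosses
    "container": "two hands shape a bounded region, for abstract 'amounts' or 'groups'",
    "dismiss": "a quick outward flick of the hand, for rejecting an idea",
    "enumerate": "counting on fingers, for listing items",
}

def build_gesture_prompt(utterance: str) -> str:
    """Compose a prompt asking the model to pick a gesture and the word span it
    should align with, returning structured JSON."""
    inventory = "\n".join(f"- {name}: {gloss}" for name, gloss in GESTURE_INVENTORY.items())
    return (
        "You select co-speech gestures for a virtual agent.\n"
        f"Available gestures:\n{inventory}\n\n"
        f"Utterance: \"{utterance}\"\n"
        "Reply with JSON: {\"gesture\": <name>, \"aligned_words\": <substring>, \"reason\": <short>}"
    )

def parse_gesture_choice(llm_reply: str) -> dict:
    """Parse the model's JSON reply; fall back to no gesture on malformed output."""
    try:
        return json.loads(llm_reply)
    except json.JSONDecodeError:
        return {"gesture": None, "aligned_words": None, "reason": "unparsable reply"}

print(build_gesture_prompt("There are three main reasons to reject this plan."))
```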
https://arxiv.org/abs/2503.14408
This paper explores hallucination phenomena in large language models (LLMs) through the lens of language philosophy and psychoanalysis. By incorporating Lacan's concepts of the "chain of signifiers" and "suture points," we propose the Anchor-RAG framework as a novel approach to mitigate hallucinations. In contrast to the predominant reliance on trial-and-error experiments, constant adjustments of mathematical formulas, or resource-intensive methods that emphasize quantity over quality, our approach returns to the fundamental principles of linguistics to analyze the root causes of hallucinations in LLMs. Drawing from robust theoretical foundations, we derive algorithms and models that are not only effective in reducing hallucinations but also enhance LLM performance and improve output quality. This paper seeks to establish a comprehensive theoretical framework for understanding hallucinations in LLMs and aims to challenge the prevalent "guess-and-test" approach and rat race mentality in the field. We aspire to pave the way for a new era of interpretable LLMs, offering deeper insights into the inner workings of language-based AI systems.
https://arxiv.org/abs/2503.14392
Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
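One natural reading of a likelihood-ratio scorer for multiple-choice answering, sketched on toy log-probabilities: rank options by the gap between a positively-trained model's log-likelihood and that of a model exposed to negative examples, so plausible near-misses are pushed down. This is an illustrative construction, not necessarily the paper's exact Likra formulation.

```python
import numpy as np

def likelihood_ratio_scores(logp_positive, logp_negative):
    """Score each option by log p_pos(option) - log p_neg(option), so options the
    negative model finds unusually likely (plausible near-misses) are penalized."""
    return np.asarray(logp_positive) - np.asarray(logp_negative)

# Toy example: option B is a plausible near-miss the negative model has learned to flag.
logp_pos = [-2.1, -1.9, -4.0, -4.5]   # log-likelihoods under the positively-tuned model
logp_neg = [-5.0, -1.5, -5.5, -5.0]   # log-likelihoods under the negatively-tuned model
scores = likelihood_ratio_scores(logp_pos, logp_neg)
print(scores, "-> pick option", "ABCD"[int(np.argmax(scores))])
```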
https://arxiv.org/abs/2503.14391
The purpose of this paper is to examine whether large language models (LLMs) can understand what is good and evil with respect to judging the good/evil reputation of celebrities. Specifically, we first apply a large language model (namely, ChatGPT) to the task of collecting sentences that mention the target celebrity from articles about celebrities on Web pages. Next, the collected sentences are categorized based on their contents by ChatGPT, which assigns a category name to each of those categories. Those assigned category names are referred to as "aspects" of each celebrity. Then, by applying the framework of retrieval-augmented generation (RAG), we show that the large language model is quite effective in the task of judging the good/evil reputation of aspects and descriptions of each celebrity. Finally, to demonstrate the advantages of the proposed method over existing services incorporating RAG functions, we show that it significantly outperforms such an existing service in judging the good/evil of each celebrity's aspects and descriptions.
https://arxiv.org/abs/2503.14382
While recent works (e.g. o1, DeepSeek R1) have demonstrated great promise of using long Chain-of-Thought (CoT) to improve reasoning capabilities of language models, scaling it up during test time is challenging due to inefficient memory usage -- intermediate computations accumulate indefinitely in context even when they are no longer needed for future thoughts. We propose PENCIL, which incorporates a reduction mechanism into the autoregressive generation process, allowing the model to recursively clean up intermediate thoughts based on patterns learned from training. With this reduction mechanism, PENCIL significantly reduces the maximal context length required during generation, and thus can generate longer thoughts with limited memory, solving larger-scale problems given more thinking time. For example, we demonstrate PENCIL achieves 97% accuracy on the challenging Einstein's puzzle -- a task even large models like GPT-4 struggle with -- using only a small 25M-parameter transformer with 2048 context length. Theoretically, we prove PENCIL can perform universal space-efficient computation by simulating Turing machines with optimal time and space complexity, and thus can solve arbitrary computational tasks that would otherwise be intractable given context window constraints.
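A toy sketch of the reduction mechanism described above, assuming hypothetical marker tokens [CALL], [SEP], and [RETURN]: once a sub-computation returns, the intermediate thoughts between the markers are erased and only the returned answer remains in context. Token names and the rewrite rule follow the abstract's description, not necessarily the paper's exact format.

```python
from typing import List

CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_context(tokens: List[str]) -> List[str]:
    """Rewrite '... [CALL] thoughts [SEP] answer [RETURN] ...' into '... answer ...',
    discarding the intermediate thoughts so the context stays short."""
    while RETURN in tokens:
        r = tokens.index(RETURN)
        c = max(i for i in range(r) if tokens[i] == CALL)      # innermost matching call
        s = max(i for i in range(c, r) if tokens[i] == SEP)    # separator before the answer
        tokens = tokens[:c] + tokens[s + 1:r] + tokens[r + 1:]
    return tokens

trace = ["solve", CALL, "try", "x=3", "check", "fails", "so", "x=5", SEP, "x=5", RETURN, "therefore"]
print(reduce_context(trace))   # -> ['solve', 'x=5', 'therefore']
```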
https://arxiv.org/abs/2503.14337
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development, by enabling automation. In software engineering, LLM-powered coding agents have garnered significant attention due to their potential to automate complex development tasks, assist in debugging, and enhance productivity. However, existing approaches often struggle with sub-optimal decision-making, requiring either extensive manual intervention or inefficient compute scaling strategies. To improve coding agent performance, we present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents that is faster and more effective at recovering from sub-optimal decisions than baselines. While traditional agents either follow linear trajectories or rely on random sampling to scale compute, DARS branches out a trajectory at certain key decision points, taking an alternative action given the history of the trajectory and the execution feedback of the previous attempt from that point. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. Our framework achieves a pass@1 rate of 47%, outperforming state-of-the-art (SOTA) open-source frameworks.
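A minimal sketch of the branching step: at a key decision point, spawn alternative continuations conditioned on the shared trajectory prefix plus the execution feedback of the previous attempt. `propose_alternative` and `run_attempt` are hypothetical stand-ins for the agent policy and the execution harness.

```python
from typing import Callable, List

def dars_branch(
    prefix: List[str],
    feedback: str,
    propose_alternative: Callable[[List[str], str], str],
    run_attempt: Callable[[List[str]], str],
    num_branches: int = 2,
) -> List[List[str]]:
    """Spawn alternative continuations of a trajectory at a key decision point.
    Each branch conditions on the shared prefix plus the execution feedback
    from the previous (failed) attempt, then collects fresh feedback."""
    branches = []
    for _ in range(num_branches):
        action = propose_alternative(prefix, feedback)
        trajectory = prefix + [action]
        trajectory.append(run_attempt(trajectory))  # e.g. test-suite output
        branches.append(trajectory)
    return branches

# Toy usage with stub policy and harness.
prefix = ["read issue", "locate bug in utils.py"]
branches = dars_branch(
    prefix,
    feedback="tests failed: ImportError in utils.py",
    propose_alternative=lambda hist, fb: "patch the missing import in utils.py",
    run_attempt=lambda traj: "tests passed",
)
print(branches)
```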
https://arxiv.org/abs/2503.14269
The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume 'perfect' information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find that all current models except Claude struggle to recognize missing tools or information. Further, to study possible mitigation of the failures, we enable real-time human interaction, named the Ask-and-Help (AAH) method, to provide missing information or replace non-functional tools. While AAH can help models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.
https://arxiv.org/abs/2503.14227
In end-to-end speech translation, the acoustic representations learned by the encoder are usually fixed and static from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenges in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. We concatenate the acoustic state and target word embedding sequence and feed the concatenated sequence into subsequent blocks in the decoder. In order to model the deep interaction between acoustic states and target hidden states, a speech-text mixed attention sublayer is introduced to replace the conventional cross-attention network. Experimental results on two widely used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.
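A PyTorch sketch of one plausible reading of the speech-text mixed attention sublayer: acoustic states and target-side states are concatenated along the time axis and attended to jointly in place of conventional cross-attention. Dimensions, normalization, and layer placement are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpeechTextMixedAttention(nn.Module):
    """Mixed-attention sublayer sketch: target states attend over the concatenation
    of acoustic states and target states, replacing the usual cross-attention block."""

    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, acoustic: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # acoustic: (batch, src_len, d_model), target: (batch, tgt_len, d_model)
        mixed = torch.cat([acoustic, target], dim=1)          # joint sequence
        out, _ = self.attn(query=target, key=mixed, value=mixed)
        return self.norm(target + out)                        # residual + norm

layer = SpeechTextMixedAttention()
acoustic = torch.randn(2, 50, 256)
target = torch.randn(2, 7, 256)
print(layer(acoustic, target).shape)   # torch.Size([2, 7, 256])
```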
https://arxiv.org/abs/2503.14185
Named Entity Recognition (NER) is a critical component of Natural Language Processing (NLP) for extracting structured information from unstructured text. However, for low-resource languages like Catalan, the performance of NER systems often suffers due to the lack of high-quality annotated datasets. This paper introduces NERCat, a fine-tuned version of the GLiNER[1] model, designed to improve NER performance specifically for Catalan text. We used a dataset of manually annotated Catalan television transcriptions to train and fine-tune the model, focusing on domains such as politics, sports, and culture. The evaluation results show significant improvements in precision, recall, and F1-score, particularly for underrepresented named entity categories such as Law, Product, and Facility. This study demonstrates the effectiveness of domain-specific fine-tuning in low-resource languages and highlights the potential for enhancing Catalan NLP applications through manual annotation and high-quality datasets.
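For context, a usage sketch with the open-source `gliner` package that NERCat builds on; the checkpoint name, label set, and threshold below are illustrative assumptions (substitute the actual NERCat weights if and when they are released), and the call assumes the package's `predict_entities` interface.

```python
from gliner import GLiNER

# Hypothetical checkpoint name for illustration; NERCat fine-tunes a GLiNER model.
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

labels = ["Person", "Organization", "Location", "Law", "Product", "Facility"]
text = "El Parlament de Catalunya va aprovar la llei al Palau de la Generalitat."

# Zero-/few-shot style prediction with an open label set.
for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```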
https://arxiv.org/abs/2503.14173
Real dialogues with AI assistants for solving data-centric tasks often follow dynamic, unpredictable paths due to imperfect information provided by the user or in the data, which must be caught and handled. Developing datasets which capture such user-AI interactions is difficult and time-consuming. In this work, we develop a novel framework for synthetically generating controlled, multi-turn conversations between a user and an AI assistant for the task of table-based question answering, which can be generated from an existing dataset with fully specified table QA examples for any target domain. Each conversation aims to solve a table-based reasoning question through collaborative effort, modeling one of two real-world scenarios: (1) an AI-initiated clarification, or (2) a user-initiated correction. Critically, we employ a strong teacher LLM to verify the correctness of our synthetic conversations, ensuring high quality. We demonstrate synthetic datasets generated from TAT-QA and WikiTableQuestions as benchmarks for frontier LLMs. We find that even larger models struggle to effectively issue clarification questions and to accurately integrate user feedback for corrections.
https://arxiv.org/abs/2503.14167
The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific syntax and lower representation in training datasets, pose significant challenges for conventional tokenization and decoding approaches. In this paper, we introduce a novel application of speculative decoding for Verilog code generation, showing that it can improve both inference speed and output quality, effectively achieving speed and quality all in one. Unlike standard LLM tokenization schemes, which often fragment meaningful code structures, our approach aligns decoding stops with syntactically significant tokens, making it easier for models to learn the token distribution. This refinement addresses inherent tokenization issues and enhances the model's ability to capture Verilog's logical constructs more effectively. Our experimental results show that our method achieves up to a 5.05x speedup in Verilog code generation and increases pass@10 functional accuracy on RTLLM by up to 17.19% compared to conventional training strategies. These findings highlight speculative decoding as a promising approach to bridge the quality gap in code generation for specialized programming languages.
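A hedged sketch of the core idea of aligning speculative-decoding stops with syntactically significant Verilog tokens: the draft model proposes tokens until it hits a syntax boundary, and the target model then accepts the longest matching prefix. The boundary set, greedy acceptance rule, and the `draft_next`/`verify_next` stubs are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

# Illustrative set of syntactically significant Verilog boundary tokens.
SYNTAX_STOPS = {";", "begin", "end", "endmodule", ")"}

def speculate_until_boundary(
    draft_next: Callable[[List[str]], str],
    context: List[str],
    max_draft: int = 16,
) -> List[str]:
    """Let a small draft model propose tokens, ending the speculative span at the
    next syntactically significant token instead of an arbitrary length cutoff."""
    drafted: List[str] = []
    for _ in range(max_draft):
        token = draft_next(context + drafted)
        drafted.append(token)
        if token in SYNTAX_STOPS:
            break
    return drafted

def accept_longest_prefix(
    drafted: List[str],
    verify_next: Callable[[List[str]], str],
    context: List[str],
) -> List[str]:
    """Greedy speculative verification: keep the longest drafted prefix that the
    target model would have generated itself."""
    accepted: List[str] = []
    for token in drafted:
        if verify_next(context + accepted) != token:
            break
        accepted.append(token)
    return accepted
```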
https://arxiv.org/abs/2503.14153
To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38,738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline approaches to measurement extraction, where quantities are first identified and then their measurement context. To allow reproduction of this work using newer or different versions of Wikipedia and Wikidata, we publish the code used to create the datasets along with the data.
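A minimal illustration of the first stage of such a pipeline (quantity identification) using a deliberately simple regular expression; the pattern and unit list are toy assumptions, not the annotation scheme used to build the datasets.

```python
import re

# Minimal illustrative pattern: a number followed by an optional unit token.
QUANTITY_PATTERN = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>km|kg|m|s)?\b")

def find_quantities(sentence: str):
    """First pipeline stage: mark candidate quantity spans (value + optional unit).
    A second stage would then attach the measured entity, property, and qualifiers."""
    return [
        {"span": m.span(), "value": m.group("value"), "unit": m.group("unit")}
        for m in QUANTITY_PATTERN.finditer(sentence)
    ]

print(find_quantities("Mount Everest is 8848.86 m tall and was first climbed in 1953."))
```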
https://arxiv.org/abs/2503.14090
Large language models (LLMs) have unlocked new possibilities for generating synthetic training data in both natural language and code. By producing artificial but task-relevant examples, these models can significantly augment or even replace real-world datasets, especially when labeled data is scarce or sensitive. This paper surveys recent advances in using LLMs to create synthetic text and code, emphasizing prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We show how these methods enrich low-resource tasks such as classification and question answering, as well as code-centric applications such as instruction tuning, code translation, and bug repair, by enabling automated verification of functional correctness. Alongside potential benefits like cost-effectiveness, broad coverage, and controllable diversity, we address challenges such as factual inaccuracies in generated text, lack of stylistic realism, and the risk of bias amplification. Proposed mitigations include filtering and weighting outputs and reinforcement learning with execution feedback for code. We conclude with open research directions like automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, highlighting the importance of LLM-generated synthetic data in advancing AI while emphasizing ethical and quality safeguards.
https://arxiv.org/abs/2503.14023
Compression is at the heart of intelligence. A theoretically optimal way to compress any sequence of data is to find the shortest program that outputs that sequence and then halts. However, such 'Kolmogorov compression' is uncomputable, and code-generating LLMs struggle to approximate this theoretical ideal, as it requires reasoning, planning and search capabilities beyond those of current models. In this work, we introduce the KoLMogorov-Test (KT), a compression-as-intelligence test for code-generating LLMs. In KT a model is presented with a sequence of data at inference time, and asked to generate the shortest program that produces the sequence. We identify several benefits of KT for both evaluation and training: an essentially infinite number of problem instances of varying difficulty is readily available, strong baselines already exist, the evaluation metric (compression) cannot be gamed, and pretraining data contamination is highly unlikely. To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly -- both GPT-4o and Llama-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. Moreover, we show that gains on synthetic data generalize poorly to real data, suggesting that new innovations are necessary for additional gains on KT.
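A small sketch of how a KT-style check can be scored: execute a candidate program, verify that its output reproduces the target sequence exactly, and report the program length relative to the raw data. The scoring convention here is illustrative, and executing untrusted generated code like this should only ever happen in a sandbox.

```python
import io
import contextlib

def kt_score(program_source: str, target: str):
    """Run a candidate program, check that it reproduces the target sequence
    exactly, and report its length relative to the raw data (lower is better)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program_source, {})   # toy sketch only; use a sandbox in practice
    except Exception:
        return {"correct": False, "ratio": None}
    correct = buffer.getvalue() == target
    ratio = len(program_source) / len(target) if correct else None
    return {"correct": correct, "ratio": ratio}

target = "ab" * 64
program = "print('ab' * 64, end='')"
print(kt_score(program, target))   # a short correct program yields a ratio well below 1
```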
https://arxiv.org/abs/2503.13992
Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding -- the process by which conversation participants establish mutual understanding -- can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, early grounding failures predicted later interaction breakdowns. Building on these insights, we introduce RIFTS: a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on RIFTS, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention that mitigates grounding failures.
https://arxiv.org/abs/2503.13975