Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at this https URL and this https URL.
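To make the failure mode concrete, here is a minimal probe (our own sketch under assumed interfaces, not the authors' released code) that measures a judge's false-positive rate on exactly these content-free "master key" responses; `judge_fn` is a stand-in for any LLM judge call:

```python
# Probe an LLM judge with content-free "master key" responses.
# `judge_fn` is a stand-in: any callable taking a prompt string and
# returning the judge's raw text verdict (e.g., an API call).

MASTER_KEYS = [":", ".", "Thought process:",
               "Let's solve this problem step by step."]

JUDGE_TEMPLATE = """You are grading a student's answer.
Question: {question}
Reference answer: {reference}
Student answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def false_positive_rate(judge_fn, problems):
    """Fraction of (problem, master-key) pairs judged CORRECT.

    `problems` is a list of dicts with "question" and "reference" keys.
    Any CORRECT verdict here is a false positive, since the candidate
    responses carry no actual answer content.
    """
    hits, total = 0, 0
    for p in problems:
        for key in MASTER_KEYS:
            prompt = JUDGE_TEMPLATE.format(
                question=p["question"], reference=p["reference"],
                candidate=key)
            verdict = judge_fn(prompt).strip().upper()
            hits += verdict.startswith("CORRECT")
            total += 1
    return hits / max(total, 1)

if __name__ == "__main__":
    # A trivially gameable mock judge, just to make the probe runnable.
    mock_judge = lambda prompt: ("CORRECT" if "step by step" in prompt
                                 else "INCORRECT")
    probs = [{"question": "2+2?", "reference": "4"}]
    print(false_positive_rate(mock_judge, probs))  # 0.25
```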
https://arxiv.org/abs/2507.08794
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, exemplified by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large fraction of the parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to a 3.67$\times$ speedup over dense models on real end-side devices. All code and checkpoints are publicly available (this https URL).
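As a rough illustration of the routing idea (an assumption-laden sketch with placeholder dimensions, not the released BlockFFN implementation), the module below applies a linear scoring layer, RMSNorm, and ReLU, so gate values are differentiable, non-negative, and naturally sparse without a hard top-k:

```python
import torch
import torch.nn as nn

class ReLURouter(nn.Module):
    """Differentiable sparse router: linear scores -> RMSNorm -> ReLU.

    A minimal sketch of the routing described for BlockFFN; dimensions
    and the expert FFNs themselves are placeholders, not the paper's
    values. (nn.RMSNorm requires PyTorch >= 2.4.)
    """
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.score = nn.Linear(d_model, n_experts, bias=False)
        self.norm = nn.RMSNorm(n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU zeroes out negative scores, so each token activates only
        # a subset of experts without non-differentiable hard routing.
        return torch.relu(self.norm(self.score(x)))

router = ReLURouter(d_model=64, n_experts=16)
gates = router(torch.randn(2, 8, 64))            # (batch, tokens, experts)
token_sparsity = (gates == 0).float().mean()     # token-level sparsity
# Chunk-level view: an expert counts as active for a chunk if any of its
# tokens activates it; here each batch row is one 8-token chunk. An
# untrained router shows exactly the low CLS that the paper's CLS-aware
# objectives are designed to fix.
chunk_active = (gates > 0).any(dim=1).float().mean()
print(f"TLS={token_sparsity:.2f}, 8-token chunk activation={chunk_active:.2f}")
```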
https://arxiv.org/abs/2507.08771
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often has only the more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good, if not better, than attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features indicative of speaker identity.
https://arxiv.org/abs/2507.08660
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take text block positions into account in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
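The geometric idea can be sketched as follows; the bucketed bias table is our hypothetical simplification of how relative polar coordinates might enter the attention scores, not the paper's exact parameterization:

```python
import numpy as np

def relative_polar(centers: np.ndarray):
    """Pairwise (distance, angle) between 2D text-block centers.

    `centers` has shape (n, 2). Returns two (n, n) arrays. This sketches
    the geometry only; how DocPolarBERT embeds these values into
    attention is simplified below.
    """
    delta = centers[None, :, :] - centers[:, None, :]   # (n, n, 2)
    dist = np.linalg.norm(delta, axis=-1)
    angle = np.arctan2(delta[..., 1], delta[..., 0])    # in (-pi, pi]
    return dist, angle

def polar_attention_bias(dist, angle, n_dist_buckets=8, n_angle_buckets=8,
                         max_dist=1000.0, rng=np.random.default_rng(0)):
    # Hypothetical bucketed bias table: each (distance-bucket,
    # angle-bucket) pair maps to a scalar added to the attention logits;
    # in a trained model the table entries would be learned.
    table = rng.normal(size=(n_dist_buckets, n_angle_buckets))
    d_idx = np.minimum((dist / max_dist * n_dist_buckets).astype(int),
                       n_dist_buckets - 1)
    a_idx = ((angle + np.pi) / (2 * np.pi) * n_angle_buckets).astype(int)
    a_idx = np.minimum(a_idx, n_angle_buckets - 1)
    return table[d_idx, a_idx]                          # (n, n)

centers = np.array([[10., 20.], [410., 20.], [10., 220.]])
dist, angle = relative_polar(centers)
print(polar_attention_bias(dist, angle))  # additive bias over 3 blocks
```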
https://arxiv.org/abs/2507.08606
To ensure equitable access to the benefits of large language models (LLMs), it is essential to evaluate their capabilities across the world's languages. We introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance. In addition to ranking models, the platform offers descriptive insights such as a global proficiency map and trends over time. By complementing and extending prior multilingual benchmarks, our work aims to foster transparency, inclusivity, and progress in multilingual AI. The system is available at this https URL.
https://arxiv.org/abs/2507.08538
Latent Dirichlet Allocation (LDA) is a prominent generative probabilistic model used for uncovering abstract topics within document collections. In this paper, we explore the effectiveness of augmenting topic models with Large Language Models (LLMs) through integration into two key phases: Initialization and Post-Correction. Since LDA is highly dependent on the quality of its initialization, we conduct extensive experiments on LLM-guided topic clustering for initializing the Gibbs sampling algorithm. Interestingly, the experimental results reveal that while the proposed initialization strategy improves the early iterations of LDA, it has no effect on convergence and ultimately yields the worst performance among the baselines. The LLM-enabled post-correction, on the other hand, achieved a promising improvement of 5.86% in the coherence evaluation. These results highlight the practical benefits of the LLM-in-the-loop approach and challenge the belief that LLMs are always the superior text mining alternative.
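A minimal sketch of the LLM-guided initialization idea (our own simplification; the cluster-assignment callable stands in for the actual LLM prompting, and the seeding probability is an illustrative choice):

```python
import random

def init_topic_assignments(docs, llm_cluster_of, n_topics, seed_prob=0.9,
                           rng=random.Random(0)):
    """Seed per-token topic assignments for collapsed Gibbs sampling.

    `llm_cluster_of` is a stand-in callable mapping a document to a
    cluster id in [0, n_topics), e.g., obtained by prompting an LLM to
    group documents by theme. With probability `seed_prob` a token gets
    its document's LLM cluster; otherwise a random topic, so the sampler
    is biased toward the LLM clustering but not locked into it.
    """
    z = []
    for doc in docs:
        k = llm_cluster_of(doc)
        z.append([k if rng.random() < seed_prob
                  else rng.randrange(n_topics)
                  for _ in doc])
    return z  # same shape as docs: one topic id per token

docs = [["cell", "protein"], ["stock", "market", "price"]]
fake_llm = lambda doc: 0 if "cell" in doc else 1  # mock clustering
print(init_topic_assignments(docs, fake_llm, n_topics=2))
```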
https://arxiv.org/abs/2507.08498
While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model's reasoning capability in counterfactual scenarios. Extensive experiments on the ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available at this https URL.
https://arxiv.org/abs/2507.08496
There are currently two main paradigms for evaluating large language models (LLMs): reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel, among whose responses the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions, while stressing goal-directedness: dialogue-game-based evaluation. While the utility of this approach has been shown by several projects, its adoption has been held back by the lack of a mature, easily re-usable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made targeted tests.
https://arxiv.org/abs/2507.08491
The deep integration of large language models and automatic speech recognition systems has become a promising research direction with high practical value. To address the overfitting issue commonly observed in Low-Rank Adaptation (LoRA) during the supervised fine-tuning (SFT) stage, this work proposes an innovative training paradigm, Iterative LoRA Training (ILT), in combination with an Iterative Pseudo Labeling strategy, effectively raising the theoretical upper bound of model performance. Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. Experimental results demonstrate the effectiveness of the proposed method. Furthermore, the MegaAIS research team applied this technique in the Interspeech 2025 Multilingual Conversational Speech Language Modeling Challenge (MLC-SLM), achieving 4th place in Track 1 (Multilingual ASR Task) and 1st place in Track 2 (Speech Separation and Recognition Task), showcasing the practical feasibility and strong application potential of our approach.
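The three-stage loop might be sketched as follows; the stage names follow the abstract, but every callable is a stand-in for a real component and the confidence-based filtering rule is our own assumption:

```python
def iterative_lora_training(model, labeled, unlabeled,
                            train_lora, transcribe, confidence,
                            rounds=2, threshold=0.9):
    """Sketch of the ILT idea with an iterative pseudo-labeling loop.

    All callables are stand-ins: `train_lora` fine-tunes and returns the
    model, `transcribe` decodes audio, `confidence` scores an
    (audio, hypothesis) pair. Stage names follow the paper; the concrete
    steps here are a simplification, not the authors' recipe.
    """
    # Stage 1 -- Focus Training: LoRA fine-tune on the labeled set.
    model = train_lora(model, labeled)

    # Stage 2 -- Feed Back Training: pseudo-label, filter, retrain.
    for _ in range(rounds):
        pseudo = []
        for audio in unlabeled:
            hyp = transcribe(model, audio)
            if confidence(model, audio, hyp) >= threshold:
                pseudo.append((audio, hyp))
        model = train_lora(model, labeled + pseudo)

    # Stage 3 -- Fix Training: a final pass on the clean labeled data
    # only, to wash out residual pseudo-label noise.
    return train_lora(model, labeled)
```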
https://arxiv.org/abs/2507.08477
This paper provides an experimental evaluation of the capability of large language models (LLMs) to assist in legal decision-making within the framework of Austrian and European Union value-added tax (VAT) law. In tax consulting practice, clients often describe cases in natural language, making LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. Given the requirement for legally grounded and well-justified analyses, the propensity of LLMs to hallucinate presents a considerable challenge. The experiments focus on two common methods for enhancing LLM performance: fine-tuning and retrieval-augmented generation (RAG). In this study, these methods are applied to both textbook cases and real-world cases from a tax consulting firm to systematically determine the best configurations of LLM-based systems and assess the legal-reasoning capabilities of LLMs. The findings highlight the potential of using LLMs to support tax consultants by automating routine tasks and providing initial analyses, although current prototypes are not ready for full automation due to the sensitivity of the legal domain. More specifically, the results indicate that LLMs, when properly configured, can effectively support tax professionals in VAT tasks and provide legally grounded justifications for decisions. However, limitations remain regarding the handling of implicit client knowledge and context-specific documentation, underscoring the need for future integration of structured background information.
https://arxiv.org/abs/2507.08468
With the widespread application of Large Language Models (LLMs) in various tasks, mainstream LLM platforms generate massive user-model interactions daily. In order to efficiently analyze the performance of models and diagnose failures in their answers, it is essential to develop an automated framework to systematically categorize and attribute errors. However, existing evaluation models lack error attribution capability. In this work, we establish a comprehensive Misattribution Framework with 6 primary and 15 secondary categories to facilitate in-depth analysis. Based on this framework, we present AttriData, a dataset specifically designed for error attribution, encompassing misattribution annotations along with the corresponding scores and feedback. We also propose MisAttributionLLM, a model fine-tuned on AttriData, which is the first general-purpose judge model capable of simultaneously generating scores, misattributions, and feedback. Extensive experiments and analyses are conducted to confirm the effectiveness and robustness of our proposed method.
https://arxiv.org/abs/2507.08459
Shapes Constraint Language (SHACL) is a powerful language for validating RDF data. Given the recent industry attention to Knowledge Graphs (KGs), more users need to validate linked data properly. However, traditional SHACL validation engines often provide terse reports in English that are difficult for non-technical users to interpret and act upon. This paper presents xpSHACL, an explainable SHACL validation system that addresses this issue by combining rule-based justification trees with retrieval-augmented generation (RAG) and large language models (LLMs) to produce detailed, multilanguage, human-readable explanations for constraint violations. A key feature of xpSHACL is its use of a Violation KG to cache and reuse explanations, improving efficiency and consistency.
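A minimal sketch of the caching idea (field names and key structure are illustrative, not xpSHACL's actual Violation KG schema): violations that share the same shape/path/constraint signature reuse one LLM-generated explanation per language.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViolationKey:
    """Signature of a SHACL violation, independent of the focus node.

    Field names are illustrative; the real Violation KG may differ.
    """
    shape: str          # e.g. "ex:PersonShape"
    path: str           # e.g. "ex:birthDate"
    constraint: str     # e.g. "sh:maxCount"

class ExplanationCache:
    def __init__(self, explain_fn):
        # `explain_fn` stands in for the RAG + LLM explanation pipeline.
        self.explain_fn = explain_fn
        self.kg = {}  # (ViolationKey, lang) -> explanation text

    def explain(self, key: ViolationKey, lang: str = "en") -> str:
        # Same violation pattern + language -> reuse the cached text,
        # keeping explanations consistent and avoiding repeat LLM calls.
        cache_key = (key, lang)
        if cache_key not in self.kg:
            self.kg[cache_key] = self.explain_fn(key, lang)
        return self.kg[cache_key]

cache = ExplanationCache(lambda k, lang: f"[{lang}] {k.constraint} "
                                         f"violated on {k.path}")
k = ViolationKey("ex:PersonShape", "ex:birthDate", "sh:maxCount")
print(cache.explain(k))          # calls the pipeline once
print(cache.explain(k))          # served from the cache
```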
https://arxiv.org/abs/2507.08432
Large Language Models (LLMs) have demonstrated their transformative potential across numerous disciplinary studies, reshaping existing research methodologies and fostering interdisciplinary collaboration. However, a systematic understanding of their integration into diverse disciplines remains underexplored. This survey paper provides a comprehensive overview of the application of LLMs in interdisciplinary studies, categorising research efforts both from a technical perspective and with regard to their applicability. From a technical standpoint, key methodologies such as supervised fine-tuning, retrieval-augmented generation, agent-based approaches, and tool-use integration are examined, which enhance the adaptability and effectiveness of LLMs in discipline-specific contexts. From the perspective of their applicability, this paper explores how LLMs are contributing to various disciplines including mathematics, physics, chemistry, biology, and the humanities and social sciences, demonstrating their role in discipline-specific tasks. The prevailing challenges are critically examined and promising research directions are highlighted alongside the recent advances in LLMs. By providing a comprehensive overview of the technical developments and applications in this field, this survey aims to serve as an invaluable resource for researchers who are navigating the complex landscape of LLMs in the context of interdisciplinary studies.
https://arxiv.org/abs/2507.08425
Language models are prone to hallucination: generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain: obtaining factual gold data can be expensive, and training on correct but unfamiliar data may potentially lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied on both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models' own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models' judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across the three domains we study, suggesting that a model's own beliefs can provide a powerful signal for factuality.
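The best-performing configuration can be sketched as a simple filtering pipeline; `generate` and `self_judge` are stand-ins for real model calls, and the sample count and 0.5 threshold are arbitrary illustrations rather than the paper's settings:

```python
def build_finetune_set(prompts, generate, self_judge, n_samples=4):
    """Keep model generations that the model itself judges factual.

    A sketch of the configuration reported above: `generate` samples a
    long-form answer, `self_judge` returns the model's own probability
    that the answer is factual. Both are stand-ins for model calls.
    """
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            answer = generate(prompt)
            if self_judge(prompt, answer) > 0.5:
                kept.append({"prompt": prompt, "response": answer})
    return kept  # finetune on this instead of gold references

# Mock components, just to show the shape of the pipeline.
gen = lambda p: f"A plausible answer to: {p}"
judge = lambda p, a: 0.8 if "plausible" in a else 0.2
print(len(build_finetune_set(["Who wrote Hamlet?"], gen, judge)))  # 4
```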
https://arxiv.org/abs/2507.08371
Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, increasing the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation-critique-revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation. Our code is available at this https URL.
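The core interaction pattern might look like the following sketch; the callables and persona strings are illustrative stand-ins, not the released code, and the loop simply exposes the knobs the study varies (critic diversity and interaction depth):

```python
def ideate(topic, propose, critique, revise,
           critic_personas=("methods reviewer", "domain expert",
                            "feasibility skeptic"),
           depth=3):
    """Ideation-critique-revision loop with a diverse critic panel.

    `propose`, `critique`, and `revise` stand in for LLM calls; the
    persona strings are illustrative. Increasing `depth` (interaction
    depth) and the number/heterogeneity of personas corresponds to the
    factors the experiments above found to diversify ideas and improve
    feasibility.
    """
    idea = propose(topic)
    for _ in range(depth):
        feedback = [critique(idea, persona) for persona in critic_personas]
        idea = revise(idea, feedback)
    return idea
```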
https://arxiv.org/abs/2507.08350
Automatic n-gram-based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess both n-gram-based and neural evaluation metrics for generation, examining their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite across eight languages from four typological families: agglutinative, isolating, low-fusional, and high-fusional, spanning both low- and high-resource settings, to analyze the metrics' correlation with human judgments. Our findings highlight the sensitivity of evaluation metrics to the language type. For example, in fusional languages, n-gram-based metrics show lower correlation with human assessments compared to isolating and agglutinative languages. We also demonstrate that proper tokenization can significantly mitigate this issue for morphologically rich fusional languages, sometimes even reversing negative trends. Additionally, we show that neural metrics specifically trained for evaluation, such as COMET, consistently outperform other neural metrics and correlate better with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates for greater investment in neural metrics trained for evaluation tasks.
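A toy worked example of the tokenization effect: with whitespace tokens, two near-identical morphologically rich strings share no unigrams, while morph-level segmentation recovers the overlap. The segmenter here is a mock lookup table (a real system would use a learned morphological segmenter), and the example is Turkish-like for illustration only:

```python
def unigram_f1(candidate, reference, tokenize):
    """ROUGE-1-style F1 over unigrams under a given tokenizer."""
    c, r = tokenize(candidate), tokenize(reference)
    overlap = sum(min(c.count(t), r.count(t)) for t in set(c))
    if not overlap:
        return 0.0
    p, rec = overlap / len(c), overlap / len(r)
    return 2 * p * rec / (p + rec)

# One suffix differs per word, so whitespace tokens never match, but
# morph segmentation recovers the shared stems and suffixes.
cand, ref = "evlerimizde oturuyorduk", "evlerimde oturuyordum"
whitespace = str.split
morphs = {"evlerimizde": ["ev", "ler", "imiz", "de"],
          "evlerimde":   ["ev", "ler", "im", "de"],
          "oturuyorduk": ["otur", "uyor", "du", "k"],
          "oturuyordum": ["otur", "uyor", "du", "m"]}
morph_tok = lambda s: [m for w in s.split() for m in morphs[w]]

print(unigram_f1(cand, ref, whitespace))  # 0.0: no exact word matches
print(unigram_f1(cand, ref, morph_tok))   # 0.75: stems/suffixes overlap
```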
https://arxiv.org/abs/2507.08342
Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, few works systematically explore which methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.
https://arxiv.org/abs/2507.08339
Training text rerankers is crucial for information retrieval. Two primary strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied in the literature, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. We therefore recommend using knowledge distillation to train smaller rerankers if a larger, more powerful teacher is accessible; in its absence, contrastive learning provides a strong and more reliable alternative.
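The two objectives can be sketched side by side for a single query; this is a generic instantiation of listwise contrastive learning (InfoNCE) and score distillation (KL against teacher scores), not the paper's exact losses or hyperparameters:

```python
import torch
import torch.nn.functional as F

# Scores from a cross-encoder student for one query against its
# candidate passages: index 0 is the labeled positive, rest negatives.
student = torch.randn(1, 16, requires_grad=True)   # (queries, candidates)
teacher = torch.randn(1, 16)                       # larger reranker's scores

# (a) Contrastive learning: cross-entropy against the ground-truth
# positive, i.e., InfoNCE over the candidate list.
labels = torch.zeros(1, dtype=torch.long)          # positive at index 0
loss_contrastive = F.cross_entropy(student, labels)

# (b) Knowledge distillation: match the teacher's score distribution
# over the same candidates with a KL divergence.
loss_distill = F.kl_div(F.log_softmax(student, dim=-1),
                        F.log_softmax(teacher, dim=-1),
                        log_target=True, reduction="batchmean")
print(float(loss_contrastive), float(loss_distill))
```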
https://arxiv.org/abs/2507.08336
The Patent-Based Idea Generation task asks systems to turn real patents into product ideas viable within three years. We propose MK2, a prompt-centric pipeline: Gemini 2.5 drafts and iteratively edits a prompt, grafting useful fragments from weaker outputs; GPT-4.1 then uses this prompt to create one idea per patent; and an Elo loop judged by Qwen3-8B selects the best prompt, all without extra training data. Across three domains, two evaluator types, and six criteria, MK2 topped the automatic leaderboard and won 25 of 36 tests. Only the materials-chemistry track lagged, indicating the need for deeper domain grounding; yet the results show that lightweight prompt engineering has already delivered competitive, commercially relevant ideation from patents.
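The prompt-selection step might be sketched as a standard Elo tournament; `judge_prefers` stands in for the Qwen3-8B verdict, and the round-robin scheduling and K-factor are our own simplifications rather than MK2's actual setup:

```python
import itertools

def elo_rank(prompts, judge_prefers, k=32, init=1000.0, rounds=3):
    """Rank candidate prompts by Elo from pairwise judge verdicts.

    `judge_prefers(a, b)` is a stand-in for the LLM judgment and returns
    True if prompt `a`'s output beats prompt `b`'s.
    """
    rating = {p: init for p in prompts}
    for _ in range(rounds):
        for a, b in itertools.permutations(prompts, 2):
            # Standard Elo update: expected score from the rating gap,
            # then move both ratings toward the observed outcome.
            exp_a = 1 / (1 + 10 ** ((rating[b] - rating[a]) / 400))
            score_a = 1.0 if judge_prefers(a, b) else 0.0
            rating[a] += k * (score_a - exp_a)
            rating[b] += k * ((1 - score_a) - (1 - exp_a))
    return max(rating, key=rating.get)

# Mock judge that always prefers the longer prompt, to run end-to-end.
best = elo_rank(["p1", "p2 long", "p3 longer"],
                lambda a, b: len(a) > len(b))
print(best)  # "p3 longer"
```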
https://arxiv.org/abs/2507.08335
In e-commerce private-domain channels such as instant messaging and e-mail, merchants engage customers directly as part of their Customer Relationship Management (CRM) programmes to drive retention and conversion. While a few top performers excel at crafting outbound messages, most merchants struggle to write persuasive copy because they lack both expertise and scalable tools. We introduce CRMAgent, a multi-agent system built on large language models (LLMs) that generates high-quality message templates and actionable writing guidance through three complementary modes. First, group-based learning enables the agent to learn from a merchant's own top-performing messages within the same audience segment and rewrite low-performing ones. Second, retrieval-and-adaptation fetches templates that share the same audience segment and exhibit high similarity in voucher type and product category, learns their successful patterns, and adapts them to the current campaign. Third, a rule-based fallback provides a lightweight zero-shot rewrite when no suitable references are available. Extensive experiments show that CRMAgent consistently outperforms merchants' original templates, delivering significant gains in both audience-match and marketing-effectiveness metrics.
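The mode dispatch can be sketched as follows; all callables and the campaign fields are stand-ins for the LLM-backed components and data described above, and the fallback ordering mirrors the three modes in sequence:

```python
def generate_template(campaign, own_top_messages, retrieve_similar,
                      rewrite_from_examples, rule_based_rewrite):
    """Dispatch among CRMAgent's three modes (a simplified sketch).

    All callables stand in for LLM-backed components: learn from the
    merchant's own winners, else adapt retrieved look-alike templates,
    else fall back to a zero-shot rule-based rewrite.
    """
    # 1) Group-based learning: reuse the merchant's own top-performing
    #    messages from the same audience segment as rewrite exemplars.
    exemplars = own_top_messages(campaign["merchant"], campaign["segment"])
    if exemplars:
        return rewrite_from_examples(campaign, exemplars)

    # 2) Retrieval-and-adaptation: templates from other campaigns with
    #    the same segment and similar voucher type / product category.
    similar = retrieve_similar(campaign["segment"],
                               campaign["voucher_type"],
                               campaign["category"])
    if similar:
        return rewrite_from_examples(campaign, similar)

    # 3) Rule-based fallback: lightweight zero-shot rewrite.
    return rule_based_rewrite(campaign)
```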
https://arxiv.org/abs/2507.08325