An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
https://arxiv.org/abs/2601.15809
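The test-time intervention described above can be sketched, under simplifying assumptions, as shifting a hidden activation toward a precomputed English-centroid direction. The centroids, coefficient, and function names here are hypothetical illustrations, not the paper's exact procedure:

```python
# Hypothetical sketch of test-time activation steering toward an
# "English pivot": shift a hidden state h toward the mean English
# activation mu_en and away from the source-language mean mu_src.

def steer(h, mu_en, mu_src, alpha=0.5):
    """Return h + alpha * (mu_en - mu_src), element-wise."""
    return [hi + alpha * (e - s) for hi, e, s in zip(h, mu_en, mu_src)]

h = [0.2, -0.1, 0.4]       # hidden activation for a non-English input
mu_en = [0.5, 0.0, 0.1]    # mean English activation (precomputed)
mu_src = [0.1, -0.2, 0.5]  # mean source-language activation (precomputed)

steered = steer(h, mu_en, mu_src, alpha=0.5)
```

In practice the steered state would feed back into the metric's scoring head; here only the arithmetic of the intervention is shown.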
While large language models (LLMs) are increasingly used to summarize long documents, this trend poses significant challenges in the legal domain, where the factual accuracy of deposition summaries is crucial. Nugget-based methods have been shown to be extremely helpful for the automated evaluation of summarization approaches. In this work, we translate these methods to the user side and explore how nuggets could directly assist end users. Although prior systems have demonstrated the promise of nugget-based evaluation, its potential to support end users remains underexplored. Focusing on the legal domain, we present a prototype that leverages a factual nugget-based approach to support legal professionals in two concrete scenarios: (1) determining which of two summaries is better, and (2) manually improving an automatically generated summary.
https://arxiv.org/abs/2601.15182
Video summarization is a crucial technique for social understanding, enabling efficient browsing of massive multimedia content and extraction of key information from social platforms. Most existing unsupervised summarization methods rely on Generative Adversarial Networks (GANs) to enhance keyframe selection and generate coherent video summaries through adversarial training. However, such approaches primarily exploit unimodal features, overlooking the guiding role of semantic information in keyframe selection, and often suffer from unstable training. To address these limitations, we propose a novel Semantic-Guided Unsupervised Video Summarization method. Specifically, we design a novel frame-level semantic alignment attention mechanism and integrate it into a keyframe selector, which guides the Transformer-based generator within the adversarial framework to better reconstruct videos. In addition, we adopt an incremental training strategy to progressively update the model components, effectively mitigating the instability of GAN training. Experimental results demonstrate that our approach achieves superior performance on multiple benchmark datasets.
https://arxiv.org/abs/2601.14773
Large Language Models have demonstrated profound utility in the medical domain. However, their application to autonomous Electronic Health Records (EHRs) navigation remains constrained by a reliance on curated inputs and simplified retrieval tasks. To bridge the gap between idealized experimental settings and realistic clinical environments, we present AgentEHR. This benchmark challenges agents to execute complex decision-making tasks, such as diagnosis and treatment planning, requiring long-range interactive reasoning directly within raw and high-noise databases. In tackling these tasks, we identify that existing summarization methods inevitably suffer from critical information loss and fractured reasoning continuity. To address this, we propose RetroSum, a novel framework that unifies a retrospective summarization mechanism with an evolving experience strategy. By dynamically re-evaluating interaction history, the retrospective mechanism prevents long-context information loss and ensures unbroken logical coherence. Additionally, the evolving strategy bridges the domain gap by retrieving accumulated experience from a memory bank. Extensive empirical evaluations demonstrate that RetroSum achieves performance gains of up to 29.16% over competitive baselines, while significantly decreasing total interaction errors by up to 92.3%.
https://arxiv.org/abs/2601.13918
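The retrospective mechanism above contrasts with incremental summarization, where a running summary is only ever appended to. A toy sketch of the difference, with a stub standing in for the LLM summarizer (all names and the relevance flag are assumptions for illustration):

```python
# Hypothetical sketch of retrospective (vs. incremental) summarization of
# an agent's interaction history: at each checkpoint the *whole* history
# is re-evaluated, so early mistakes in the summary cannot persist.

def summarize(events):
    # toy stand-in for an LLM summarizer: keep events still marked relevant
    return [e for e in events if e.get("relevant", True)]

class RetrospectiveMemory:
    def __init__(self, every=2):
        self.history = []
        self.every = every
        self.summary = []

    def observe(self, event):
        self.history.append(event)
        if len(self.history) % self.every == 0:
            # re-evaluate the full history, not just the new events
            self.summary = summarize(self.history)

mem = RetrospectiveMemory(every=2)
mem.observe({"step": 1, "note": "query labs", "relevant": True})
mem.observe({"step": 2, "note": "wrong table", "relevant": False})
```

After the second observation, the re-evaluation drops the dead-end step that an append-only summary would have kept.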
In this paper, we propose active recap learning (ARL), a framework for enhancing large language models (LLMs) in understanding long contexts. ARL enables models to revisit and summarize earlier content through targeted sequence construction during continued pretraining and retrospective summarization at inference. First, we identify key tokens in a prepared long context based on loss gaps between long and short forward contexts, find the most relevant preceding paragraphs, and summarize them using an LLM. Second, ARL equips models with the ability to autonomously generate and utilize these retrospective summaries during inference, thereby establishing a recursive memory mechanism across paragraphs. Experimental results show substantial gains, with ARL achieving a 26.8% improvement on RULER and a 9.44% improvement on LongBench. Overall, ARL offers a simple yet effective continued-pretraining-based approach to strengthening long-context understanding, advancing scalable memory augmentation in LLMs.
https://arxiv.org/abs/2601.13734
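The loss-gap criterion for key-token identification might look as follows; the per-token losses and top-k selection rule are illustrative assumptions, not the paper's exact thresholds:

```python
# Sketch: a token is "key" when its loss drops sharply once the long
# context becomes visible, i.e. the gap loss_short - loss_long is large.

def key_tokens(tokens, loss_short, loss_long, top_k=2):
    """Rank tokens by loss gap and return the top_k most context-dependent."""
    gaps = [(s - l, t) for t, s, l in zip(tokens, loss_short, loss_long)]
    gaps.sort(reverse=True)
    return [t for _, t in gaps[:top_k]]

tokens = ["the", "treaty", "was", "signed", "1648"]
loss_short = [0.1, 4.2, 0.2, 1.0, 5.0]  # loss with truncated context
loss_long  = [0.1, 0.9, 0.2, 0.8, 1.1]  # loss with full long context

keys = key_tokens(tokens, loss_short, loss_long)
```

Tokens like the date and the entity, whose prediction depends on distant context, surface first; function words with near-zero gap are ignored.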
Ensuring that collections of natural-language facts are globally consistent is essential for tasks such as fact-checking, summarization, and knowledge base construction. While Large Language Models (LLMs) can assess the consistency of small subsets of facts, their judgments are noisy, and pairwise checks are insufficient to guarantee global coherence. We formalize this problem and show that verifying global consistency requires exponentially many oracle queries in the worst case. To make the task practical, we propose an adaptive divide-and-conquer algorithm that identifies minimal inconsistent subsets (MUSes) of facts and optionally computes minimal repairs through hitting-sets. Our approach has low-degree polynomial query complexity. Experiments with both synthetic and real LLM oracles show that our method efficiently detects and localizes inconsistencies, offering a scalable framework for linguistic consistency verification with LLM-based evaluators.
https://arxiv.org/abs/2601.13600
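The paper's adaptive divide-and-conquer algorithm is more involved; as a simpler illustration of MUS extraction against a consistency oracle, the classic deletion-based shrink looks like this (the oracle below is a toy stand-in, not an LLM):

```python
# Deletion-based minimal-inconsistent-subset (MUS) extraction.
# oracle(facts) returns True when the fact set is consistent.

def shrink_to_mus(facts, oracle):
    """Try removing each fact; drop it only if the remainder is still
    inconsistent (the fact is not needed for the contradiction).
    The survivors form a minimal inconsistent subset."""
    core = list(facts)
    for f in list(core):
        trial = [g for g in core if g != f]
        if not oracle(trial):   # still inconsistent without f
            core = trial        # f is not needed; drop it
    return core

# Toy oracle: a set is inconsistent iff it contains a claim and its negation.
def toy_oracle(facts):
    return not any(("not " + f) in facts for f in facts)

facts = ["sky is blue", "grass is green", "not sky is blue", "water is wet"]
mus = shrink_to_mus(facts, toy_oracle)
```

This variant makes one oracle call per fact; the divide-and-conquer strategy in the paper reduces that further for large, mostly consistent collections.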
Recent advances in audio-language models have demonstrated remarkable success on short, segment-level speech tasks. However, real-world applications such as meeting transcription, spoken document understanding, and conversational analysis require robust models capable of processing and reasoning over long-form audio. In this work, we present LongSpeech, a large-scale and scalable benchmark specifically designed to evaluate and advance the capabilities of speech models on long-duration audio. LongSpeech comprises over 100,000 speech segments, each approximately 10 minutes long, with rich annotations for ASR, speech translation, summarization, language detection, speaker counting, content separation, and question answering. We introduce a reproducible pipeline for constructing long-form speech benchmarks from diverse sources, enabling future extensions. Our initial experiments with state-of-the-art models reveal significant performance gaps, with models often specializing in one task at the expense of others and struggling with higher-level reasoning. These findings underscore the challenging nature of our benchmark. Our benchmark will be made publicly available to the research community.
https://arxiv.org/abs/2601.13539
The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, do such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones, frequently leading to systematic failure. (1) In Embodied settings, dLLMs suffer from repeated attempts, failing to branch under temporal feedback. (2) In Tool-Calling settings, dLLMs fail to maintain symbolic precision (e.g., strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi-agent evaluation framework that integrates dLLMs as plug-and-play cognitive cores. Our analysis shows that dLLMs are effective in non-causal roles (e.g., memory summarization and tool selection) but require the incorporation of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.
https://arxiv.org/abs/2601.12979
Audio large language models (LLMs) enable unified speech understanding and generation, yet their adaptation to linguistically complex, dialect-rich settings remains underexplored. This paper presents the first systematic study of multi-task instruction tuning for an Arabic-centric audio LLM, covering a hierarchy of generative tasks (ASR, speech summarization) and discriminative tasks (dialect and emotion identification). To support this study, we introduce AraMega-SSum, a novel dataset for Arabic speech summarization. We fine-tune Qwen2.5-Omni (7B) and propose Task-Progressive Curriculum (TPC) along with Aligner-Based Diverse Sampling (ADS), a strategy that constructs information-dense batches by selecting task- and label-balanced examples. Our results reveal a critical efficiency-robustness trade-off: while ADS accelerates initial convergence and boosts paralinguistic F1-scores, its inherent gradient volatility can destabilize generative decoding under prolonged training. Furthermore, while the TPC stabilizes core acoustic mapping, it often induces negative transfer in downstream tasks. We demonstrate that a Hybrid TPC+ADS Strategy provides an optimal training "recipe", first establishing a robust representative foundation before employing diversity-aware refinement to capture fine-grained nuances. These findings offer practical guidance for the efficient adaptation of Omni-models in complex, low-resource multimodal environments.
https://arxiv.org/abs/2601.12494
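Batch construction in the spirit of ADS, i.e. selecting task- and label-balanced examples rather than taking them in arrival order, can be sketched as a round-robin over per-label queues (a simplified assumption; the paper's aligner-based selection is richer):

```python
# Toy label-balanced batch construction: cycle through per-label queues
# so each batch covers the label space as evenly as possible.

from collections import defaultdict
from itertools import cycle

def balanced_batches(examples, batch_size):
    """examples: list of (label, payload) pairs. Yields batches that
    round-robin over labels instead of using arrival order."""
    queues = defaultdict(list)
    for label, payload in examples:
        queues[label].append((label, payload))
    order = cycle(sorted(queues))
    batch = []
    remaining = len(examples)
    while remaining:
        label = next(order)
        if queues[label]:
            batch.append(queues[label].pop(0))
            remaining -= 1
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

data = [("asr", 1), ("asr", 2), ("asr", 3), ("emotion", 4), ("dialect", 5)]
batches = list(balanced_batches(data, batch_size=3))
```

Here the first batch mixes all three task labels even though ASR examples dominate the pool, which is the "information-dense" property ADS targets.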
Video summarization helps turn long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains like surgical training. Prior work has progressed from using basic visual features like color, motion, and structural changes to using pre-trained vision-language models that can better understand what's happening in the video (semantics) and capture temporal flow, resulting in more context-aware video summarization. We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis, that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful and procedural transitions while filtering out generic or hallucinated content, resulting in contextually coherent summaries across both domain-specific and instructional videos. We evaluate our method on instructional and activity datasets, using reference summaries for instructional videos. Despite sampling fewer than 5% of the original frames, our summaries retain 84% semantic content while improving over baselines by as much as 33%. Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.
https://arxiv.org/abs/2601.12243
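The adaptive visual sampling stage can be illustrated with a toy rule: sample densely where frame features change quickly and sparsely where the scene is static. The 1-D features and threshold below are hypothetical stand-ins for real frame embeddings:

```python
# Toy adaptive sampling: keep frame i whenever its feature has moved
# more than `threshold` since the last kept frame.

def adaptive_sample(features, threshold=0.5):
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        if abs(features[i] - features[kept[-1]]) > threshold:
            kept.append(i)
    return kept

feats = [0.0, 0.1, 0.2, 1.0, 1.05, 2.0]  # hypothetical per-frame features
idx = adaptive_sample(feats, threshold=0.5)
```

Long static stretches collapse to a single kept frame, which is how a summary can sample under 5% of frames while preserving most transitions.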
This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes from patient storytelling data, offering insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews). We first evaluate BERTopic and Top2Vec for individual interview summarization, using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison on keyword extraction. An LLM (GPT-4) is then used for the next step, topic labeling. The outputs for a single interview (I0) are rated through a small-scale human evaluation focusing on coherence, clarity, and relevance. Based on the preliminary results and evaluation, BERTopic shows stronger performance and is selected for further experimentation using three clinically oriented embedding models. We then analyzed the full interview collection with the best model setting. Results show that domain-specific embeddings improved topic precision and interpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of the full dataset of 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics throughout all 13 interviews, namely "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey". Although the interviews are machine translations from Dutch to English, and clinical professionals are not involved in this evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows.
https://arxiv.org/abs/2601.12154
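The keyword-scoring idea behind BERTopic is class-based TF-IDF (c-TF-IDF): concatenate each topic's documents into one class document and weight terms by how specific they are to that class. A minimal sketch follows; BERTopic's actual formula differs in its normalization, so treat this as an approximation:

```python
# Minimal c-TF-IDF sketch: term frequency within a topic, discounted by
# how many topics the term appears in.

import math
from collections import Counter

def ctfidf(topic_docs):
    """topic_docs: {topic_id: [token, ...]} with all documents of a
    topic already concatenated. Returns {topic_id: {term: score}}."""
    n_topics = len(topic_docs)
    df = Counter()                      # topic-level document frequency
    for tokens in topic_docs.values():
        df.update(set(tokens))
    scores = {}
    for topic, tokens in topic_docs.items():
        tf = Counter(tokens)
        total = len(tokens)
        scores[topic] = {
            t: (c / total) * math.log(1 + n_topics / df[t])
            for t, c in tf.items()
        }
    return scores

docs = {
    0: ["chemo", "chemo", "nurse", "care"],
    1: ["decision", "treatment", "care"],
}
s = ctfidf(docs)
top0 = max(s[0], key=s[0].get)  # highest-scoring keyword for topic 0
```

Shared terms like "care" are discounted, so each topic surfaces its own distinctive vocabulary as keywords.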
Large language models are increasingly deployed as research agents for deep search and long-horizon information seeking, yet their performance often degrades as interaction histories grow. This degradation, known as context rot, reflects a failure to maintain coherent and task-relevant internal states over extended reasoning horizons. Existing approaches primarily manage context through raw accumulation or passive summarization, treating it as a static artifact and allowing early errors or misplaced emphasis to persist. Motivated by this perspective, we propose ARC, which is the first framework to systematically formulate context management as an active, reflection-driven process that treats context as a dynamic internal reasoning state during execution. ARC operationalizes this view through reflection-driven monitoring and revision, allowing agents to actively reorganize their working context when misalignment or degradation is detected. Experiments on challenging long-horizon information-seeking benchmarks show that ARC consistently outperforms passive context compression methods, achieving up to an 11% absolute improvement in accuracy on BrowseComp-ZH with Qwen2.5-32B-Instruct.
https://arxiv.org/abs/2601.12030
Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: Sentence_Cut, Drop, Partial_Summarization and Pronominalization, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We adapt these actions in a large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and word-level monotonicity, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese, English-German and English-Japanese benchmarks show that our framework consistently improves semantic metrics and achieves lower delay compared to reference translations and salami-based baselines. Notably, combining Drop and Sentence_Cut leads to consistent improvements in the balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
https://arxiv.org/abs/2601.11002
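The extended action space can be made concrete as an enum plus a policy that picks among the six actions. The rule-based policy below is a toy stand-in for illustration only; the paper's policy is realized through an LLM with action-aware prompting:

```python
# Toy sketch of the extended SiMT action space: beyond READ and WRITE,
# a policy may cut a sentence early, drop filler, summarize a clause,
# or pronominalize a repeated entity.

from enum import Enum, auto

class Action(Enum):
    READ = auto()
    WRITE = auto()
    SENTENCE_CUT = auto()
    DROP = auto()
    PARTIAL_SUMMARIZATION = auto()
    PRONOMINALIZATION = auto()

FILLERS = {"uh", "um"}

def choose_action(buffer, seen_entities):
    """Toy rule-based policy over the source-token buffer."""
    last = buffer[-1] if buffer else None
    if last in FILLERS:
        return Action.DROP            # omit disfluencies
    if last in seen_entities:
        return Action.PRONOMINALIZATION  # replace repeated entity
    if last == ".":
        return Action.SENTENCE_CUT    # restructure at sentence boundary
    if len(buffer) >= 4:
        return Action.WRITE           # enough context to emit
    return Action.READ                # otherwise keep reading

acts = [choose_action(["the", "president", "uh"], set()),
        choose_action(["the", "president"], {"president"}),
        choose_action(["so"], set())]
```

Even this toy version shows why the richer action space lowers latency: DROP and SENTENCE_CUT let the system emit earlier instead of waiting for full source sentences.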
This paper presents team Kl33n3x's multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. By leveraging knowledge distillation techniques, this work demonstrates that compact models can achieve highly competitive performance across nine languages. The system achieved strong win rates across the competition's tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA), demonstrating the effectiveness of translation-based approaches for low-resource language processing without task-specific fine-tuning.
https://arxiv.org/abs/2601.09059
Summarization of multi-party dialogues is a critical capability in industry, enhancing knowledge transfer and operational effectiveness across many domains. However, automatically generating high-quality summaries is challenging, as the ideal summary must satisfy a set of complex, multi-faceted requirements. While summarization has received immense attention in research, prior work has primarily utilized static datasets and benchmarks, a condition rare in practical scenarios where requirements inevitably evolve. In this work, we present an industry case study on developing an agentic system to summarize multi-party interactions. We share practical insights spanning the full development lifecycle to guide practitioners in building reliable, adaptable summarization systems, as well as to inform future research, covering: 1) robust methods for evaluation despite evolving requirements and task subjectivity, 2) component-wise optimization enabled by the task decomposition inherent in an agentic architecture, 3) the impact of upstream data bottlenecks, and 4) the realities of vendor lock-in due to the poor transferability of LLM prompts.
https://arxiv.org/abs/2601.08682
The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at this https URL.
https://arxiv.org/abs/2601.08654
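Lightweight Wasserstein-based calibration in one dimension amounts to quantile mapping between the judge's score distribution and a human reference distribution, since quantile matching minimizes the 1-Wasserstein distance between them. A sketch under that assumption (the paper's exact estimator may differ):

```python
# Post-hoc calibration by empirical quantile mapping: place a new judge
# score at its quantile among past judge scores, then read off the same
# quantile of the human score distribution.

def quantile_map(judge_scores, human_scores, x):
    """Map a new judge score x onto the human grading scale."""
    js = sorted(judge_scores)
    hs = sorted(human_scores)
    rank = sum(1 for v in js if v <= x)   # empirical quantile of x
    q = rank / len(js)
    idx = min(int(q * len(hs)), len(hs) - 1)
    return hs[idx]                        # matching human order statistic

judge = [0.9, 0.92, 0.95, 0.97]  # judge compresses scores near the top
human = [2, 3, 4, 5]             # humans use the full 1-5 rubric scale
calibrated = quantile_map(judge, human, 0.93)
```

Note that no model parameters change: the mapping is a pure post-processing step on scores, matching the frozen-judge setting described above.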
Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
https://arxiv.org/abs/2601.08510
This paper builds on the efficiency of automatic summarization while addressing the challenge of generating personalized summaries tailored to individual users' interests and requirements. To tackle this challenge, we introduce SummPilot, an interaction-based customizable summarization system. SummPilot leverages a large language model to facilitate both automatic and interactive summarization. Users can engage with the system to understand document content and personalize summaries through interactive components such as semantic graphs, entity clustering, and explainable evaluation. Our demo and user studies demonstrate SummPilot's adaptability and usefulness for customizable summarization.
https://arxiv.org/abs/2601.08475
Large language models frequently generate plausible but unfaithful summaries that users cannot verify against source text, a critical limitation in compliance-sensitive domains such as government and legal analysis. We present sui-1, a 24B parameter model that produces abstractive summaries with inline citations, enabling users to trace each claim to its source sentence. Our synthetic data pipeline combines chain-of-thought prompting with multi-stage verification, generating over 22,000 high-quality training examples across five languages from diverse sources including parliamentary documents, web text, and Wikipedia. Evaluation shows sui-1 significantly outperforms all tested open-weight baselines, including models with 3x more parameters. These results demonstrate that task-specific training substantially outperforms scale alone for citation-grounded summarization. Model weights and an interactive demo are publicly available.
https://arxiv.org/abs/2601.08472
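A summary with inline citations can be mechanically checked for grounding: every cited index must point at an existing source sentence. The bracket format and helper name below are assumptions for illustration, not sui-1's actual notation:

```python
# Toy citation-grounding check: find inline citations [n] in a summary
# that do not correspond to any source sentence.

import re

def check_citations(summary, source_sentences):
    """Return the sorted list of cited sentence indices that do not exist
    in the source (1-based indexing assumed)."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", summary)}
    valid = set(range(1, len(source_sentences) + 1))
    return sorted(cited - valid)

source = ["The council met on Tuesday.", "It approved the budget."]
summary = "The council approved the budget [2] after meeting Tuesday [1]."
bad = check_citations(summary, source)  # empty list: all citations resolve
```

A verification step like this only confirms that citations resolve; whether each claim is actually entailed by its cited sentence still requires semantic checking, which is what the multi-stage verification pipeline targets.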
Identifying the strengths and limitations of a research paper is a core component of any literature review. However, traditional summaries reflect only the authors' self-presented perspective. Analyzing how other researchers discuss and cite the paper can offer a deeper, more practical understanding of its contributions and shortcomings. In this research, we introduce SECite, a novel approach for evaluating scholarly impact through sentiment analysis of citation contexts. We develop a semi-automated pipeline to extract citations referencing nine research papers and apply advanced natural language processing (NLP) techniques with unsupervised machine learning to classify these citation statements as positive or negative. Beyond sentiment classification, we use generative AI to produce sentiment-specific summaries that capture the strengths and limitations of each target paper, derived both from clustered citation groups and from the full text. Our findings reveal meaningful patterns in how the academic community perceives these works, highlighting areas of alignment and divergence between external citation feedback and the authors' own presentation. By integrating citation sentiment analysis with LLM-based summarization, this study provides a comprehensive framework for assessing scholarly contributions.
https://arxiv.org/abs/2601.07939