Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, rendering summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, on which persistent homology identifies core semantics and logical structures; these are preserved in a "protection pool" that serves as the backbone for summarization. We design a topology-guided iterative strategy in which lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, strikes a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
https://arxiv.org/abs/2602.09821
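The abstract above does not spell out how persistent homology is computed over the semantic graph. A minimal sketch, under assumptions of my own (toy "sentence embeddings", Euclidean distances, 0-dimensional persistence only, no learned weighting): a Kruskal-style union-find over the edge filtration yields component death times, and a large gap before the last death marks the persistent core structure a "protection pool" could preserve.

```python
import numpy as np

def zero_dim_persistence(dist):
    """0-dimensional persistent homology via union-find over the edge
    filtration: every merge of two connected components records a death
    time (all components are born at filtration value 0)."""
    n = dist.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted((dist[i, j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(w)   # a component dies at filtration value w
    return deaths              # n-1 finite deaths; one class lives forever

# Toy 'sentence embeddings': two tight semantic clusters.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.05, (4, 8)),
                 rng.normal(3.0, 0.05, (4, 8))])
dist = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
deaths = zero_dim_persistence(dist)
# The single long-lived merge (the inter-cluster edge) dwarfs the
# short intra-cluster merges: that gap is the persistent structure.
```

This is only the topological ingredient; the paper's graph weighting, protection pool, and proxy metrics are not reproduced here.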
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgements. With task-specific prompts ranging from best-candidate selection, summarization, and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
https://arxiv.org/abs/2602.09624
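The ensemble-without-coordination idea above can be sketched as independent evaluators whose judgements are only combined after the fact. The aggregation rules below (mean for continuous scores, majority vote for discrete labels) are plausible stand-ins, not the framework's actual schema.

```python
import statistics

def aggregate(judgements):
    """Combine judgements from independently prompted evaluators.
    No evaluator sees another's output; coordination happens only in
    this aggregation step. Continuous scores are averaged, discrete
    labels decided by majority vote."""
    if all(isinstance(j, (int, float)) for j in judgements):
        return statistics.mean(judgements)
    return statistics.mode(judgements)

# Continuous scoring: three evaluators rate a summary on a 1-5 scale.
score = aggregate([4, 5, 3])
# Discrete judgement: best-candidate selection between "A" and "B".
winner = aggregate(["A", "A", "B"])
```

A per-task prompt would be given to each evaluator separately; only the returned judgements enter `aggregate`.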
Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text, a phenomenon known as hallucination, which undermines their reliability and utility in downstream tasks. We hypothesize that a generated text span's factuality is correlated with its representational instability across the model's internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers and use them to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question answering, summarization, and code generation, demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
https://arxiv.org/abs/2602.09486
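The abstract does not specify CoCoA's two instability metrics, so the proxy below is an assumption: read next-token distributions off successive middle layers ("logit lens" style) and take the mean KL divergence between adjacent layers as an instability score, then subtract it from the candidate's log-probability.

```python
import numpy as np

def softmax(x, axis=-1):
    x = np.asarray(x, float)
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_instability(layer_logits):
    """Assumed instability proxy (not the paper's metric): mean KL
    divergence between next-token distributions of successive layers."""
    probs = softmax(layer_logits)
    kls = [float(np.sum(a * np.log((a + 1e-12) / (b + 1e-12))))
           for a, b in zip(probs[:-1], probs[1:])]
    return float(np.mean(kls))

def cocoa_score(final_logprob, layer_logits, alpha=1.0):
    """Penalized decoding score: final-layer log-probability minus an
    instability penalty, steering decoding toward candidates the model
    is internally consistent about."""
    return final_logprob - alpha * layer_instability(layer_logits)

# Two candidates with equal final log-prob: layers agree on the first,
# flip wildly on the second (a 3-token toy vocabulary).
stable   = [[5.0, 0.0, 0.0]] * 4
unstable = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0],
            [0.0, 0.0, 5.0], [5.0, 0.0, 0.0]]
```

The SIG variant would additionally gate `alpha` by the token's self-information; that gating is omitted here.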
While there is rapid progress in video-LLMs with advanced reasoning capabilities, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data for each sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose leveraging auxiliary, freely available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint source domain, to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.
https://arxiv.org/abs/2602.08996
We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which leverage automated metrics to compare machine-generated summaries of a collection of opinion pieces, e.g., product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLMs). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method is based on measuring the similarity between the claims in a given summary and those from the original reviews, thereby assessing the coverage and consistency of the generated summary. To do so, we rely on a simple approach to extract factual claims from texts, which we then compare and aggregate into a suitable score. We demonstrate that the proposed metric attributes higher scores to similar claims, regardless of whether the claim is negated, paraphrased, or expanded, and that the score correlates more highly with human judgment than state-of-the-art metrics do.
https://arxiv.org/abs/2602.08709
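The claim-matching scheme above can be sketched with a toy similarity function. The Jaccard word overlap and the threshold below are illustrative stand-ins for whatever claim similarity the actual metric computes; only the coverage/consistency bookkeeping is meant to mirror the described method.

```python
def claim_sim(a, b):
    """Toy claim similarity: Jaccard overlap of word sets (a stand-in
    for a real semantic similarity over claim embeddings)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def coverage_consistency(summary_claims, review_claims, thresh=0.3):
    """Consistency: share of summary claims supported by some review
    claim. Coverage: share of review claims reflected by some summary
    claim. The 0.3 threshold is an arbitrary illustration value."""
    consistency = sum(
        max(claim_sim(s, r) for r in review_claims) >= thresh
        for s in summary_claims) / len(summary_claims)
    coverage = sum(
        max(claim_sim(r, s) for s in summary_claims) >= thresh
        for r in review_claims) / len(review_claims)
    return coverage, consistency

reviews = ["the battery lasts two days",
           "the screen is very bright",
           "shipping was slow"]
summary = ["battery life lasts two days", "the screen is bright"]
cov, con = coverage_consistency(summary, reviews)
# Both summary claims are supported (consistency 1.0), but the
# shipping complaint is not covered (coverage 2/3).
```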
Retrieval-Augmented Generation (RAG) often struggles with knowledge conflicts, where model-internal parametric knowledge overrides retrieved evidence, leading to unfaithful outputs. Existing approaches are often limited, relying either on superficial decoding adjustments or weight editing that necessitates ground-truth targets. Through layer-wise analysis, we attribute this failure to a parametric suppression phenomenon: specifically, in deep layers, certain FFN layers overwrite context-sensitive representations with memorized priors. To address this, we propose CoRect (Context-Aware Logit Contrast for Hidden State Rectification). By contrasting logits from contextualized and non-contextualized forward passes, CoRect identifies layers that exhibit high parametric bias without requiring ground-truth labels. It then rectifies the hidden states to preserve evidence-grounded information. Across question answering (QA) and summarization benchmarks, CoRect consistently improves faithfulness and reduces hallucinations compared to strong baselines.
https://arxiv.org/abs/2602.08221
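The layer-identification step of CoRect can be illustrated with a simplified contrast: compare each layer's next-token logits from a contextualized and a non-contextualized forward pass. The cosine-similarity scoring below is my simplification, not the paper's exact contrast, but it captures the label-free idea that a layer whose output barely moves when retrieved evidence is supplied is dominated by parametric memory.

```python
import numpy as np

def parametric_bias(ctx_logits, noctx_logits):
    """Per-layer bias proxy: cosine similarity between next-token
    logits with and without the retrieved context in the prompt.
    High similarity = the layer ignores the evidence (parametric bias)."""
    ctx = np.asarray(ctx_logits, float)
    noctx = np.asarray(noctx_logits, float)
    num = (ctx * noctx).sum(axis=-1)
    den = np.linalg.norm(ctx, axis=-1) * np.linalg.norm(noctx, axis=-1)
    return num / den

# Three toy layers over a 2-token vocabulary: layer 2's logits are
# identical with and without context, i.e. fully parametric.
ctx_logits   = [[1.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
noctx_logits = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
bias = parametric_bias(ctx_logits, noctx_logits)
suspect = int(np.argmax(bias))   # candidate layer for rectification
```

The rectification itself (adjusting the suspect layer's hidden states toward the evidence-grounded direction) is not reproduced here.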
Local governance meeting records are official documents, in the form of minutes or transcripts, documenting how proposals, discussions, and procedural actions unfold during institutional meetings. While generally structured, these documents are often dense, bureaucratic, and highly heterogeneous across municipalities, exhibiting significant variation in language, terminology, structure, and overall organization. This heterogeneity makes them difficult for non-experts to interpret and challenging for intelligent automated systems to process, limiting public transparency and civic engagement. To address these challenges, computational methods can be employed to structure and interpret such complex documents. In particular, Natural Language Processing (NLP) offers well-established methods that can enhance the accessibility and interpretability of governmental records. In this focus article, we review foundational NLP tasks that support the structuring of local governance meeting documents. Specifically, we review three core tasks: document segmentation, domain-specific entity extraction and automatic text summarization, which are essential for navigating lengthy deliberations, identifying political actors and personal information, and generating concise representations of complex decision-making processes. In reviewing these tasks, we discuss methodological approaches, evaluation metrics, and publicly available resources, while highlighting domain-specific challenges such as data scarcity, privacy constraints, and source variability. By synthesizing existing work across these foundational tasks, this article provides a structured overview of how NLP can enhance the structuring and accessibility of local governance meeting records.
https://arxiv.org/abs/2602.08162
Personalizing large language models (LLMs) to individual users requires incorporating extensive interaction histories and profiles, but input token constraints make this impractical due to high inference latency and API costs. Existing approaches rely on heuristic methods such as selecting recent interactions or prompting summarization models to compress user profiles. However, these methods treat context as a monolithic whole and fail to consider how LLMs internally process and prioritize different profile components. We investigate whether LLMs' attention patterns can effectively identify important personalization signals for intelligent context compression. Through preliminary studies on representative personalization tasks, we discover that (a) LLMs' attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs' ability to distinguish between relevant and irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that leverages attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Extensive experiments demonstrate that Attn-GS significantly outperforms various baselines across different tasks, token limits, and settings, achieving performance close to using full context while reducing token usage by 50 times.
https://arxiv.org/abs/2602.07778
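The marking step of Attn-GS can be sketched as attention-guided sentence selection under a token budget. The greedy selection and whitespace token counting below are illustrative assumptions; the actual framework passes the marked sentences on to a compression model rather than using them directly.

```python
def attention_select(sentences, attention, token_budget):
    """Rank profile sentences by the attention mass the marking model
    assigned to them, keep the top-ranked ones that fit the budget,
    and restore the original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: -attention[i])
    kept, used = [], 0
    for i in ranked:
        cost = len(sentences[i].split())      # crude token count
        if used + cost <= token_budget:
            kept.append(i)
            used += cost
    return [sentences[i] for i in sorted(kept)]

profile = ["user likes jazz",
           "user lives in Berlin",
           "user mentioned the weather once",
           "user is vegetarian"]
attn = [0.40, 0.30, 0.05, 0.25]               # hypothetical attention mass
compressed = attention_select(profile, attn, token_budget=7)
```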
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges show biases for length and order, among others, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work we provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity between the judged summaries and the human-written ones (as measured by ROUGE and BLEU) decreases; this pattern extends to all but one model tested and exists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlaps, suggesting that LLM-as-a-judge in the summary domain should rely on techniques beyond a simple comparison.
https://arxiv.org/abs/2602.07673
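The overlap-bucketed analysis above can be sketched in a few lines: score each judged summary's unigram overlap against the human reference, then compute the LLM-preference rate per overlap bin. The bin edges and toy data are illustrative, not the paper's.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram ROUGE-1 F1, the kind of overlap axis used in the study."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def preference_by_overlap(judgements, edges=(0.0, 0.2, 0.4, 1.01)):
    """Bucket (overlap, llm_preferred) pairs by overlap and return the
    LLM-preference rate per bin; a rate that climbs toward low-overlap
    bins would reproduce the reported bias."""
    rates = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        bucket = [pref for ov, pref in judgements if lo <= ov < hi]
        if bucket:
            rates[(lo, hi)] = sum(bucket) / len(bucket)
    return rates

# Toy judgements: (overlap with human summary, judge preferred the LLM one)
rates = preference_by_overlap([(0.10, True), (0.15, True),
                               (0.30, True), (0.35, False),
                               (0.50, False)])
```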
Summarizing Indian legal court judgments is a complex task, not only due to the intricate language and unstructured nature of legal texts, but also because a large section of the Indian population does not understand the complex English in which legal text is written, thus requiring summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. Furthermore, these improvements are validated through domain experts, demonstrating the effectiveness of our approaches.
https://arxiv.org/abs/2602.07382
Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
https://arxiv.org/abs/2602.06960
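The iterative-reasoning loop that InftyThink+ optimizes can be sketched as a control loop: reason inside a bounded segment, then either emit an answer or a summary that seeds the next round, so the context never accumulates the full chain of thought. The stub model below stands in for the trained policy; when to stop and what to keep are exactly the decisions the paper learns with RL.

```python
def iterative_reason(model, question, max_iters=4, segment_budget=256):
    """Control loop of iterative reasoning. `model` is a stand-in
    callable returning (text, is_final); the summary it returns
    replaces the raw intermediate thoughts each round."""
    summary = ""
    for _ in range(max_iters):
        prompt = f"Question: {question}\nSummary so far: {summary}"
        text, is_final = model(prompt, segment_budget)
        if is_final:
            return text
        summary = text
    return summary            # iteration budget exhausted: best effort

# Stub model: two summarization rounds, then a final answer.
calls = []
def stub_model(prompt, budget):
    calls.append(prompt)
    if len(calls) < 3:
        return f"partial result {len(calls)}", False
    return "42", True

answer = iterative_reason(stub_model, "6*7?")
```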
Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages (English, Arabic, Hindi, and Turkish) and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (this https URL).
https://arxiv.org/abs/2602.06920
Large Language Model (LLM) inference is central to modern AI applications, making it critical to understand its energy footprint. Existing approaches typically estimate energy consumption through simple linear functions of input and output sequence lengths, yet our observations reveal clear energy-efficiency regimes: peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs, indicating a non-linear dependency. In this work, we propose an analytical model derived from the computational and memory-access complexity of the Transformer architecture, capable of accurately characterizing the efficiency curve as a function of input and output lengths. To assess its accuracy, we evaluate energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite, tested over input and output lengths from 64 to 4096 tokens, achieving a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency "sweet spots" can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems.
https://arxiv.org/abs/2602.05695
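The non-linear efficiency regimes above can be reproduced qualitatively with a toy Transformer cost model. The three terms and all coefficients below are invented for the sketch, not the paper's fitted values: quadratic prefill attention over the input, a fixed memory-bound cost per decode step, and a KV-cache read that grows with the running context.

```python
def energy_joules(n_in, n_out, prefill=2e-6, decode_fix=0.02, kv=1e-5):
    """Illustrative energy model (made-up coefficients): prefill
    attention scales with n_in^2; each of the n_out decode steps pays a
    fixed memory-bound cost plus a KV-cache read proportional to the
    context length at that step."""
    return (prefill * n_in ** 2
            + decode_fix * n_out
            + kv * n_out * (n_in + n_out / 2.0))

def efficiency(n_in, n_out):
    """Tokens generated per joule."""
    return n_out / energy_joules(n_in, n_out)

# The 'sweet spot' emerges: very short outputs cannot amortize prefill,
# very long outputs pay growing KV-cache reads, and long inputs pay
# quadratic prefill, so medium-length outputs on moderate inputs win.
```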
Individuals are increasingly generating substantial personal health and lifestyle data, e.g. through wearables and smartphones. While such data could transform preventative care, its integration into clinical practice is hindered by its scale, heterogeneity and the time pressure and data literacy of healthcare professionals (HCPs). We explore how large language models (LLMs) can support sensemaking of patient-generated health data (PGHD) with automated summaries and natural language data exploration. Using cardiovascular disease (CVD) risk reduction as a use case, 16 HCPs reviewed multimodal PGHD in a mixed-methods study with a prototype that integrated common charts, LLM-generated summaries, and a conversational interface. Findings show that AI summaries provided quick overviews that anchored exploration, while conversational interaction supported flexible analysis and bridged data-literacy gaps. However, HCPs raised concerns about transparency, privacy, and overreliance. We contribute empirical insights and sociotechnical design implications for integrating AI-driven summarization and conversation into clinical workflows to support PGHD sensemaking.
https://arxiv.org/abs/2602.05687
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from constrained context lengths and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
https://arxiv.org/abs/2602.05373
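The cache-shrinking effect of the SST can be illustrated with a toy stand-in: replace each speech interval's KV entries with a single summary entry. Mean-pooling below is purely illustrative; Speech-XL learns the compression through the SST's attention, not by pooling.

```python
import numpy as np

def sst_compress(kv, intervals):
    """Replace each interval's KV entries with one summary slot.
    `kv` has shape (T, d); `intervals` are non-overlapping (start, end)
    pairs in ascending order. Mean pooling stands in for the learned
    SST summarization."""
    keep, cursor = [], 0
    for s, e in intervals:
        keep.append(kv[cursor:s])                         # untouched span
        keep.append(kv[s:e].mean(axis=0, keepdims=True))  # one SST slot
        cursor = e
    keep.append(kv[cursor:])                              # tail
    return np.concatenate(keep, axis=0)

kv = np.arange(20, dtype=float).reshape(10, 2)  # 10 cached positions
out = sst_compress(kv, [(0, 4), (4, 8)])
# Two 4-position intervals collapse to 2 slots; 2 tail positions remain.
```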
With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate the quality of the generated answers and of the references provided to support the facts stated in them. Evaluation of text generated by LLMs remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, because expert assessment is required to verify consistency with the scientific literature and complex medical terminology. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and citations against the facts stated in the answers. The proposed BioACE framework considers multiple aspects, including completeness, correctness, precision, and recall, in relation to the ground-truth nuggets for answer evaluation. We developed automated approaches to evaluate each of the aforementioned aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI), pre-trained language models, and LLMs, to evaluate the quality of evidence provided to support the generated answers in the form of citations into the biomedical literature. With the detailed experiments and analysis, we provide the best approaches for biomedical answer and citation evaluation as part of the BioACE evaluation package (this https URL).
https://arxiv.org/abs/2602.04982
The introduction of large language models ignited a broad retooling and rethinking of software development models. The ensuing response of software engineering research yielded a massive body of tools and approaches. In this paper, we join this effort by introducing agentic AI solutions for two tasks. First, we developed a solution for automatic test scenario generation from a detailed requirements description. This approach relies on specialized worker agents forming a star topology with the supervisor agent in the middle. We demonstrate its capabilities on a real-world example. Second, we developed an agentic AI solution for the document retrieval task in the context of software engineering documents. Our solution enables performing various use cases on a body of documents related to the development of a single software product, including search, question answering, tracking changes, and large-document summarization. In this case, each use case is handled by a dedicated LLM-based agent, which performs all subtasks related to the corresponding use case. We conclude by hinting at future perspectives for our line of research.
https://arxiv.org/abs/2602.04726
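The star topology described above reduces to a supervisor that fans a task out to specialized workers and merges their results. The plain functions below are stand-ins for LLM-backed agents; the worker names and merge strategy are illustrative, not the paper's.

```python
def run_star(requirements, workers, merge):
    """Minimal star topology: the supervisor (this function) dispatches
    the requirements text to each specialized worker agent and merges
    their partial outputs into test scenarios."""
    partials = {name: worker(requirements) for name, worker in workers.items()}
    return merge(partials)

# Hypothetical worker agents for test scenario generation.
workers = {
    "happy_path": lambda req: [f"scenario: {req} succeeds"],
    "edge_cases": lambda req: [f"scenario: {req} with empty input"],
}
scenarios = run_star("login", workers,
                     lambda parts: sum(parts.values(), []))
```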
Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.
https://arxiv.org/abs/2602.03742
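The post-training quantization step mentioned above can be illustrated with a minimal symmetric int8 scheme. This is a generic sketch of the technique, not the paper's hardware-specific optimization pipeline.

```python
# Minimal sketch of symmetric post-training int8 quantization:
# map float weights to int8 via a per-tensor scale, then dequantize.
# This shows the size/precision trade-off that makes edge deployment
# feasible; real pipelines add per-channel scales and calibration.

def quantize_int8(weights):
    """Return (int8 values, scale) for a list of float weights."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -0.41, 0.05, -1.27, 0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half a quantization step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, round(scale, 4), round(max_err, 4))
```

Storing int8 values instead of float32 cuts the weight footprint by 4x, which is the kind of reduction that lets a segmentation model and a VLM share a resource-constrained edge device.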
As Generative AI (GenAI), particularly inference, rapidly emerges as a dominant workload category, the Kubernetes ecosystem is proactively evolving to natively support its unique demands. This industry paper demonstrates how emerging Kubernetes-native projects can be combined to deliver the benefits of container orchestration, such as scalability and resource efficiency, to complex AI workflows. We implement and evaluate an illustrative, multi-stage use case consisting of automatic speech recognition and summarization. First, we address batch inference by using Kueue to manage jobs that transcribe audio files with Whisper models and Dynamic Accelerator Slicer (DAS) to increase parallel job execution. Second, we address a discrete online inference scenario by feeding the transcripts to a Large Language Model for summarization hosted using llm-d, a novel solution utilizing the recent developments around the Kubernetes Gateway API Inference Extension (GAIE) for optimized routing of inference requests. Our findings illustrate that these complementary components (Kueue, DAS, and GAIE) form a cohesive, high-performance platform, proving Kubernetes' capability to serve as a unified foundation for demanding GenAI workloads: Kueue reduced total makespan by up to 15%; DAS shortened mean job completion time by 36%; and GAIE improved Time to First Token by 82%.
https://arxiv.org/abs/2602.04900
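Kueue's core job — admitting queued batch jobs against a shared resource quota so the cluster stays busy without oversubscription — can be caricatured in a few lines. The quota model and job names below are illustrative, not taken from the paper's configuration.

```python
# Toy sketch of quota-based job admission, in the spirit of Kueue:
# jobs wait in a queue and are admitted only while GPU quota remains;
# the rest stay suspended until capacity frees up.
from collections import deque

def admit_jobs(jobs, gpu_quota):
    """jobs: list of (name, gpus_needed). Returns (admitted, suspended)."""
    queue = deque(jobs)
    admitted, suspended = [], []
    used = 0
    while queue:
        name, gpus = queue.popleft()
        if used + gpus <= gpu_quota:
            used += gpus
            admitted.append(name)   # starts running immediately
        else:
            suspended.append(name)  # waits for the next admission cycle
    return admitted, suspended

# Hypothetical Whisper transcription jobs competing for 4 GPUs.
jobs = [("whisper-batch-1", 2), ("whisper-batch-2", 2), ("whisper-batch-3", 1)]
admitted, suspended = admit_jobs(jobs, gpu_quota=4)
print(admitted, suspended)
```

In the actual system, DAS additionally slices accelerators so that more small jobs fit inside the same quota, which is what drives the reported reduction in mean job completion time.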
Hallucinations remain a major obstacle for large language models (LLMs), especially in safety-critical domains. We present HALT (Hallucination Assessment via Log-probs as Time series), a lightweight hallucination detector that leverages only the top-20 token log-probabilities from LLM generations as a time series. HALT uses a gated recurrent unit model combined with entropy-based features to learn model calibration bias, providing an extremely efficient alternative to large encoders. Unlike white-box approaches, HALT does not require access to hidden states or attention maps, relying only on output log-probabilities. Unlike black-box approaches, it operates on log-probs rather than surface-form text, which enables stronger domain generalization and compatibility with proprietary LLMs without requiring access to internal weights. To benchmark performance, we introduce HUB (Hallucination detection Unified Benchmark), which consolidates prior datasets into ten capabilities covering both reasoning tasks (Algorithmic, Commonsense, Mathematical, Symbolic, Code Generation) and general-purpose skills (Chat, Data-to-Text, Question Answering, Summarization, World Knowledge). While being 30x smaller, HALT outperforms Lettuce, a fine-tuned modernBERT-base encoder, achieving a 60x speedup on HUB. HALT and HUB together establish an effective framework for hallucination detection across diverse LLM capabilities.
https://arxiv.org/abs/2602.02888
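The entropy-based features HALT derives from the per-token top-20 log-probabilities can be sketched as follows. The exact feature set in the paper may differ; treat this as an assumed minimal version of the idea — a per-step uncertainty signal that a sequence model (the paper uses a GRU) would then consume.

```python
# Minimal sketch of entropy features over a log-prob time series:
# for each generated token we have its top-k token log-probabilities;
# we renormalize them over the top-k and compute a Shannon entropy
# per step, turning the generation into a 1-D uncertainty series.
import math

def entropy_series(top_logprobs):
    """top_logprobs: list of per-step lists of top-k log-probs (natural log)."""
    series = []
    for step in top_logprobs:
        probs = [math.exp(lp) for lp in step]
        total = sum(probs)                      # renormalize over the top-k
        probs = [p / total for p in probs]
        h = -sum(p * math.log(p) for p in probs if p > 0)
        series.append(h)
    return series

# Two illustrative steps: a confident token vs. a near-uniform one.
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
uncertain = [math.log(1 / 3)] * 3
h = entropy_series([confident, uncertain])
print([round(v, 3) for v in h])  # low entropy first, high entropy second
```

Intuitively, sustained runs of high-entropy steps mark places where the model was guessing, which is exactly the calibration signal a recurrent detector can learn to associate with hallucinated spans.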