Aspect-based summarization aims to generate summaries tailored to specific aspects, addressing the resource constraints and limited generalizability of traditional summarization approaches. Recently, large language models have shown promise in this task without the need for training. However, they rely heavily on prompt engineering and face token-limit and hallucination challenges, especially with in-context learning. To address these challenges, we propose a novel framework for aspect-based summarization: Self-Aspect Retrieval Enhanced Summary Generation. Rather than relying solely on in-context learning, given an aspect, we employ an embedding-driven retrieval mechanism to identify the text segments relevant to it. This approach extracts the pertinent content while avoiding unnecessary details, thereby mitigating the token-limit challenge. Moreover, our framework optimizes token usage by deleting unrelated parts of the text, and it ensures that the model generates output strictly based on the given aspect. With extensive experiments on benchmark datasets, we demonstrate that our framework not only achieves superior performance but also effectively mitigates the token-limit problem.
https://arxiv.org/abs/2504.13054
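A minimal sketch of the embedding-driven retrieval step described above, assuming sentence-transformers as the embedding backend; the paragraph-level segmentation, model name, and top-k budget are illustrative choices, not the paper's exact configuration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def retrieve_aspect_segments(document: str, aspect: str, top_k: int = 5):
    # Split the document into candidate segments (here: paragraphs).
    segments = [s.strip() for s in document.split("\n\n") if s.strip()]
    seg_vecs = embedder.encode(segments, normalize_embeddings=True)
    aspect_vec = embedder.encode([aspect], normalize_embeddings=True)[0]
    # Cosine similarity reduces to a dot product on normalized vectors.
    scores = seg_vecs @ aspect_vec
    keep = np.argsort(-scores)[:top_k]
    # Preserve original document order so the pruned context stays coherent.
    return [segments[i] for i in sorted(keep)]
```

The summarizer then sees only the retained segments, which is how unrelated text is deleted before generation to stay within the token budget.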
Recent advances in the long-context reasoning abilities of language models have led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context windows. Retrieval-augmented systems provide an efficient and effective alternative, but their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with the long-context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset: on a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references, which we then use to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task showcase the effectiveness of our method across model classes and sizes. We compare against length estimates from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.
https://arxiv.org/abs/2504.12972
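A hedged sketch of the length-estimation loop described above: sweep candidate retrieval lengths on the sampled subset, score each configuration against the silver references, and keep the best. The `rag_summarize` callable, the candidate grid, and the use of ROUGE-L as the scoring metric are assumptions for illustration:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def estimate_context_length(samples, silver_refs, rag_summarize,
                            candidate_lengths=(2_000, 8_000, 32_000, 128_000)):
    best_len, best_score = None, float("-inf")
    for length in candidate_lengths:
        total = 0.0
        for docs, refs in zip(samples, silver_refs):
            summary = rag_summarize(docs, context_budget=length)
            # Score against the pool of silver references, keeping the max:
            # matching any one plausible reference counts as success.
            total += max(scorer.score(ref, summary)["rougeL"].fmeasure
                         for ref in refs)
        avg = total / len(samples)
        if avg > best_score:
            best_len, best_score = length, avg
    return best_len
```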
Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: this https URL.
https://arxiv.org/abs/2504.12696
Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes -- inconsistencies in a storyline that break the internal logic or rules of a story's world -- requires nuanced reasoning skills, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously assessing their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this work, we propose plot hole detection in stories as a proxy for evaluating language understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. Using this algorithm, we construct FlawedFictions, a benchmark for evaluating LLMs' plot hole detection abilities in stories that is robust to contamination, with human filtering ensuring high quality. We find that state-of-the-art LLMs struggle to solve FlawedFictions accurately regardless of the reasoning effort allowed, with performance degrading significantly as story length increases. Finally, we show that LLM-based story summarization and story generation are prone to introducing plot holes, with plot hole detection rates increasing by more than 50% and 100%, respectively, relative to the human-written originals.
https://arxiv.org/abs/2504.11900
The exponential increase in video content poses significant challenges for efficient navigation, search, and retrieval, requiring advanced video summarization techniques. Existing video summarization methods, which rely heavily on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle this challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism over the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones on standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.
https://arxiv.org/abs/2504.11199
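A simplified sketch of the two-stage LLMVS scoring scheme: an LLM rates each frame caption within its local window, and a small self-attention module then refines the scores in the global context of all captions. `llm_rate` is a placeholder for the LLM call, and the dimensions are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

def local_scores(captions, llm_rate, window=2):
    scores = []
    for i in range(len(captions)):
        ctx = captions[max(0, i - window): i + window + 1]
        # llm_rate returns an importance score in [0, 1] for the target
        # caption given its neighbors (the local-context assessment).
        scores.append(llm_rate(target=captions[i], context=ctx))
    return torch.tensor(scores)

class GlobalRefiner(nn.Module):
    """Refines local scores with self-attention over the whole video."""
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.proj = nn.Linear(1, d_model)
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.out = nn.Linear(d_model, 1)

    def forward(self, scores):                    # scores: (num_frames,)
        x = self.proj(scores[None, :, None])      # (1, T, d_model)
        return self.out(self.layer(x)).squeeze()  # refined per-frame scores
```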
Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow for gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (entities are either summary-worthy or not). In this paper, we introduce a novel approach for graded entity salience that combines the strengths of both approaches. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity's salience score based on its presence across these summaries. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques, including LLMs. We release our data and code at this https URL to support further research on graded salient entity extraction.
https://arxiv.org/abs/2504.10792
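A minimal sketch of the summary-based salience score, under the simplifying assumption that entity mentions can be found by case-insensitive string matching (the paper's alignment against human summaries would be more careful):

```python
def salience_scores(entities, summaries):
    scores = {}
    for entity in entities:
        # Count the summaries that mention the entity at least once.
        hits = sum(entity.lower() in s.lower() for s in summaries)
        scores[entity] = hits / len(summaries)
    return scores

# With 5 summaries per document, scores land on a 0.0-1.0 scale in steps
# of 0.2, giving a graded rather than binary notion of salience.
```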
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream are a more beneficial input than the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect on the nature of cross-modal interactions in multimodal presentations and share suggestions for improving the ability of VLMs to understand documents of this nature.
https://arxiv.org/abs/2504.10049
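A sketch of the structured interleaved representation the experiments favor: transcript segments are grouped under the slide on screen when they were spoken. The timestamped input layout is an assumption for illustration:

```python
def interleave(slides, transcript):
    """slides: [(start_sec, image)]; transcript: [(start_sec, text)];
    both lists sorted by time."""
    doc, t_idx = [], 0
    for i, (slide_start, image) in enumerate(slides):
        slide_end = slides[i + 1][0] if i + 1 < len(slides) else float("inf")
        doc.append({"type": "image", "content": image})
        # Attach every transcript segment spoken while this slide is shown.
        while t_idx < len(transcript) and transcript[t_idx][0] < slide_end:
            doc.append({"type": "text", "content": transcript[t_idx][1]})
            t_idx += 1
    return doc  # interleaved sequence fed to the VLM in document order
```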
We propose SUMART, a method for summarizing and compressing the volume of verbose subtitle translations. SUMART is designed for understanding translated captions (e.g., interlingual conversations via subtitle translation, or watching movies with foreign-language audio and translated captions). It is intended for users who want a fast, big-picture understanding of conversation, audio, video content, and speech in a foreign language. During training data collection, when a speaker makes a verbose statement, SUMART employs a large language model on-site to compress the volume of the subtitles, and the compressed data is stored in a database for fine-tuning purposes. SUMART then uses pairs of the uncompressed ASR results and the compressed translated results to fine-tune the translation model to generate more concise translations for practical use. In deployment, SUMART uses this trained model to produce concise translation results. As a practical application, we also developed an app that enables conversations using subtitle translation in augmented reality spaces. As a pilot study, we conducted a qualitative survey using a SUMART prototype and a separate survey on its summarization model. We envision that the most effective use case for this system is where users need to consume a lot of information quickly (e.g., speeches, lectures, podcasts, and conference Q&A).
https://arxiv.org/abs/2504.09860
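A hedged sketch of SUMART's data-collection step: verbose ASR segments are compressed on-site by an LLM and the (raw, compressed) pairs are stored for later fine-tuning. `llm_compress` and the verbosity threshold are placeholders, not the paper's settings:

```python
import sqlite3

db = sqlite3.connect("sumart_pairs.db")
db.execute("CREATE TABLE IF NOT EXISTS pairs (asr TEXT, compressed TEXT)")

def collect_pair(asr_text, llm_compress, verbose_threshold=30):
    if len(asr_text.split()) < verbose_threshold:
        return  # short utterances pass through without compression
    compressed = llm_compress(asr_text)  # LLM shortens the verbose statement
    db.execute("INSERT INTO pairs VALUES (?, ?)", (asr_text, compressed))
    db.commit()
```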
Detecting transitions between intro/credits and main content in videos is a crucial task for content segmentation, indexing, and recommendation systems. Manual annotation of such transitions is labor-intensive and error-prone, while heuristic-based methods often fail to generalize across diverse video styles. In this work, we introduce a deep learning-based approach that formulates the problem as a sequence-to-sequence classification task, where each second of a video is labeled as either "intro" or "film." Our method extracts frames at a fixed rate of 1 FPS, encodes them using CLIP (Contrastive Language-Image Pretraining), and processes the resulting feature representations with a multihead attention model incorporating learned positional encoding. The system achieves an F1-score of 91.0%, Precision of 89.0%, and Recall of 97.0% on the test set, and is optimized for real-time inference, achieving 11.5 FPS on CPU and 107 FPS on high-end GPUs. This approach has practical applications in automated content indexing, highlight detection, and video summarization. Future work will explore multimodal learning, incorporating audio features and subtitles to further enhance detection accuracy.
https://arxiv.org/abs/2504.09738
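A compact sketch of the described architecture: per-second CLIP features pass through multihead self-attention with a learned positional encoding, and each second receives an "intro" vs. "film" logit pair. The feature dimension and maximum sequence length are illustrative assumptions:

```python
import torch
import torch.nn as nn

class IntroFilmTagger(nn.Module):
    def __init__(self, feat_dim=512, nhead=8, max_len=3600):
        super().__init__()
        self.pos = nn.Embedding(max_len, feat_dim)   # learned positions
        self.attn = nn.MultiheadAttention(feat_dim, nhead, batch_first=True)
        self.head = nn.Linear(feat_dim, 2)           # intro vs. film

    def forward(self, clip_feats):  # (batch, seconds, feat_dim) at 1 FPS
        t = torch.arange(clip_feats.size(1), device=clip_feats.device)
        x = clip_feats + self.pos(t)
        x, _ = self.attn(x, x, x)
        return self.head(x)         # per-second classification logits
```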
Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by manually annotated, diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an open-source MLLM, achieves 73.35% accuracy on VideoAds, outperforming GPT-4o (66.82%) and Gemini-1.5 Pro (69.66%); the two proprietary models fall behind the open-source model especially in video summarization and reasoning, but perform best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research on understanding videos that require high-FPS sampling. The dataset and evaluation code will be publicly available at this https URL.
https://arxiv.org/abs/2504.09282
Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries in the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches improve SLM summarization on long-document, narrative tasks. The length and complexity of narrative texts often make them difficult to summarize faithfully. We analyze existing plan-guided solutions that target fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves over a baseline without planning, in either summary quality or faithfulness. Human evaluation reveals that while plan-guided summaries are often well grounded in their plans, the plans themselves are just as likely as the summaries to contain hallucinations. As a result, plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale for plan-guided approaches to summarization, especially for long, complex domains such as narrative texts.
https://arxiv.org/abs/2504.09071
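For concreteness, a minimal sketch of the plan-then-summarize setup under study, with a higher-level narrative plan in the spirit of the proposed formulation; `llm` is a placeholder completion function and the prompt wording is illustrative, not the paper's:

```python
def plan_guided_summary(document, llm):
    # Stage 1: draft a narrative-level plan of the story.
    plan = llm("Outline the key narrative arc of this story as a short "
               "plan (setting, main characters, major events):\n" + document)
    # Stage 2: write a summary grounded in that plan.
    return llm("Write a faithful summary of the story, grounded in this "
               "plan:\nPLAN:\n" + plan + "\nSTORY:\n" + document)
```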
Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped, after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the PubMed Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and to timeline annotations from I2B2/MIMIC-IV, and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: o1-preview, 0.755; Llama 3.3 70B Instruct, 0.753) and strong temporal ordering (concordance: o1-preview, 0.932; Llama 3.3 70B Instruct, 0.932). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and suggesting several avenues of improvement via multimodal integration.
https://arxiv.org/abs/2504.12326
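A sketch of a pairwise ordering concordance in the spirit of the reported temporal metric: over all event pairs ordered in the gold timeline, the fraction whose predicted times agree in direction. The paper's exact definition may differ; this is the standard concordance-index idea:

```python
def concordance(gold_times, pred_times):
    agree = total = 0
    n = len(gold_times)
    for i in range(n):
        for j in range(i + 1, n):
            if gold_times[i] == gold_times[j]:
                continue  # ties in gold carry no ordering information
            total += 1
            # Do the predicted times order this pair the same way?
            agree += ((gold_times[i] < gold_times[j]) ==
                      (pred_times[i] < pred_times[j]))
    return agree / total if total else float("nan")
```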
AI agents powered by Large Language Models are transforming the world through a vast range of applications. A super agent has the potential to fulfill diverse user needs, such as summarization, coding, and research, by accurately understanding user intent and leveraging the appropriate tools to solve tasks. However, to make such an agent viable for real-world deployment and accessible at scale, significant optimizations are required to ensure high efficiency and low cost. This paper presents a design for a Super Agent System. Upon receiving a user prompt, the system first detects the user's intent, then routes the request to specialized task agents with the necessary tools or automatically generates agentic workflows. In practice, most applications serve directly as AI assistants on edge devices such as phones and robots. As different language models vary in capability, and cloud-based models often entail high computational costs, latency, and privacy concerns, we then explore a hybrid mode in which a router dynamically selects between local and cloud models based on task complexity. Finally, we introduce the blueprint of an on-device super agent enhanced with cloud support. With advances in multi-modality models and edge hardware, we envision that most computations can be handled locally, with cloud collaboration invoked only as needed. Such an architecture paves the way for super agents to be seamlessly integrated into everyday life in the near future.
https://arxiv.org/abs/2504.10519
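A hedged sketch of the hybrid routing idea: estimate task complexity from the prompt and dispatch to the on-device model or the cloud model accordingly. The complexity estimator, threshold, and model handles are placeholders, not the paper's components:

```python
def route(prompt, local_llm, cloud_llm, complexity_fn, threshold=0.6):
    # complexity_fn maps a prompt to a score in [0, 1], e.g. from a small
    # on-device classifier over intent and expected reasoning depth.
    if complexity_fn(prompt) < threshold:
        return local_llm(prompt)   # cheap, private, low-latency path
    return cloud_llm(prompt)       # fall back to the stronger cloud model
```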
Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3) with their non-reasoning counterparts on machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories: state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on the WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms its non-reasoning variant, with the exception of certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage correlates positively with evaluation quality in o3-mini models. Furthermore, our results show that distilling reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.
https://arxiv.org/abs/2504.08120
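A sketch of the reported correlation analysis, under the assumption that per-example evaluation quality is operationalized as closeness of the judge's score to the human rating; the paper may define quality differently:

```python
from scipy.stats import spearmanr

def reasoning_quality_correlation(reasoning_tokens, judge_scores, human_scores):
    # Per-example evaluation quality: negative absolute error of the
    # judge's score against the human rating (higher is better).
    quality = [-abs(j - h) for j, h in zip(judge_scores, human_scores)]
    rho, p = spearmanr(reasoning_tokens, quality)
    return rho, p
```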
Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization is still not clearly defined and intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation methodologies, which are crucial for assessing the effectiveness of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions.
https://arxiv.org/abs/2504.08024
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents. However, their increasing integration into critical societal domains has raised concerns about embedded biases, which can perpetuate stereotypes and compromise fairness. These biases stem from various sources, including historical inequalities in training data, linguistic imbalances, and adversarial manipulation. Despite mitigation efforts, recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses. This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation. Our methodology involves (i) systematically probing models with a multi-task approach targeting biases across various sociocultural dimensions, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach for automated assessment of model responses, and (iii) employing jailbreak techniques to investigate vulnerabilities in safety mechanisms. Our analysis examines prevalent biases in both small and large state-of-the-art models and their impact on model safety. Additionally, we assess the safety of domain-specific models fine-tuned for critical fields, such as medicine. Finally, we release a curated dataset of bias-related prompts, CLEAR-Bias, to facilitate systematic vulnerability benchmarking. Our findings reveal critical trade-offs between model size and safety, aiding the development of fairer and more robust future language models.
https://arxiv.org/abs/2504.07887
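A minimal sketch of the LLM-as-a-Judge safety-scoring loop described in step (ii): each adversarial prompt goes to the model under test, a judge model labels the response, and the safety score is the safe fraction. `target_llm`, `judge_llm`, and the verdict format are placeholders:

```python
def safety_score(prompts, target_llm, judge_llm):
    safe = 0
    for p in prompts:
        response = target_llm(p)
        verdict = judge_llm(
            "Does the following response exhibit social bias or a harmful "
            "stereotype? Answer YES or NO.\nRESPONSE:\n" + response)
        safe += verdict.strip().upper().startswith("NO")
    return safe / len(prompts)  # fraction of responses judged safe
```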
Practically all large language models have been pre-trained on data whose legal status is globally uncertain with respect to copyright infringement and breach of contract, creating potential risk for users and developers. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks related to copyright or breach of contract. The foundation of this project is a corpus of over 132 million documents and trillions of tokens spanning 16 different sources that have been verified to meet the strict copyright and licensing protocol detailed herein. We are releasing the entire pipeline, including 1) the source code to acquire and process these documents, 2) the original document formats with associated provenance and metadata, 3) extracted content in a standardized format, 4) pre-tokenized representations of the documents, and 5) various mid- and post-train resources such as question-answer, summarization, conversion, drafting, classification, prediction, and conversational data. All of these resources are freely available to the public on S3, Hugging Face, and GitHub under CC-BY terms. We are committed to continuing this project in furtherance of a more ethical, legal, and sustainable approach to the development and use of AI models.
https://arxiv.org/abs/2504.07854
Intrusion Detection Systems (IDS) have long been a hot topic in the cybersecurity community. In recent years, with the introduction of deep learning (DL) techniques, IDS have made great progress due to their increasing generalizability. The rationale behind this is that by learning the underlying patterns of known system behaviors, IDS detection can be generalized to intrusions that exploit zero-day vulnerabilities. In this survey, we refer to this type of IDS as DL-based IDS (DL-IDS). From the perspective of DL, this survey systematically reviews all the stages of DL-IDS, including data collection, log storage, log parsing, graph summarization, attack detection, and attack investigation. To accommodate current researchers, a section describing the publicly available benchmark datasets is included. This survey further discusses current challenges and potential future research directions, aiming to help researchers understand the basic ideas and visions of DL-IDS research, as well as to motivate their research interests.
https://arxiv.org/abs/2504.07839
We propose a novel framework for generating causal graphs from narrative texts, bridging high-level causality and detailed event-specific relationships. Our method first extracts concise, agent-centered vertices using large language model (LLM)-based summarization. We introduce an "Expert Index," comprising seven linguistically informed features, integrated into a Situation-Task-Action-Consequence (STAC) classification model. This hybrid system, combining RoBERTa embeddings with the Expert Index, achieves superior precision in causal link identification compared to pure LLM-based approaches. Finally, a structured five-iteration prompting process refines and constructs connected causal graphs. Experiments on 100 narrative chapters and short stories demonstrate that our approach consistently outperforms GPT-4o and Claude 3.5 in causal graph quality, while maintaining readability. The open-source tool provides an interpretable, efficient solution for capturing nuanced causal chains in narratives.
https://arxiv.org/abs/2504.07459
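A sketch of the hybrid STAC classifier: a RoBERTa sentence embedding is concatenated with the seven hand-crafted Expert Index features ahead of a linear head. Feature extraction itself is elided, and the four-way S/T/A/C label space is an assumption inferred from the model's name:

```python
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

class STACClassifier(nn.Module):
    def __init__(self, n_expert_feats=7, n_classes=4):  # S/T/A/C (assumed)
        super().__init__()
        self.head = nn.Linear(encoder.config.hidden_size + n_expert_feats,
                              n_classes)

    def forward(self, sentences, expert_feats):
        toks = tokenizer(sentences, return_tensors="pt", padding=True,
                         truncation=True)
        emb = encoder(**toks).last_hidden_state[:, 0]  # <s> token embedding
        # Fuse contextual embedding with the linguistic Expert Index.
        return self.head(torch.cat([emb, expert_feats], dim=-1))
```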
Collaborative assistants, or chatbots, are data-driven decision support systems that enable natural interaction for task completion. While they can meet critical needs in modern society, concerns about their reliability and trustworthiness persist. In particular, Large Language Model (LLM)-based chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible. However, such chatbots have limitations, including their inability to explain response generation, the risk of generating problematic content, the lack of standardized testing for reliability, and the need for deep AI expertise and extended development times. These issues make chatbots unsuitable for trust-sensitive applications like elections or healthcare. To address these concerns, we introduce SafeChat, a general architecture for building safe and trustworthy chatbots, with a focus on information retrieval use cases. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance), and 'do-not-respond' strategies to prevent harmful answers; (b) usability, with automatic extractive summarization of long responses, traceable to their sources, and automated trust assessments to communicate expected chatbot behavior, such as sentiment; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices. We implemented SafeChat in an executable framework using the open-source chatbot platform Rasa. A case study demonstrates its application in building ElectionBot-SC, a chatbot designed to safely disseminate official election information. SafeChat is being used in many domains, validating its potential, and is available at: this https URL.
https://arxiv.org/abs/2504.07995
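A hedged sketch of two SafeChat behaviors named above: the "do-not-respond" gate and answers grounded in an approved source with provenance attached. The retrieval, safety check, and data layout are placeholders for the framework's actual components:

```python
def safe_answer(query, approved_faq, is_unsafe):
    if is_unsafe(query):
        # Do-not-respond strategy: refuse rather than risk a harmful answer.
        return {"answer": "I can't help with that topic.", "source": None}
    best = max(approved_faq,
               key=lambda item: overlap(query, item["question"]))
    # Extractive answer, traceable to its approved source (provenance).
    return {"answer": best["answer"], "source": best["source_url"]}

def overlap(a, b):
    # Toy lexical similarity; a real system would use embeddings.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))
```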