Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as Llama-3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise with a dedicated IndicBERT-based comment classifier, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.
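As a rough illustration of the four input configurations, the sketch below assembles a generation prompt from an article, classifier-filtered comments, and an image description. The `is_supportive` predicate stands in for the IndicBERT comment classifier and the caption for the CLIP-derived image signal; both interfaces are assumptions, not the paper's actual code.

```python
from typing import Callable, List, Optional

def build_prompt(article: str,
                 comments: Optional[List[str]] = None,
                 image_caption: Optional[str] = None,
                 is_supportive: Callable[[str], bool] = lambda c: True) -> str:
    """Assemble one of the four configurations: article only, +comments,
    +image, or article+comments+image."""
    parts = [f"Article:\n{article}"]
    if comments:
        # Keep only comments the (IndicBERT-style) classifier marks supportive.
        kept = [c for c in comments if is_supportive(c)]
        if kept:
            parts.append("Reader comments:\n" + "\n".join(f"- {c}" for c in kept))
    if image_caption:
        # Image content enters as text, e.g. a CLIP-derived description.
        parts.append(f"Image description:\n{image_caption}")
    parts.append("Write a concise summary of the article.")
    return "\n\n".join(parts)

print(build_prompt("Monsoon rains flooded low-lying areas...",
                   comments=["Roads near the river are closed.", "First!!!"],
                   image_caption="Submerged street with stranded buses",
                   is_supportive=lambda c: len(c.split()) > 3))
```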
https://arxiv.org/abs/2506.15372
Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust. Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability. To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable "code agent" architectures. Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program's specified instructions to produce the final response. Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task. We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions. In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality.
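A minimal sketch of the plan-then-execute decomposition, with hand-written stand-ins for the program plan (which the paper derives from an LLM) and for the text-operation modules; the `plan` schema and module names are illustrative assumptions.

```python
# Stage 1 would come from an LLM; here the plan is hand-written. Each
# operation names its module and the source sentences it cites, which is
# what makes the attribution explicit rather than post-hoc.

SOURCES = {
    "d1s1": "The reactor was shut down on Monday after a coolant leak.",
    "d1s2": "Officials said repairs are expected to take three weeks.",
}

plan = [
    {"op": "compression", "inputs": ["d1s1"]},
    {"op": "fusion", "inputs": ["d1s1", "d1s2"]},
]

def compression(sents):  # toy stand-in for a learned compression module
    return sents[0].split(" after ")[0] + "."

def fusion(sents):       # toy stand-in for a learned fusion module
    return " ".join(sents)

MODULES = {"compression": compression, "fusion": fusion}

def execute(plan, sources):
    out = []
    for step in plan:
        sents = [sources[i] for i in step["inputs"]]
        text = MODULES[step["op"]](sents)
        out.append((text, step["inputs"]))  # response plus its attribution
    return out

for text, attrib in execute(plan, SOURCES):
    print(text, "->", attrib)
```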
https://arxiv.org/abs/2506.14580
Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
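The instability the paper measures can be reproduced in miniature: score two candidate summaries against every two-reference subset of a small pool and watch the ranking move. The toy ROUGE-1 below follows the standard unigram-overlap F1 definition; the texts are invented.

```python
from collections import Counter
from itertools import combinations

def rouge1_f(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if not overlap:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def score(candidate, refs):
    # common multi-reference convention: take the best-matching reference
    return max(rouge1_f(candidate, ref) for ref in refs)

references = [
    "the storm closed schools across the region",
    "schools shut down as the storm hit",
    "officials cancelled classes because of severe weather",
]
systems = {
    "A": "the storm closed schools in the region",
    "B": "classes were cancelled due to severe weather",
}

# Rank the two systems under every 2-reference subset: the winner flips
# depending on which references happen to be chosen.
for subset in combinations(references, 2):
    ranking = sorted(systems, key=lambda s: -score(systems[s], subset))
    print(ranking, [round(score(systems[s], subset), 3) for s in ranking])
```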
https://arxiv.org/abs/2506.14335
We extend the framework of Serialized Output Training (SOT) to address practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements. We propose several key improvements: (1) Leveraging Continuous Speech Separation (CSS) single-channel front-end with end-to-end (E2E) systems for highly overlapping scenarios, challenging the conventional wisdom of E2E versus cascaded setups. The CSS framework improves the accuracy of the ASR system by separating overlapped speech from multiple speakers. (2) Implementing dual models -- Conformer Transducer for streaming and Sequence-to-Sequence for offline -- or alternatively, a two-pass model based on cascaded encoders. (3) Exploring segment-based SOT (segSOT) which is better suited for offline scenarios while also enhancing readability of multi-talker transcriptions.
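One way to picture segment-based serialization: rather than interleaving speakers at the token level, whole speaker turns are ordered by start time and joined with a speaker-change token. A minimal sketch under that reading, with an illustrative `<sc>` token:

```python
SC = "<sc>"  # speaker-change token (placeholder name)

def serialize_segments(segments):
    """segments: list of (start_sec, speaker_id, text)."""
    ordered = sorted(segments, key=lambda s: s[0])
    parts, prev_spk = [], None
    for start, spk, text in ordered:
        if prev_spk is not None and spk != prev_spk:
            parts.append(SC)
        parts.append(text)
        prev_spk = spk
    return " ".join(parts)

print(serialize_segments([
    (0.0, "spk1", "so how was the demo"),
    (2.1, "spk2", "it went well overall"),
    (4.3, "spk1", "great"),
]))
```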
https://arxiv.org/abs/2506.14204
The Achilles heel of Large Language Models (LLMs) is hallucination, which has drastic consequences for the clinical domain. This is particularly important with regards to automatically generating discharge summaries (a lengthy medical document that summarizes a hospital in-patient visit). Automatically generating these summaries would free physicians to care for patients and reduce documentation burden. The goal of this work is to discover new methods that combine language-based graphs and deep learning models to address provenance of content and trustworthiness in automatic summarization. Our method shows impressive reliability results on the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) corpus and clinical notes written by physicians at Anonymous Hospital. We provide our method, generated discharge summary output examples, source code, and trained models.
https://arxiv.org/abs/2506.14101
The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. These advancements have extended into the domain of code, facilitating complex tasks such as code generation, translation, summarization, and repair. However, their utility for real-world, in-the-wild deployment has only recently been studied, particularly on software engineering (SWE) tasks such as GitHub issue resolution. In this study, we examine the code reasoning techniques that underlie the ability to perform such tasks, and survey the paradigms used to drive their performance. Our contributions in this paper are: (1) the first dedicated survey on code reasoning for code tasks, highlighting overarching strategies, hybrid and agentic approaches; (2) a taxonomy of various techniques used to drive code reasoning; (3) a comprehensive overview of performance on common benchmarks and a showcase of new, under-explored benchmarks with high potential in SWE; (4) an exploration on how core properties of code can be used to explain different reasoning techniques; and (5) gaps and potentially under-explored areas for future research.
https://arxiv.org/abs/2506.13932
This study explores the extent to which national music preferences reflect underlying cultural values. We collected long-term popular music data from YouTube Music Charts across 62 countries, encompassing both Western and non-Western regions, and extracted audio embeddings using the CLAP model. To complement these quantitative representations, we generated semantic captions for each track using LP-MusicCaps and GPT-based summarization. Countries were clustered based on contrastive embeddings that highlight deviations from global musical norms. The resulting clusters were projected into a two-dimensional space via t-SNE for visualization and evaluated against cultural zones defined by the World Values Survey (WVS). Statistical analyses, including MANOVA and chi-squared tests, confirmed that music-based clusters exhibit significant alignment with established cultural groupings. Furthermore, residual analysis revealed consistent patterns of overrepresentation, suggesting non-random associations between specific clusters and cultural zones. These findings indicate that national-level music preferences encode meaningful cultural signals and can serve as a proxy for understanding global cultural boundaries.
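The analysis pipeline is conventional enough to sketch end to end with stand-in data: cluster country-level embeddings, cross-tabulate clusters against cultural-zone labels, and test the association. The embeddings and zone labels below are random placeholders, not the paper's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Stand-ins for contrastive audio embeddings of 62 countries (CLAP-based
# in the paper) and WVS-style cultural-zone labels.
embeddings = rng.normal(size=(62, 16))
zones = rng.integers(0, 4, size=62)

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

# Contingency table: music clusters vs. cultural zones.
table = np.zeros((4, 4), dtype=int)
for c, z in zip(clusters, zones):
    table[c, z] += 1

chi2, p, dof, expected = chi2_contingency(table)
residuals = (table - expected) / np.sqrt(expected)  # over-/under-representation
print(f"chi2={chi2:.2f}, p={p:.3f}")
print(np.round(residuals, 2))
```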
https://arxiv.org/abs/2506.13199
Media outlets are becoming more partisan and polarized nowadays. Most previous work focused on detecting media bias. In this paper, we aim to mitigate media bias by generating a neutralized summary given multiple articles presenting different ideological views. Motivated by the critical role of events and event relations in media bias detection, we propose to increase awareness of bias in LLMs via multi-document events reasoning and use a multi-document event relation graph to guide the summarization process. This graph contains rich event information useful to reveal bias: four common types of in-doc event relations to reflect content framing bias, cross-doc event coreference relation to reveal content selection bias, and event-level moral opinions to highlight opinionated framing bias. We further develop two strategies to incorporate the multi-document event relation graph for neutralized summarization. Firstly, we convert a graph into natural language descriptions and feed the textualized graph into LLMs as a part of a hard text prompt. Secondly, we encode the graph with graph attention network and insert the graph embedding into LLMs as a soft prompt. Both automatic evaluation and human evaluation confirm that our approach effectively mitigates both lexical and informational media bias, and meanwhile improves content preservation.
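The first strategy (textualizing the graph into a hard prompt) is straightforward to sketch. The relation and opinion labels below follow the abstract's categories, but the graph and the verbalization templates are invented.

```python
# Toy event-relation graph: cross-doc coreference plus an in-doc relation,
# with one event-level moral opinion attached.
edges = [
    ("protest_ev", "coreference", "march_ev"),
    ("arrest_ev", "temporal_after", "protest_ev"),
]
moral_opinions = {"arrest_ev": "portrayed negatively in outlet B"}

def verbalize(edges, opinions):
    lines = []
    for head, rel, tail in edges:
        lines.append(f"Event '{head}' has relation '{rel}' with event '{tail}'.")
    for ev, op in opinions.items():
        lines.append(f"Event '{ev}' is {op}.")
    return "\n".join(lines)

prompt = (
    "Event relation graph:\n" + verbalize(edges, moral_opinions) +
    "\n\nUsing the graph to avoid one-sided framing, write a neutral "
    "summary of the articles below.\n\n<articles here>"
)
print(prompt)
```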
https://arxiv.org/abs/2506.12978
We study multi-modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization, and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill the gap. We collect a dataset of 2,413 UI instructional videos, which spans over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enables comprehensive evaluation of concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the importance of new methods for UI instructional video summarization.
https://arxiv.org/abs/2506.12623
Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these, we propose SP-PRM, a novel dual-consistency framework integrating score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate that SP-PRM substantially enhances existing RGS methods, achieving a 3.6%-10.3% improvement in GPT-4 evaluation scores across all tasks.
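A minimal picture of reward-guided search with a process reward model: partial responses are scored as they grow, and the best-scoring beams survive each step. The proposal function and PRM below are toy stand-ins; SP-PRM's contribution is training the PRM so these partial scores stay consistent with complete-response preferences.

```python
from typing import Callable, List

def rgs_decode(candidates_fn: Callable[[str], List[str]],
               prm: Callable[[str], float],
               prefix: str, steps: int, beam: int = 2) -> str:
    """Greedy beam-style reward-guided search driven by a PRM."""
    beams = [prefix]
    for _ in range(steps):
        pool = [p + " " + c for p in beams for c in candidates_fn(p)]
        beams = sorted(pool, key=prm, reverse=True)[:beam]  # PRM scores partials
    return beams[0]

# Toy stand-ins for the policy's proposals and for the process reward model.
proposals = lambda p: ["the", "a", "friendly"]
prm = lambda text: text.count("friendly") - 0.1 * len(text.split())

print(rgs_decode(proposals, prm, "Response:", steps=3))
```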
https://arxiv.org/abs/2506.12446
Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, we propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus on improving instruction-following and guiding the model to generate complete and high-quality reasoning chains. To support this, we introduce a novel data construction strategy that produces rich, high-quality reasoning data. In the RL stage, we design a task-specific reward framework, including a reranking reward tailored for multimodal candidates and a composite template-based reward to further refine reasoning quality. We conduct extensive experiments on MMDocIR, a challenging public benchmark spanning multiple domains. MM-R5 achieves state-of-the-art performance on most metrics and delivers comparable results to much larger models on the remaining ones. Moreover, compared to the best retrieval-only method, MM-R5 improves recall@1 by over 4%. These results validate the effectiveness of our reasoning-enhanced training pipeline.
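The RL-stage reward can be pictured as below: a reranking term on the predicted candidate order plus a template check on the reasoning text. The `<think>`-style template, the regex, and the weights are illustrative assumptions, not MM-R5's exact specification.

```python
import re

def rerank_reward(predicted_order, gold, k=1):
    """1.0 if the gold document lands in the top-k of the predicted order."""
    return 1.0 if gold in predicted_order[:k] else 0.0

def format_reward(output_text):
    # Expects "<think> ... </think>" followed by a ranked list (assumed shape).
    ok = re.search(r"<think>.+</think>\s*ranking:\s*\[.+\]", output_text, re.S)
    return 1.0 if ok else 0.0

def composite_reward(output_text, predicted_order, gold,
                     w_rank=0.8, w_format=0.2):
    return (w_rank * rerank_reward(predicted_order, gold)
            + w_format * format_reward(output_text))

out = "<think>page 3 shows the table asked about</think> ranking: [3, 1, 2]"
print(composite_reward(out, [3, 1, 2], gold=3))  # 1.0
```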
https://arxiv.org/abs/2506.12364
Navigating the vast and rapidly growing body of scientific literature is a formidable challenge for aspiring researchers. Current approaches to supporting research idea generation often rely on generic large language models (LLMs). While LLMs are effective at aiding comprehension and summarization, they often fall short in guiding users toward practical research ideas. In this study, we present a novel structural framework for research ideation. Our framework, The Budget AI Researcher, uses retrieval-augmented generation (RAG) chains, vector databases, and topic-guided pairing to recombine concepts from hundreds of machine learning papers. The system ingests papers from nine major AI conferences, which collectively span the vast subfields of machine learning, and organizes them into a hierarchical topic tree. It uses the tree to identify distant topic pairs, generate novel research abstracts, and refine them through iterative self-evaluation against relevant literature and peer reviews, yielding abstracts that are both grounded in real-world research and demonstrably interesting. Experiments using LLM-based metrics indicate that our method significantly improves the concreteness of generated research ideas relative to standard prompting approaches. Human evaluations further demonstrate a substantial enhancement in the perceived interestingness of the outputs. By bridging the gap between academic data and creative generation, The Budget AI Researcher offers a practical, free tool for accelerating scientific discovery and lowering the barrier for aspiring researchers. Beyond research ideation, this approach inspires solutions to the broader challenge of generating personalized, context-aware outputs grounded in evolving real-world knowledge.
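The distant-pair step is easy to sketch: treat each leaf topic as its root-to-leaf path and rank pairs by how little path they share. The tree below is toy data, and the distance rule is one reasonable assumption about what "distant" means here.

```python
from itertools import combinations

# Root-to-leaf paths in a toy hierarchical topic tree.
paths = {
    "contrastive_vision": ["ml", "representation", "contrastive", "vision"],
    "rlhf_alignment":     ["ml", "llm", "alignment", "rlhf"],
    "graph_sampling":     ["ml", "graphs", "sampling"],
}

def tree_distance(p, q):
    """Number of non-shared path steps between two leaves."""
    shared = 0
    for a, b in zip(p, q):
        if a != b:
            break
        shared += 1
    return (len(p) - shared) + (len(q) - shared)

pairs = sorted(combinations(paths, 2),
               key=lambda ab: -tree_distance(paths[ab[0]], paths[ab[1]]))
print(pairs[0])  # most distant pair -> prompt an LLM to recombine them
```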
https://arxiv.org/abs/2506.12317
Enhancing clinical decision support (CDS), reducing documentation burdens, and improving patient health literacy remain persistent challenges in digital health. This paper presents an open-source, agent-based framework that integrates Large Language Models (LLMs) with HL7 FHIR data via the Model Context Protocol (MCP) for dynamic extraction and reasoning over electronic health records (EHRs). Built on the established MCP-FHIR implementation, the framework enables declarative access to diverse FHIR resources through JSON-based configurations, supporting real-time summarization, interpretation, and personalized communication across multiple user personas, including clinicians, caregivers, and patients. To ensure privacy and reproducibility, the framework is evaluated using synthetic EHR data from the SMART Health IT sandbox, which conforms to the FHIR R4 standard. Unlike traditional approaches that rely on hardcoded retrieval and static workflows, the proposed method delivers scalable, explainable, and interoperable AI-powered EHR applications. The agentic architecture further supports multiple FHIR formats, laying a robust foundation for advancing personalized digital health solutions.
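A hypothetical sketch of what declarative, JSON-based access might look like: a persona-scoped config lists FHIR resource types and filters, and a dispatcher turns each entry into a standard FHIR search query. The field names are invented for illustration, not the actual MCP-FHIR schema.

```python
# Illustrative config shape (assumed, not the MCP-FHIR schema).
config = {
    "persona": "patient",
    "resources": [
        {"type": "MedicationRequest", "filter": {"status": "active"}},
        {"type": "Observation", "filter": {"category": "laboratory"}},
    ],
    "task": "Summarize in plain language, avoiding jargon.",
}

def to_fhir_queries(cfg):
    """Map each declarative entry to a standard FHIR REST search path."""
    for res in cfg["resources"]:
        params = "&".join(f"{k}={v}" for k, v in res["filter"].items())
        yield f"/{res['type']}?patient={{id}}&{params}"

for q in to_fhir_queries(config):
    print(q)  # an agent would fetch these and pass results to the LLM
```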
https://arxiv.org/abs/2506.13800
The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts off-the-shelf video-language model (VidLM) captions into user-guided skims via large language model (LLM) judging, using no training data at all, outperforming all unsupervised methods and matching supervised ones. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally (iv) propagates those scores to the short-segment level via two new metrics: consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data while the competing methods require supervised frame-level importance. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.
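Step (iv) is sketched below under one plausible reading: each segment inherits the scene score, weighted by consistency (similarity to temporal neighbors) and uniqueness (dissimilarity from everything else). The embeddings and the exact combination rule are illustrative assumptions.

```python
import numpy as np

def propagate(scene_score, seg_embs):
    """Push a scene-level score down to its segments via
    consistency (neighbor similarity) and uniqueness (1 - mean similarity)."""
    seg_embs = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    sim = seg_embs @ seg_embs.T
    n = len(seg_embs)
    scores = []
    for i in range(n):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
        consistency = float(np.mean([sim[i, j] for j in neighbors]))
        others = [sim[i, j] for j in range(n) if j != i]
        uniqueness = 1.0 - float(np.mean(others))
        scores.append(scene_score * 0.5 * (consistency + uniqueness))
    return scores

rng = np.random.default_rng(1)
print(np.round(propagate(0.9, rng.normal(size=(4, 8))), 3))
```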
https://arxiv.org/abs/2506.10807
The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
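The final selection stage reduces to one-dimensional temporal NMS over scored segments (with KTS supplying the segment boundaries). A self-contained sketch with toy segments and an illustrative IoU threshold:

```python
def temporal_iou(a, b):
    """IoU of two (start, end, score) segments along the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_1d(segments, iou_thr=0.5):
    """Keep high-scoring segments, suppressing overlapping lower-scored ones."""
    rest = sorted(segments, key=lambda s: -s[2])
    keep = []
    while rest:
        best, rest = rest[0], rest[1:]
        keep.append(best)
        rest = [s for s in rest if temporal_iou(best, s) < iou_thr]
    return keep

segs = [(0, 10, 0.9), (2, 12, 0.8), (20, 30, 0.7)]
print(nms_1d(segs))  # [(0, 10, 0.9), (20, 30, 0.7)]
```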
https://arxiv.org/abs/2506.10430
Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
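The merge/remove/split bookkeeping can be sketched independently of the SSM: drop low-confidence queries, average near-duplicates, and split one surviving query into two perturbed copies. Thresholds and the split heuristic are assumptions, not DySS's learned rules.

```python
import numpy as np

def update_queries(queries, scores, drop_thr=0.2, merge_sim=0.95):
    keep = scores > drop_thr                     # remove low-confidence queries
    queries, scores = queries[keep], scores[keep]
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sim = q @ q.T
    merged, used = [], set()
    for i in range(len(queries)):
        if i in used:
            continue
        dup = [j for j in range(i + 1, len(queries))
               if sim[i, j] > merge_sim and j not in used]
        used.update(dup)
        group = np.vstack([queries[i], *queries[dup]]) if dup else queries[i:i+1]
        merged.append(group.mean(axis=0))        # merge near-duplicates
    merged = np.array(merged)
    mid = len(merged) // 2                       # split one query into two
    jitter = 0.01 * np.random.default_rng(0).normal(size=merged.shape[1])
    return np.vstack([merged, merged[mid] + jitter])

qs = np.random.default_rng(0).normal(size=(6, 4))
print(update_queries(qs, np.array([0.9, 0.1, 0.8, 0.85, 0.05, 0.6])).shape)
```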
https://arxiv.org/abs/2506.10242
Reviews are valuable resources for customers making purchase decisions in online shopping. However, it is impractical for customers to go over the vast number of reviews and manually distill the prominent opinions, which prompts the need for automated opinion summarization systems. Previous approaches, either extractive or abstractive, face challenges in automatically producing grounded aspect-centric summaries. In this paper, we propose a novel summarization system that not only captures predominant opinions from an aspect perspective with supporting evidence, but also adapts to varying domains without relying on a pre-defined set of aspects. Our proposed framework, ASESUM, summarizes viewpoints relevant to the critical aspects of a product by extracting aspect-centric arguments and measuring their salience and validity. We conduct experiments on a real-world dataset to demonstrate the superiority of our approach in capturing diverse perspectives of the original reviews compared to new and existing methods.
https://arxiv.org/abs/2506.09917
Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding others' intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agents' ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.
https://arxiv.org/abs/2506.09331
Video memorability refers to the ability of videos to be recalled after viewing, playing a crucial role in creating content that remains memorable. Existing models typically focus on extracting multimodal features to predict video memorability scores but often fail to fully utilize motion cues. The representation of motion features is compromised during the fine-tuning phase of the motion feature extractor due to a lack of labeled data. In this paper, we introduce the Text-Motion Cross-modal Contrastive Loss (TMCCL), a multimodal video memorability prediction model designed to enhance the representation of motion features. We tackle the challenge of improving motion feature representation by leveraging text description similarities across videos to establish positive and negative motion sample sets for a given target. This enhancement allows the model to learn similar feature representations for semantically related motion content, resulting in more accurate memorability predictions. Our model achieves state-of-the-art performance on two video memorability prediction datasets. Moreover, the potential applications of video memorability prediction have been underexplored. To address this gap, we present Memorability Weighted Correction for Video Summarization (MWCVS), using video memorability prediction to reduce subjectivity in video summarization labels. Experimental results on two video summarization datasets demonstrate the effectiveness of MWCVS, showcasing the promising applications of video memorability prediction.
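The loss can be read as a supervised-contrastive objective where positives come from caption similarity rather than class labels. A PyTorch sketch under that reading, with random embeddings and an illustrative threshold and temperature:

```python
import torch
import torch.nn.functional as F

def tmccl_loss(motion, text, thr=0.8, tau=0.1):
    """Contrastive loss whose positive pairs are videos with similar captions."""
    m = F.normalize(motion, dim=1)
    t = F.normalize(text, dim=1)
    eye = torch.eye(len(m), dtype=torch.bool)
    pos = ((t @ t.T) > thr) & ~eye            # caption-similarity positives
    logits = (m @ m.T / tau).masked_fill(eye, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    has_pos = pos.any(dim=1)
    return -per_anchor[has_pos].mean() if has_pos.any() else m.sum() * 0.0

# Toy batch: captions 0..3 are near-duplicates of 4..7, so those pairs
# become positives for the motion branch.
text = torch.randn(4, 32)
text = torch.cat([text, text + 0.05 * torch.randn(4, 32)])
motion = torch.randn(8, 16, requires_grad=True)
loss = tmccl_loss(motion, text)
loss.backward()
print(loss.item())
```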
https://arxiv.org/abs/2506.08649
We explore a generative relation extraction (RE) pipeline tailored to the study of interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Our method leverages summarization with large language models (LLMs) to refine context before extracting relations via instruction-tuned generation. Preliminary results on a dedicated corpus show that summarization improves generative RE performance by reducing noise and guiding the model. However, BERT-based RE approaches still outperform generative models. This ongoing work demonstrates the potential of generative methods to support the study of specialized domains in low-resource settings.
https://arxiv.org/abs/2506.08647