We propose VADER, a spatio-temporal matching, alignment, and change summarization method to help fight misinformation spread via manipulated videos. VADER matches and coarsely aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over adaptively chunked video content. A transformer-based alignment module then refines the temporal localization of the query fragment within the matched video. A space-time comparator module identifies regions of manipulation between aligned content, invariant to changes due to residual temporal misalignment or artifacts arising from non-editorial changes to the content. Robustly matching video to a trusted source enables conclusions to be drawn about video provenance, supporting informed trust decisions on the content encountered.
https://arxiv.org/abs/2303.13193
Electronic health records (EHRs) store an extensive array of patient information, encompassing medical histories, diagnoses, treatments, and test outcomes. These records are crucial for enabling healthcare providers to make well-informed decisions regarding patient care. Summarizing clinical notes further assists healthcare professionals in pinpointing potential health risks and making better-informed decisions. This process contributes to reducing errors and enhancing patient outcomes by ensuring providers have access to the most pertinent and current patient data. Recent research has shown that incorporating prompts with large language models (LLMs) substantially boosts the efficacy of summarization tasks. However, we show that this approach also leads to increased output variance, resulting in notably divergent outputs even when prompts share similar meanings. To tackle this challenge, we introduce a model-agnostic Soft Prompt-Based Calibration (SPeC) pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization. Experimental findings on multiple clinical note tasks and LLMs indicate that our method not only bolsters performance but also effectively curbs variance for various LLMs, providing a more uniform and dependable solution for summarizing vital medical information.
https://arxiv.org/abs/2303.13035
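To make the soft-prompt idea above concrete, here is a minimal sketch of a learnable prompt layer that prepends trainable embeddings to the token embeddings of a frozen LLM; the class name, prompt length, and initialization are illustrative assumptions, not SPeC's released implementation.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the input embeddings (sketch)."""

    def __init__(self, num_prompt_tokens: int, hidden_size: int):
        super().__init__()
        # Small random init; SPeC's actual initialization/calibration may differ.
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_size) from the LLM's embedding layer.
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Usage idea: embed the clinical note with the frozen model's embedding layer,
# pass it through SoftPrompt, then feed the result to the model; only the
# prompt parameters are trained, with the aim of reducing output variance.
```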
The lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for \emph{low resource (LR) languages} a critical problem. Existing work on Wikipedia text generation has focused on \emph{English only}, where English reference articles are summarized to generate English Wikipedia pages. However, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose \task{}, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, \data{}, spanning $\sim$69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary. The proposed system is based on a novel idea of neural unsupervised extractive summarization to coarsely identify salient information, followed by a neural abstractive model to generate the section-specific text. Extensive experiments show that multi-domain training is better than the multi-lingual setup on average.
https://arxiv.org/abs/2303.12308
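As a rough illustration of the two-stage idea (an unsupervised extractive step feeding an abstractive model), the snippet below scores candidate sentences by cosine similarity to the centroid of their embeddings and keeps the top-k; the centroid heuristic and function names are assumptions, not the paper's exact extractive module.

```python
import numpy as np

def extract_salient(sentence_embeddings: np.ndarray, k: int = 10) -> list:
    """Rank sentences by similarity to the centroid embedding and keep the top-k
    (a generic unsupervised extractive step; an abstractive model would then
    rewrite the selected sentences into the section text)."""
    centroid = sentence_embeddings.mean(axis=0)
    norms = np.linalg.norm(sentence_embeddings, axis=1) * np.linalg.norm(centroid)
    scores = sentence_embeddings @ centroid / np.clip(norms, 1e-8, None)
    return np.argsort(-scores)[:k].tolist()
```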
Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BILP to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modality summary. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
https://arxiv.org/abs/2303.12060
Opinion summarization provides an important solution for summarizing the opinions expressed across a large number of reviews. However, generating aspect-specific and general summaries is challenging due to the lack of annotated data. In this work, we propose two simple yet effective unsupervised approaches to generate both aspect-specific and general opinion summaries by training on synthetic datasets constructed from aspect-related review contents. Our first approach, Seed Words Based Leave-One-Out (SW-LOO), identifies aspect-related portions of reviews simply by exact-matching aspect seed words and outperforms existing methods by 3.4 ROUGE-L points on SPACE and 0.5 ROUGE-1 point on OPOSUM+ for aspect-specific opinion summarization. Our second approach, Natural Language Inference Based Leave-One-Out (NLI-LOO), identifies aspect-related sentences using an NLI model in a more general setting without seed words and outperforms existing approaches by 1.2 ROUGE-L points on SPACE for aspect-specific opinion summarization while remaining competitive on other metrics.
https://arxiv.org/abs/2303.11660
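The seed-word matching at the heart of SW-LOO can be sketched in a few lines; the seed words and reviews below are made-up examples, and the real system additionally builds leave-one-out synthetic training pairs from the selected content.

```python
def aspect_sentences(review_sentences, seed_words):
    """Keep sentences that exact-match at least one aspect seed word."""
    seeds = {w.lower() for w in seed_words}
    return [s for s in review_sentences if set(s.lower().split()) & seeds]

# Illustrative only: a "cleanliness" aspect for hotel reviews.
reviews = ["The room was spotless and clean.",
           "Staff were friendly at check-in.",
           "Dirty carpet in the hallway."]
print(aspect_sentences(reviews, {"clean", "spotless", "dirty"}))
# -> ['The room was spotless and clean.', 'Dirty carpet in the hallway.']
```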
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also leads to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning, yet the model capacity typically remains the same across the two phases. To achieve training efficiency w.r.t. training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recovering the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B-parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity while retaining the benefits of pre-trained textual representations for downstream tasks.
https://arxiv.org/abs/2303.10464
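A minimal sketch of the sparse-pre-train / dense-fine-tune recipe using PyTorch's pruning utilities on a toy layer; the 75% figure mirrors the abstract, but the layer and training loops are placeholders, not the authors' GPT-3 XL setup.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for one of the model's projection layers.
layer = nn.Linear(2048, 2048)

# Sparse pre-training: mask 75% of the weights (unstructured sparsity),
# so only the surviving 25% contribute during pre-training.
prune.random_unstructured(layer, name="weight", amount=0.75)
# ... pre-training loop here; the mask keeps pruned weights at zero ...

# Dense fine-tuning: bake in the mask and drop the re-parametrization so the
# previously zeroed weights become ordinary parameters free to learn again.
prune.remove(layer, "weight")
# ... fine-tuning loop on the now-dense layer ...
```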
Automatic evaluation metrics have been facilitating the rapid development of automatic summarization methods by providing instant and fair assessments of the quality of summaries. Most metrics have been developed for the general domain, especially news and meeting notes, or other language-generation tasks. However, these metrics are applied to evaluate summarization systems in different domains, such as biomedical question summarization. To better understand whether commonly used evaluation metrics are capable of evaluating automatic summarization in the biomedical domain, we conduct human evaluations of summarization quality from four different aspects of a biomedical question summarization task. Based on human judgments, we identify different noteworthy features for current automatic metrics and summarization systems as well. We also release a dataset of our human annotations to aid the research of summarization evaluation metrics in the biomedical domain.
https://arxiv.org/abs/2303.10328
We systematically study the capacity of two large language models for code - CodeT5 and Codex - to generalize to out-of-domain data. In this study, we consider two fundamental applications - code summarization, and code generation. We split data into domains following its natural boundaries - by an organization, by a project, and by a module within the software project. This makes recognition of in-domain vs out-of-domain data at the time of deployment trivial. We establish that samples from each new domain present both models with a significant challenge of distribution shift. We study how well different established methods can adapt models to better generalize to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from training data can achieve very strong performance. In fact, according to our experiments, this solution can outperform direct finetuning for very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method to adapt to multiple domains at once. We find that in the case of code generation, a model adapted to multiple domains simultaneously performs on par with those adapted to each domain individually.
https://arxiv.org/abs/2303.09128
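One lightweight way to realize "few-shot finetuning on examples retrieved from training data" is a simple lexical retriever over the training inputs, as sketched below; TF-IDF is an illustrative choice and may differ from the retrieval used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_support(query_code: str, train_inputs: list, k: int = 8) -> list:
    """Return the k training examples most similar to the out-of-domain query,
    to be used as a small finetuning set before inference."""
    n = len(train_inputs)
    matrix = TfidfVectorizer().fit_transform(train_inputs + [query_code])
    sims = cosine_similarity(matrix[n], matrix[:n]).ravel()
    top = sims.argsort()[::-1][:k]
    return [train_inputs[i] for i in top]
```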
We present Mirror, an open-source platform for data exploration and analysis powered by large language models. Mirror offers an intuitive natural language interface for querying databases, and automatically generates executable SQL commands to retrieve relevant data and summarize it in natural language. In addition, users can preview and manually edit the generated SQL commands to ensure the accuracy of their queries. Mirror also generates visualizations to facilitate understanding of the data. Designed with flexibility and human input in mind, Mirror is suitable for both experienced data analysts and non-technical professionals looking to gain insights from their data.
https://arxiv.org/abs/2303.08697
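The query loop Mirror describes (natural-language question, generated SQL, user preview/edit, execution) could look roughly like the sketch below; `draft_sql` is a hypothetical stand-in for the LLM call and is not part of Mirror's actual API.

```python
import sqlite3

def draft_sql(question: str) -> str:
    """Hypothetical placeholder for the LLM call that turns the question into SQL."""
    raise NotImplementedError("plug in a language model here")

def answer_question(question: str, db_path: str):
    """Generate SQL, let the user review or edit it, then execute it."""
    sql = draft_sql(question)
    print("Generated SQL (review or edit before running):\n", sql)
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()
```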
Automatic radiology report summarization is a crucial clinical task, whose key challenge is to maintain factual accuracy between produced summaries and ground-truth radiology findings. Existing research adopts reinforcement learning to directly optimize factual consistency metrics such as the CheXBert or RadGraph score. However, their decoding methods, using greedy search or beam search, consider no factual consistency when picking the optimal candidate, leading to limited improvement in factual consistency. To address this, we propose a novel second-stage summarizing approach, FactReranker, the first attempt to learn to choose the best summary from all candidates based on their estimated factual consistency scores. We propose to extract medical facts from the input medical report, its gold summary, and the candidate summaries based on the RadGraph schema, and design a fact-guided reranker to efficiently incorporate the extracted medical facts when selecting the optimal summary. We decompose the fact-guided reranker into factual knowledge graph generation and a factual scorer, which allows the reranker to model the mapping between the medical facts of the input text and those of its gold summary, and thus to select the optimal summary even though the gold summary cannot be observed during inference. We also present a fact-based ranking metric (RadMRR) for measuring the ability of the reranker to select factually consistent candidates. Experimental results on two benchmark datasets demonstrate the superiority of our method in generating summaries with higher factual consistency scores compared with existing methods.
https://arxiv.org/abs/2303.08335
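Schematically, the second-stage selection boils down to scoring each candidate by how well its extracted facts match the facts predicted for the (unobserved) gold summary, e.g. via triple-overlap F1 as below; this is a simplified stand-in, not the paper's learned knowledge-graph generator and scorer.

```python
def fact_f1(candidate_facts: set, target_facts: set) -> float:
    """F1 overlap between two sets of (entity, relation, entity) triples."""
    if not candidate_facts or not target_facts:
        return 0.0
    tp = len(candidate_facts & target_facts)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(candidate_facts), tp / len(target_facts)
    return 2 * precision * recall / (precision + recall)

def rerank(candidates, candidate_facts, predicted_gold_facts):
    """Return the candidate summary whose facts best match the predicted gold facts."""
    scores = [fact_f1(facts, predicted_gold_facts) for facts in candidate_facts]
    return candidates[max(range(len(scores)), key=scores.__getitem__)]
```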
Transformer-based autoregressive (AR) methods have achieved appealing performance for varied sequence-to-sequence generation tasks, e.g., neural machine translation, summarization, and code generation, but suffer from low inference efficiency. To speed up the inference stage, many non-autoregressive (NAR) strategies have been proposed in the past few years. Among them, the conditional masked language model (CMLM) is one of the most versatile frameworks, as it can support many different sequence generation scenarios and achieve very competitive performance on these tasks. In this paper, we further introduce a simple yet effective adaptive masking over masking strategy to enhance the refinement capability of the decoder and make the encoder optimization easier. Experiments on \textbf{3} different tasks (neural machine translation, summarization, and code generation) with \textbf{15} datasets in total confirm that our proposed simple method achieves significant performance improvement over the strong CMLM model. Surprisingly, our proposed model yields state-of-the-art performance on neural machine translation (\textbf{34.62} BLEU on WMT16 EN$\to$RO, \textbf{34.82} BLEU on WMT16 RO$\to$EN, and \textbf{34.84} BLEU on IWSLT De$\to$En) and even better performance than the \textbf{AR} Transformer on \textbf{7} benchmark datasets with at least \textbf{2.2$\times$} speedup. Our code is available at GitHub.
https://arxiv.org/abs/2303.07457
The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable and high-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. To address this issue, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, which achieves state-of-the-art performance on all datasets. Moreover, we collected a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at ~\url{this https URL}.
https://arxiv.org/abs/2303.07284
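For intuition, an inter-sample contrastive term of the kind described above can be written as a symmetric InfoNCE loss over a batch of paired video/text embeddings; this generic sketch uses an assumed temperature and is not A2Summ's exact formulation.

```python
import torch
import torch.nn.functional as F

def inter_sample_contrastive(video_emb, text_emb, temperature=0.07):
    """Paired video/text embeddings (same row) are positives; all other rows
    in the batch act as negatives. Shapes: (batch, dim)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # (batch, batch)
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```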
Multi-document summarization (MDS) aims to generate a summary for a set of related documents. We propose HGSUM, an MDS model that extends an encoder-decoder architecture to incorporate a heterogeneous graph representing different semantic units (e.g., words and sentences) of the documents. This contrasts with existing MDS models, which do not consider different edge types in their graphs and as such do not capture the diversity of relationships in the documents. To preserve only the key information and relationships of the documents in the heterogeneous graph, HGSUM uses graph pooling to compress the input graph. To guide HGSUM to learn this compression, we introduce an additional objective that maximizes the similarity between the compressed graph and the graph constructed from the ground-truth summary during training. HGSUM is trained end-to-end with the graph similarity and standard cross-entropy objectives. Experimental results over MULTI-NEWS, WCEP-100, and ARXIV show that HGSUM outperforms state-of-the-art MDS models. The code for our model and experiments is available at: this https URL.
https://arxiv.org/abs/2303.06565
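One plausible way to write the end-to-end objective described above (cross-entropy plus a graph-similarity term); the weighting coefficient $\lambda$ and the specific similarity form are assumptions on our part, not the paper's stated formula:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda\, \mathcal{L}_{\mathrm{graph}},
\qquad
\mathcal{L}_{\mathrm{graph}} \;=\; 1 - \mathrm{sim}\!\left(\mathcal{G}_{\mathrm{pooled}},\, \mathcal{G}_{\mathrm{gold}}\right)
```

where $\mathcal{G}_{\mathrm{pooled}}$ is the compressed input graph and $\mathcal{G}_{\mathrm{gold}}$ is the graph built from the ground-truth summary.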
Outlier detection is critical in real applications to prevent financial fraud, defend against network intrusions, or detect imminent device failures. To reduce the human effort in evaluating outlier detection results and effectively turn the outliers into actionable insights, users often expect a system to automatically produce interpretable summarizations of subgroups of outlier detection results. Unfortunately, to date no such system exists. To fill this gap, we propose STAIR, which learns a compact set of human-understandable rules to summarize and explain the anomaly detection results. Rather than using classical decision tree algorithms to produce these rules, STAIR proposes a new optimization objective to produce a small number of rules with the least complexity, and hence strong interpretability, that accurately summarize the detection results. The learning algorithm of STAIR produces a rule set by iteratively splitting large rules and is optimal in maximizing this objective at each iteration. Moreover, to effectively handle high-dimensional, highly complex datasets which are hard to summarize with simple rules, we propose a localized STAIR approach, called L-STAIR. Taking data locality into consideration, it simultaneously partitions the data and learns a set of localized rules for each partition. Our experimental study on many outlier benchmark datasets shows that STAIR significantly reduces the complexity of the rules required to summarize the outlier detection results, making them more amenable for humans to understand and evaluate, compared to decision tree methods.
https://arxiv.org/abs/2303.06261
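A toy version of the accuracy-versus-complexity trade-off STAIR optimizes, using axis-aligned interval rules; the rule representation, the penalty coefficient `alpha`, and the scoring form are illustrative assumptions rather than the paper's exact objective or splitting algorithm.

```python
def covers(rule, point):
    """rule: {feature_index: (low, high)}; True if the point satisfies every predicate."""
    return all(lo <= point[i] <= hi for i, (lo, hi) in rule.items())

def rule_set_score(rules, points, is_outlier, alpha=0.05):
    """Reward rule sets that reproduce the detector's outlier labels while
    penalizing the total number of predicates (a proxy for interpretability)."""
    predictions = [any(covers(r, p) for r in rules) for p in points]
    accuracy = sum(p == o for p, o in zip(predictions, is_outlier)) / len(points)
    complexity = sum(len(r) for r in rules)
    return accuracy - alpha * complexity
```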
Fine-tuning Natural Language Processing (NLP) models for each new data set requires additional computational time, with an associated increase in carbon footprint and cost. Fine-tuning does help pre-trained models adapt to the latest data sets, but what if we skip the fine-tuning steps and attempt to generate summaries using just the pre-trained models, reducing computational time and cost? In this paper, we omit the fine-tuning steps and investigate whether a Maximal Marginal Relevance (MMR)-based approach can help pre-trained models obtain query-focused summaries directly from a new data set that was not used to pre-train the models. First, we used topic modelling on the Wikipedia Current Events Portal (WCEP) and Debatepedia datasets to generate queries for the summarization tasks. Then, using MMR, we ranked the sentences of the documents according to the queries. Next, we passed the ranked sentences to seven transformer-based pre-trained models to perform the summarization tasks. Finally, we used the MMR approach again to select the query-relevant sentences from the generated summaries of the individual pre-trained models and constructed the final summary. As indicated by the experimental results, our MMR-based approach successfully ranked and selected the most relevant sentences as summaries and showed better performance than the individual pre-trained models.
https://arxiv.org/abs/2303.06230
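For readers unfamiliar with MMR, the selection step trades query relevance against redundancy with already-chosen sentences; the sketch below uses TF-IDF vectors and a lambda of 0.7 as illustrative choices, which may differ from the paper's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_rank(query, sentences, lam=0.7, k=10):
    """Greedy MMR: repeatedly pick the sentence maximizing
    lam * relevance(query) - (1 - lam) * max-similarity to sentences already selected."""
    matrix = TfidfVectorizer().fit_transform([query] + sentences)
    relevance = cosine_similarity(matrix[1:], matrix[0]).ravel()
    pairwise = cosine_similarity(matrix[1:])
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max(pairwise[i][j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```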
During a patient's hospitalization, the physician must record daily observations of the patient and summarize them into a brief document called a "discharge summary" when the patient is discharged. Automated generation of discharge summaries can greatly relieve the physicians' burden and has recently been addressed in the research community. Most previous studies of discharge summary generation using the sequence-to-sequence architecture focus only on inpatient notes as input. However, electronic health records (EHR) also contain rich structured metadata (e.g., hospital, physician, disease, length of stay, etc.) that might be useful. This paper investigates the effectiveness of medical meta-information for summarization tasks. We obtain four types of meta-information from the EHR systems and encode each type into a sequence-to-sequence model. Using Japanese EHRs, the meta-information-encoded models increased ROUGE-1 by up to 4.45 points and BERTScore by 3.77 points over the vanilla Longformer. We also found that the encoded meta-information improves the precision of its related terms in the outputs. Our results show the benefit of using medical meta-information.
https://arxiv.org/abs/2303.06002
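A simple way to feed structured metadata to a sequence-to-sequence summarizer is to prepend it as tagged text, roughly as below; the tag format and field names are illustrative, not the encoding scheme evaluated in the paper.

```python
def build_input(meta: dict, inpatient_notes: str) -> str:
    """Prepend EHR metadata as tagged text so the encoder can condition on it."""
    meta_str = " ".join(f"<{key}> {value}" for key, value in meta.items())
    return f"{meta_str} <notes> {inpatient_notes}"

example = build_input(
    {"hospital": "H1", "physician": "P3", "disease": "pneumonia", "length_of_stay": "12"},
    "Day 1: fever 38.5C, antibiotics started. Day 2: ...",
)
# The resulting string is tokenized and passed to the seq2seq model as usual.
```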
Query-focused meeting summarization (QFMS) aims to generate summaries from meeting transcripts in response to a given query. Previous works typically concatenate the query with the meeting transcript and implicitly model query relevance only at the token level through the attention mechanism. However, because key query-relevant information is diluted in long meeting transcripts, the original transformer-based model is insufficient to highlight the key parts related to the query. In this paper, we propose a query-aware framework that jointly models tokens and utterances based on Query-Utterance Attention. It calculates utterance-level relevance to the query with a dense retrieval module. Both token-level and utterance-level query relevance are then combined and explicitly incorporated into the generation process through the attention mechanism. We show that query relevance at different granularities contributes to generating a summary more related to the query. Experimental results on the QMSum dataset show that the proposed model achieves new state-of-the-art performance.
https://arxiv.org/abs/2303.04487
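A rough sketch of combining token-level and utterance-level query relevance into a single distribution that can bias decoder attention; the convex combination and softmax normalization are assumptions, not the paper's exact integration.

```python
import torch.nn.functional as F

def combine_relevance(token_scores, utterance_scores, token_to_utterance, alpha=0.5):
    """token_scores: (num_tokens,) token-level relevance to the query.
    utterance_scores: (num_utterances,) dense-retrieval relevance per utterance.
    token_to_utterance: (num_tokens,) long tensor mapping each token to its utterance."""
    per_token_utt = utterance_scores[token_to_utterance]          # broadcast to tokens
    combined = alpha * token_scores + (1 - alpha) * per_token_utt
    return F.softmax(combined, dim=-1)                            # distribution over source tokens
```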
In this work, we develop a prompting approach for incremental summarization of task videos. We develop a sample-efficient few-shot approach for extracting semantic concepts as an intermediate step. We leverage an existing model for extracting concepts from images, extend it to videos, and introduce a clustering and querying approach for sample efficiency, motivated by recent advances in perceiver-based architectures. Our work provides further evidence that enriching the input context with relevant entities and actions from the videos and using these as prompts can enhance the summaries generated by the model. We show results on a relevant dataset and discuss possible directions for the work.
https://arxiv.org/abs/2303.04361
Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and advances in multilingual language models. However, given the rarity of naturally occurring CLS resources, the majority of datasets are forced to rely on translation, which can contain overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alternation between languages mid-message is a common phenomenon in multilingual settings, yet it has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses. Our collection and code can be accessed at this https URL.
https://arxiv.org/abs/2303.04092
We present TrialsSummarizer, a system that aims to automatically summarize the evidence presented in the set of randomized controlled trials most relevant to a given query. Building on prior work, the system retrieves trial publications matching a query specifying a combination of condition, intervention(s), and outcome(s), and ranks these according to sample size and estimated study quality. The top-k such studies are passed through a neural multi-document summarization system, yielding a synopsis of these trials. We consider two architectures: a standard sequence-to-sequence model based on BART and a multi-headed architecture intended to provide greater transparency to end-users. Both models produce fluent and relevant summaries of the evidence retrieved for queries, but their tendency to introduce unsupported statements renders them inappropriate for use in this domain at present. The proposed multi-headed architecture may help users verify outputs by allowing them to trace generated tokens back to inputs.
https://arxiv.org/abs/2303.05392
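The retrieval-and-ranking front end described above amounts to sorting matched trials by estimated quality and sample size and concatenating the top-k for the summarizer; the field names in this sketch (`quality`, `sample_size`, `abstract`) are illustrative, not the system's actual schema.

```python
def build_summarizer_input(trials, k=5):
    """Rank retrieved trials by (estimated quality, sample size) and join the
    top-k abstracts into a single multi-document input for the summarizer."""
    ranked = sorted(trials, key=lambda t: (t["quality"], t["sample_size"]), reverse=True)
    return "\n\n".join(t["abstract"] for t in ranked[:k])
```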