Data visualization serves as a critical means for presenting data and mining its valuable insights. The task of chart summarization, through natural language processing techniques, facilitates in-depth data analysis of charts. However, existing approaches still show notable deficiencies in visual-language matching and reasoning ability. To address these limitations, this study constructs a large-scale dataset of comprehensive chart-caption pairs and fine-tuning instructions for each chart. Thanks to the dataset's broad coverage of topics and visual styles, better visual-language matching can be achieved from the perspective of the training data. Moreover, we propose an innovative chart summarization method, ChartThinker, which synthesizes deep analysis based on chains of thought and context-retrieval strategies, aiming to improve the logical coherence and accuracy of the generated summaries. Built upon the curated datasets, our trained model consistently exhibits superior performance on chart summarization tasks, surpassing 8 state-of-the-art models over 7 evaluation metrics. Our dataset and code are publicly accessible.
https://arxiv.org/abs/2403.11236
The advent of large language models (LLMs) has significantly advanced natural language processing tasks like text summarization. However, their large size and computational demands, coupled with privacy concerns in data transmission, limit their use in resource-constrained and privacy-centric settings. To overcome this, we introduce TriSum, a framework for distilling LLMs' text summarization abilities into a compact, local model. Initially, LLMs extract a set of aspect-triple rationales and summaries, which are refined using a dual-scoring method for quality. Next, a smaller local model is trained with these tasks, employing a curriculum learning strategy that evolves from simple to complex tasks. Our method enhances local model performance on various benchmarks (CNN/DailyMail, XSum, and ClinicalTrial), outperforming baselines by 4.5%, 8.5%, and 7.4%, respectively. It also improves interpretability by providing insights into the summarization rationale.
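The curriculum stage above trains the local model on tasks ordered from simple to complex. A minimal sketch of such ordering, using rationale length as an illustrative difficulty proxy (the scoring heuristic here is an assumption, not the paper's dual-scoring method):

```python
def curriculum_order(examples, difficulty):
    """Sort training examples from simple to complex for curriculum learning."""
    return sorted(examples, key=difficulty)

# Illustrative data: rationale length stands in for task difficulty.
examples = [
    {"doc": "d1", "rationale": "a b c d e f"},
    {"doc": "d2", "rationale": "a b"},
    {"doc": "d3", "rationale": "a b c d"},
]
ordered = curriculum_order(examples, lambda ex: len(ex["rationale"].split()))
```

Training would then feed `ordered` to the local model in this sequence, easiest batches first.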
https://arxiv.org/abs/2403.10351
While text summarization is a well-known NLP task, in this paper we introduce a novel and useful variant of it called functionality extraction from Git README files. Though this task is text2text generation at an abstract level, it involves its own peculiarities and challenges, making existing text2text generation systems not very useful. The motivation behind this task stems from a recent surge in research and development activities around the use of large language models for code-related tasks, such as code refactoring and code summarization. We also release a human-annotated dataset called FuncRead, and develop a battery of models for the task. Our exhaustive experimentation shows that small fine-tuned models beat any baseline models that can be designed using popular black-box or white-box large language models (LLMs) such as ChatGPT and Bard. Our best fine-tuned 7-billion-parameter CodeLlama model exhibits gains of 70% and 20% in F1 score against ChatGPT and Bard, respectively.
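The reported F1 gains compare predicted functionality lists against human annotations. A standard set-based precision/recall/F1 (assumed here as the matching scheme; the paper's exact matching may differ) can be computed as:

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 between predicted and gold items."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # true positives: items found in both lists
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, predicting two of four gold functionalities with one spurious item yields precision 2/3, recall 1/2, and F1 4/7.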
https://arxiv.org/abs/2403.10205
Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including text summarization and controlled text generation. However, their capacity to switch between styles via fine-tuning remains underexplored. This study concentrates on textual professionalism and introduces a novel methodology, named ProSwitch, which equips a language model with the ability to produce both professional and non-professional responses through knowledge-guided instruction tuning. ProSwitch unfolds across three phases: data preparation, for gathering domain knowledge and a training corpus; instruction tuning, for optimizing language models with multiple levels of instruction formats; and comprehensive evaluation, for assessing the professionalism discrimination and reference-based quality of the generated text. Comparative analysis of ProSwitch against both general and specialized language models reveals that our approach outperforms baselines in switching between professional and non-professional text generation.
https://arxiv.org/abs/2403.09131
Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.
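A minimal sketch of the key-token idea described above: score each cached position by its accumulated attention weight across generation steps, always keep a recent window, and retain only the top-scoring earlier positions. The scoring here is plain accumulated attention, a simplification of Keyformer's actual score function (which adds Gumbel-based regularization):

```python
def select_kv_positions(attn_rows, window, k):
    """Pick which KV-cache positions to keep.

    attn_rows: one list of attention weights per generated token, each row
    covering all cached positions seen at that step (so rows are ragged).
    """
    n = len(attn_rows[-1])  # current cache size
    score = [0.0] * n
    for row in attn_rows:  # accumulate attention mass per position
        for pos, w in enumerate(row):
            score[pos] += w
    recent = set(range(max(0, n - window), n))  # always keep the recent window
    older = sorted((p for p in range(n) if p not in recent),
                   key=lambda p: score[p], reverse=True)[:k]  # top-k "key" tokens
    return sorted(recent | set(older))
```

Dropping all positions outside the returned set shrinks both the KV cache and the per-step memory traffic, which is the bottleneck the paper targets.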
https://arxiv.org/abs/2403.09054
Charts provide visual representations of data and are widely used for analyzing information, addressing queries, and conveying insights to others. Various chart-related downstream tasks have emerged recently, such as question answering and summarization. A common strategy to solve these tasks is to fine-tune various models originally trained on vision or language tasks. However, such task-specific models are not capable of solving a wide range of chart-related tasks, constraining their real-world applicability. To overcome these challenges, we introduce ChartInstruct: a novel chart-specific vision-language instruction-following dataset comprising 191K instructions generated with 71K charts. We then present two distinct systems for instruction tuning on such datasets: (1) an end-to-end model that connects a vision encoder for chart understanding with an LLM; and (2) a pipeline model that employs a two-step approach to extract chart data tables and input them into the LLM. In experiments on four downstream tasks, we first show the effectiveness of our model--achieving a new set of state-of-the-art results. Further evaluation shows that our instruction-tuning approach supports a wide array of real-world chart comprehension and reasoning scenarios, thereby expanding the scope and applicability of our models to new kinds of tasks.
https://arxiv.org/abs/2403.09028
In the age of information abundance, the ability to provide users with contextually relevant and concise information is crucial. Keyword in Context (KIC) generation is a task that plays a vital role in applications such as search engines, personal assistants, and content summarization. In this paper, we present a novel approach to generating unambiguous and brief sentence contexts for given keywords using the T5 transformer model, leveraging data obtained from the Context-Reverso API. The code is available at this https URL .
https://arxiv.org/abs/2403.08103
Large language models (LLMs) excel in many diverse applications beyond language generation, e.g., translation, summarization, and sentiment analysis. One intriguing application is in text classification. This becomes pertinent in the realm of identifying hateful or toxic speech -- a domain fraught with challenges and ethical dilemmas. In our study, we have two objectives: firstly, to offer a literature review revolving around LLMs as classifiers, emphasizing their role in detecting and classifying hateful or toxic content. Subsequently, we explore the efficacy of several LLMs in classifying hate speech, identifying which LLMs excel at this task as well as their underlying attributes and training, providing insight into the factors that contribute to an LLM's proficiency (or lack thereof) in discerning hateful content. By combining a comprehensive literature review with an empirical analysis, our paper strives to shed light on the capabilities and constraints of LLMs in the crucial domain of hate speech detection.
https://arxiv.org/abs/2403.08035
As more than 70% of reviews in existing opinion summarization datasets are positive, current opinion summarization approaches are reluctant to generate negative summaries given negative input texts. To address such sentiment bias, a direct approach that avoids over-reliance on a specific framework is to generate additional data with large language models to balance the emotional distribution of the dataset. However, data augmentation based on large language models faces two disadvantages: 1) potential issues or toxicity in the augmented data; and 2) expensive costs. Therefore, in this paper, we propose a novel data augmentation framework based on both large and small language models for debiasing opinion summarization. Specifically, a small set of synthesized negative reviews is obtained by rewriting positive texts via a large language model. Then, a disentanglement reconstruction model is trained on the generated data. After training, a large amount of synthetic data can be obtained by decoding new representations formed from combinations of different sample representations and filtering based on confusion degree and sentiment classification. Experiments show that our framework can alleviate emotional bias as effectively as using only large models, but more economically.
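The final filtering step keeps a decoded sample only when the sentiment classifier and confusion-degree checks pass. A sketch of that gate, with hypothetical score inputs and thresholds (the paper does not specify these values; the direction of the thresholds is an assumption):

```python
def filter_synthetic(samples, min_neg_prob=0.9, max_confusion=0.5):
    """Keep synthetic reviews confidently classified as negative and low-confusion.

    samples: (text, negative_probability, confusion_degree) triples, where the
    scores are assumed to come from a sentiment classifier and a confusion
    measure computed upstream.
    """
    return [t for t, p_neg, conf in samples
            if p_neg >= min_neg_prob and conf <= max_confusion]
```

Only samples passing both gates would enter the debiased training set.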
https://arxiv.org/abs/2403.07693
Ensuring factual consistency between the summary and the original document is paramount in summarization tasks. Consequently, considerable effort has been dedicated to detecting inconsistencies. With the advent of Large Language Models (LLMs), recent studies have begun to leverage their advanced language understanding capabilities for inconsistency detection. However, early attempts have shown that LLMs underperform traditional models due to their limited ability to follow instructions and the absence of an effective detection methodology. In this study, we reassess summary inconsistency detection with LLMs, comparing the performances of GPT-3.5 and GPT-4. To advance research in LLM-based inconsistency detection, we propose SIFiD (Summary Inconsistency Detection with Filtered Document), which identifies key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents.
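A sketch of the similarity-based filtering idea: keep only document sentences relevant to the summary before handing the pair to the LLM detector. Token-overlap cosine similarity stands in here for the paper's NLI or embedding-based similarity, and the threshold is illustrative:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token lists via bag-of-words counts."""
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[t] * cb[t] for t in ca)
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0

def filter_document(sentences, summary, threshold=0.3):
    """Keep sentences similar enough to the summary for inconsistency checking."""
    s_tokens = summary.lower().split()
    return [s for s in sentences if cosine(s.lower().split(), s_tokens) >= threshold]
```

The retained sentences form the "filtered document" that the detector then compares against the summary.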
https://arxiv.org/abs/2403.07557
In this paper, we introduce summarization MevakerSumm and conclusion extraction MevakerConc datasets for the Hebrew language based on the State Comptroller and Ombudsman of Israel reports, along with two auxiliary datasets. We accompany these datasets with models for conclusion extraction (HeConE, HeConEspc) and conclusion allocation (HeCross). All of the code, datasets, and model checkpoints used in this work are publicly available.
https://arxiv.org/abs/2403.09719
Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.
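S2L clusters small-model training trajectories and then selects examples evenly across the clusters. A sketch of the balanced-selection step, assuming per-example cluster labels from trajectory clustering are already available (the clustering itself is omitted):

```python
from collections import defaultdict

def balanced_select(cluster_labels, budget):
    """Pick about `budget` example indices, drawn evenly across clusters.

    Small clusters are visited first so leftover quota flows to larger ones.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        groups[label].append(idx)
    selected, quota = [], budget
    clusters = sorted(groups.values(), key=len)  # smallest cluster first
    for i, members in enumerate(clusters):
        per_cluster = quota // (len(clusters) - i)  # even share of what remains
        take = min(len(members), per_cluster)
        selected.extend(members[:take])
        quota -= take
    return sorted(selected)
```

With a skewed label distribution, this still returns one example per cluster when the budget allows, which is the data-balancing effect the method relies on.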
https://arxiv.org/abs/2403.07384
New software and updates are downloaded by end users every day. Each downloaded software package has an End User License Agreement (EULA) associated with it, but this is rarely read. An EULA includes information to avoid legal repercussions. However, this poses a host of potential problems, such as spyware or producing an unwanted effect on the target system. End users do not read these EULAs because of the length of the document, and they find it extremely difficult to understand. Text summarization is one relevant solution to this kind of problem, which requires a system that can summarize the EULA and classify it as "Benign" or "Malicious". We propose such a solution: we summarize the EULA and classify it as "Benign" or "Malicious". We extract the EULA text of different software packages, then classify the text using eight different supervised classifiers. We use ensemble learning to classify the EULA as benign or malicious using five different text summarization methods. An accuracy of 95.8% shows the effectiveness of the presented approach.
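The ensemble step over the eight classifiers can be sketched as a simple majority vote (the paper's exact ensembling scheme may differ; this is the textbook variant):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    predictions: one label per classifier, e.g. 'Benign' or 'Malicious'.
    """
    return Counter(predictions).most_common(1)[0][0]
```

For instance, if five of eight classifiers say "Malicious", the ensemble label is "Malicious" regardless of the other three.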
https://arxiv.org/abs/2403.09715
User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the satisfaction rate of constraints is feasible. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute the constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and collects preference labels based on their CSR automatically. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby improving task performance. Further experiments show that the constraint-following capabilities are transferable.
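A sketch of the CSR computation and automatic preference labeling described above, using two hypothetical verifiers (a keyword-inclusion check and a length limit) as stand-ins for real constraint verifiers:

```python
def csr(response, verifiers):
    """Constraint satisfaction rate: fraction of verifiers the response passes."""
    return sum(v(response) for v in verifiers) / len(verifiers)

def preference_pairs(responses, verifiers):
    """Rank sampled responses by CSR; pair best with worst as (chosen, rejected)."""
    ranked = sorted(responses, key=lambda r: csr(r, verifiers), reverse=True)
    return ranked[0], ranked[-1]

# Hypothetical verifiers: the response must mention 'summary' and stay under 12 words.
verifiers = [
    lambda r: "summary" in r.lower(),
    lambda r: len(r.split()) < 12,
]
```

The (chosen, rejected) pairs produced this way would then feed the ranking-based learning step.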
https://arxiv.org/abs/2403.06326
Large pre-trained language models (PLMs) are at the forefront of advances in Natural Language Processing. One widespread use case of PLMs is "prompting" - or in-context learning - where a user provides a description of a task and some completed examples of the task to a PLM as context before prompting the PLM to perform the task on a new example. Only the largest, most capable PLMs are able to perform in-context learning effectively, and these models are typically trained with a predominantly English corpus, leaving all other languages behind. The data limitations in most languages preclude the training of language-specific PLMs capable of prompting. Despite the surge of work on prompting settings, it is still unclear how PLMs should be adapted cross-lingually specifically for prompting. We evaluate possible methods to adapt LLaMa, a 7B-parameter open-source PLM mainly trained in English, for prompting in low-resource languages, namely Kinyarwanda, Hausa, and Luganda. We consider three methods: few-shot prompting (prompt), language-adaptive fine-tuning (LAFT), and neural machine translation (translate), and evaluate on abstractive summarization, multi-class topic classification, and named-entity recognition. Although LAFT carries the greatest compute cost and intuitively should lead to the best results, our experiments exhibit that LAFT is only occasionally the optimal choice for adapting PLMs for prompting. Rather, the translate and prompt settings are a compute-efficient and cost-effective method of few-shot prompting for the selected low-resource languages. The results are task- and language-dependent, but the prompting method is the best on average across all tasks and languages: the prompt setting performs better than both translating and LAFT with statistical significance for all shots when aggregated across all tasks and languages.
https://arxiv.org/abs/2403.06018
Text summarization and simplification are among the most widely used applications of AI. However, models developed for such tasks are often prone to hallucination, which can result from training on unaligned data. One efficient approach to address this issue is Loss Truncation (LT) (Kang and Hashimoto, 2020), an approach that modifies the standard log loss to adaptively remove noisy examples during training. However, we find that LT alone yields a considerable number of hallucinated entities on various datasets. We study the behavior of the underlying losses between factual and non-factual examples, to understand and refine the performance of LT. We demonstrate that LT's performance is limited when the underlying assumption that noisy targets have higher NLL loss is not satisfied, and find that word-level NLL among entities provides a better signal for distinguishing factuality. We then leverage this to propose a fine-grained NLL loss and fine-grained data cleaning strategies, and observe improvements in hallucination reduction across some datasets. Our work is available at https://github.com/yale-nlp/fine-grained-lt.
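Loss Truncation drops the highest-loss fraction of examples each step, under the stated assumption that noisy (misaligned) targets incur higher NLL. A minimal sketch of that selection step, operating on precomputed per-example losses:

```python
def truncate_losses(losses, drop_frac):
    """Loss Truncation selection: keep the (1 - drop_frac) lowest-loss examples.

    Returns the sorted indices of the kept examples; gradients would only be
    taken on these.
    """
    keep_n = max(1, int(len(losses) * (1 - drop_frac)))
    order = sorted(range(len(losses)), key=lambda i: losses[i])  # ascending loss
    return sorted(order[:keep_n])
```

The paper's fine-grained variant would apply the same idea at the level of entity-token NLL rather than whole-example loss.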
https://arxiv.org/abs/2403.05788
Brief hospital course (BHC) summaries are common clinical documents generated by summarizing clinical notes. While large language models (LLMs) demonstrate remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications such as BHC synthesis have not been shown. To enable the adaptation of LLMs for BHC synthesis, we introduce a novel benchmark consisting of a pre-processed dataset extracted from MIMIC-IV notes, encapsulating clinical note and brief hospital course (BHC) pairs. We assess the performance of two general-purpose LLMs and three healthcare-adapted LLMs to improve BHC synthesis from clinical notes. Using clinical notes as input for generating BHCs, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to three open-source LLMs (Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5, GPT-4). We quantitatively evaluate the performance of these LLMs across varying context-length inputs using conventional natural language similarity metrics. We further perform a qualitative study in which five diverse clinicians blindly compare clinician-written BHCs and two LLM-generated BHCs for 30 samples across metrics of comprehensiveness, conciseness, factual correctness, and fluency. Overall, we present a new benchmark and pre-processed dataset for using LLMs in BHC synthesis from clinical notes. We observe high-quality summarization performance for both in-context proprietary and fine-tuned open-source LLMs using both quantitative metrics and a qualitative clinical reader study. We propose our work as a benchmark to motivate future works to adapt and assess the performance of LLMs in BHC synthesis.
https://arxiv.org/abs/2403.05720
Extensive efforts in the past have been directed toward the development of summarization datasets. However, a predominant number of these resources have been (semi)-automatically generated, typically through web data crawling, resulting in subpar resources for training and evaluating summarization systems, a quality compromise that is arguably due to the substantial costs associated with generating ground-truth summaries, particularly for diverse languages and specialized domains. To address this issue, we present ACLSum, a novel summarization dataset carefully crafted and evaluated by domain experts. In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers, covering challenges, approaches, and outcomes in depth. Through extensive experiments, we evaluate the quality of our resource and the performance of models based on pretrained language models and state-of-the-art large language models (LLMs). Additionally, we explore the effectiveness of extractive versus abstractive summarization within the scholarly domain on the basis of automatically discovered aspects. Our results corroborate previous findings in the general domain and indicate the general superiority of end-to-end aspect-based summarization. Our data is released at this https URL.
https://arxiv.org/abs/2403.05303
Keywords, that is, content-relevant words in summaries, play an important role in efficient information conveyance, making it critical to assess whether system-generated summaries contain such informative words during evaluation. However, existing evaluation metrics for extreme summarization models do not pay explicit attention to keywords in summaries, leaving developers ignorant of their presence. To address this issue, we present a keyword-oriented evaluation metric, dubbed ROUGE-K, which provides a quantitative answer to the question: how well do summaries include keywords? Through the lens of this keyword-aware metric, we surprisingly find that a current strong baseline model often misses essential information in its summaries. Our analysis reveals that human annotators indeed find summaries with more keywords to be more relevant to the source documents. This is an important yet previously overlooked aspect of evaluating summarization systems. Finally, to enhance keyword inclusion, we propose four approaches for incorporating word importance into a transformer-based model and experimentally show that they enable guiding models to include more keywords while keeping the overall quality. Our code is released at this https URL.
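ROUGE-K quantifies keyword inclusion. A sketch of the basic recall-style computation (the released metric may add normalization details beyond this simple form):

```python
def keyword_recall(summary, keywords):
    """Fraction of reference keywords that appear in the generated summary."""
    tokens = set(summary.lower().split())
    hits = sum(1 for k in keywords if k.lower() in tokens)
    return hits / len(keywords) if keywords else 0.0
```

A summary covering two of three reference keywords scores 2/3, regardless of how fluent the rest of the text is, which is exactly the signal n-gram-overlap metrics can dilute.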
https://arxiv.org/abs/2403.05186
We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.
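A simplified pessimistic-reward sketch of the idea above: penalize a response's predicted reward by its uncertainty so the policy cannot exploit reward inaccuracies. This lower-confidence-bound proxy is a stand-in, not the paper's full distributionally robust objective, and it assumes per-response uncertainty estimates (e.g., from the reward model's last-layer embeddings) are already available:

```python
def pessimistic_reward(mean_reward, uncertainty, beta=1.0):
    """Lower-confidence-bound reward: discount by uncertainty to resist over-optimization."""
    return mean_reward - beta * uncertainty

def best_response(candidates, beta=1.0):
    """candidates: (response, mean_reward, uncertainty) triples; pick by pessimistic reward."""
    return max(candidates, key=lambda c: pessimistic_reward(c[1], c[2], beta))[0]
```

Under this rule, a response with high predicted reward but high uncertainty loses to a slightly lower-reward but well-supported one, which is the over-optimization failure mode being guarded against.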
https://arxiv.org/abs/2403.05171