In tackling the challenge of Multi-Document Summarization (MDS), numerous methods have been proposed, spanning both extractive and abstractive summarization techniques. However, each approach has its own limitations, making it less effective to rely solely on either one. An emerging and promising strategy involves a synergistic fusion of extractive and abstractive summarization methods. Despite the plethora of studies in this domain, research on the combined methodology remains scarce, particularly in the context of Vietnamese language processing. This paper presents a novel Vietnamese MDS framework leveraging a two-component pipeline architecture that integrates extractive and abstractive techniques. The first component employs an extractive approach to identify key sentences within each document. This is achieved by a modification of the pre-trained BERT network, which derives semantically meaningful phrase embeddings using Siamese and triplet network structures. The second component utilizes the VBD-LLaMA2-7B-50b model for abstractive summarization, ultimately generating the final summary document. Our proposed framework demonstrates strong performance, attaining a ROUGE-2 score of 39.6% on the VN-MDS dataset and outperforming state-of-the-art baselines.
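To make the two-stage pipeline concrete, below is a minimal sketch of the extractive stage only, assuming sentence embeddings (e.g., from an SBERT-style model) are already computed. The centroid-similarity scoring and the helper name `select_key_sentences` are illustrative choices, not the paper's exact method; the selected sentences would then be handed to the abstractive model to write the final summary.

```python
import numpy as np

def select_key_sentences(sentence_embeddings, sentences, top_k=5):
    """Score each sentence by cosine similarity to the document centroid and keep the
    top-k as the extractive stage's output (an illustrative criterion, not the paper's)."""
    emb = sentence_embeddings / np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
    centroid = emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    scores = emb @ centroid                      # cosine similarity of each sentence to the centroid
    keep = np.argsort(-scores)[:top_k]
    return [sentences[i] for i in sorted(keep)]  # preserve original document order

# The returned sentences would then be concatenated into a prompt for the abstractive
# model (VBD-LLaMA2-7B-50b in the paper) to generate the final summary.
```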
https://arxiv.org/abs/2409.12134
Extract-then-Abstract is a naturally coherent paradigm to conduct abstractive summarization with the help of salient information identified by the extractive model. Previous works that adopt this paradigm train the extractor and abstractor separately and introduce extra parameters to highlight the extracted salients to the abstractor, which results in error accumulation and additional training costs. In this paper, we first introduce a parameter-free highlight method into the encoder-decoder framework: replacing the encoder attention mask with a saliency mask in the cross-attention module to force the decoder to focus only on salient parts of the input. A preliminary analysis compares different highlight methods, demonstrating the effectiveness of our saliency mask. We further propose the novel extract-and-abstract paradigm, ExtAbs, which jointly and seamlessly performs Extractive and Abstractive summarization tasks within a single encoder-decoder model to reduce error accumulation. In ExtAbs, the vanilla encoder is augmented to extract salients, and the vanilla decoder is modified with the proposed saliency mask to generate summaries. Built upon BART and PEGASUS, experiments on three datasets show that ExtAbs achieves superior performance to baselines on the extractive task and performs comparably to, or even better than, the vanilla models on the abstractive task.
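A minimal sketch of the saliency-mask idea described above, assuming a binary per-token saliency vector from the extractor; the function name and tensor shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_attention_with_saliency(q, k, v, saliency_mask):
    """q: (batch, tgt_len, d); k, v: (batch, src_len, d);
    saliency_mask: (batch, src_len), 1 for salient source tokens, 0 otherwise."""
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5          # (batch, tgt_len, src_len)
    # Replace the usual encoder padding mask: non-salient positions get -inf before softmax,
    # so the decoder can only attend to the parts the extractor marked as salient.
    scores = scores.masked_fill(saliency_mask.unsqueeze(1) == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Toy usage: 1 example, 4 decoder positions, 6 source tokens, model dim 8.
q, k, v = torch.randn(1, 4, 8), torch.randn(1, 6, 8), torch.randn(1, 6, 8)
saliency = torch.tensor([[1, 1, 0, 0, 1, 0]])
print(cross_attention_with_saliency(q, k, v, saliency).shape)  # torch.Size([1, 4, 8])
```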
https://arxiv.org/abs/2409.11827
Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, which predicts tokens based on all preceding ground truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens, potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (i.e., graph path-finding), our model exhibits near-perfect performance and effectively mitigates shortcut learning, a feat that standard training methods and baseline models have been unable to accomplish. Furthermore, we pretrain Semformer from scratch with 125M parameters, demonstrating its efficacy through measures of perplexity, in-context learning, and fine-tuning on summarization tasks.
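The training objective can be pictured as the usual next-token loss plus a regression term on the planning tokens. The sketch below assumes the planning-token hidden states and the frozen autoencoder's latents share the same shape, and the weighting term `beta` is an illustrative assumption, not a value from the paper.

```python
import torch.nn.functional as F

def semformer_loss(lm_logits, target_ids, plan_hidden, ae_latents, beta=1.0):
    """Next-token loss on the response plus a regression term that pushes the planning-token
    hidden states toward the response's latent representation from a frozen autoencoder."""
    lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)), target_ids.reshape(-1))
    plan_loss = F.mse_loss(plan_hidden, ae_latents)   # both (batch, num_plan_tokens, latent_dim)
    return lm_loss + beta * plan_loss
```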
https://arxiv.org/abs/2409.11143
Large Language Models (LLMs) have spurred interest in automatic evaluation methods for summarization, offering a faster, more cost-effective alternative to human evaluation. However, existing methods often fall short when applied to complex tasks like long-context summarization and dialogue-based meeting summarization. In this paper, we introduce CREAM (Comparison-Based Reference-Free Elo-Ranked Automatic Evaluation for Meeting Summarization), a novel framework that addresses the unique challenges of evaluating meeting summaries. CREAM leverages a combination of chain-of-thought reasoning and key facts alignment to assess the conciseness and completeness of model-generated summaries without requiring references. By employing an Elo ranking system, our approach provides a robust mechanism for comparing the quality of different models or prompt configurations.
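The Elo component can be illustrated with the standard rating update applied to pairwise summary comparisons. This is the textbook Elo formula, not code from the paper, and the K-factor of 32 is an assumption.

```python
def elo_update(rating_a, rating_b, outcome, k=32):
    """Standard Elo update after one pairwise comparison between summarizers A and B.
    outcome: 1.0 if A's summary is judged better, 0.0 if B's is, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (outcome - expected_a)
    rating_b += k * ((1.0 - outcome) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: two systems start at 1000; A wins one comparison.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```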
https://arxiv.org/abs/2409.10883
Keyphrase selection plays a pivotal role within the domain of scholarly texts, facilitating efficient information retrieval, summarization, and indexing. In this work, we explored how to apply fine-tuned generative transformer-based models to the specific task of keyphrase selection within Russian scientific texts. We experimented with four distinct generative models, namely ruT5, ruGPT, mT5, and mBART, and evaluated their performance in both in-domain and cross-domain settings. The experiments were conducted on the texts of Russian scientific abstracts from four domains: mathematics \& computer science, history, medicine, and linguistics. The use of generative models, namely mBART, led to gains in in-domain performance (up to 4.9\% in BERTScore, 9.0\% in ROUGE-1, and 12.2\% in F1-score) over three keyphrase extraction baselines for the Russian language. Although the results for cross-domain usage were significantly lower, they still demonstrated the capability to surpass baseline performances in several cases, underscoring the promising potential for further exploration and refinement in this research field.
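As an illustration of how such a fine-tuned generative model would be applied at inference time, here is a hedged sketch using the Hugging Face transformers API; the checkpoint name and the semicolon-separated output format are placeholders, not artifacts of the paper.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "my-org/mbart-keyphrases-ru"  # placeholder for a fine-tuned ruT5/mT5/mBART checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

abstract = "Текст аннотации научной статьи..."
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=64)
# Assumes the model was fine-tuned to emit keyphrases as a single delimited string.
keyphrases = [kp.strip() for kp in tokenizer.decode(outputs[0], skip_special_tokens=True).split(";")]
print(keyphrases)
```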
https://arxiv.org/abs/2409.10640
Dialogue summarization aims to provide a concise and coherent summary of conversations between multiple speakers. While recent advancements in language models have enhanced this process, summarizing dialogues accurately and faithfully remains challenging due to the need to understand speaker interactions and capture relevant information. Indeed, abstractive models used for dialogue summarization may generate summaries that contain inconsistencies. We suggest using the semantic information proposed for Spoken Language Understanding (SLU) in human-machine dialogue systems to obtain summaries of goal-oriented human-human dialogues that are more semantically faithful to the task. This study introduces three key contributions: First, we propose an exploration of how incorporating task-related information can enhance the summarization process, leading to more semantically accurate summaries. Then, we introduce a new evaluation criterion based on task semantics. Finally, we propose a new dataset version with increased annotated data standardized for research on task-oriented dialogue summarization. The study evaluates these methods using the DECODA corpus, a collection of French spoken dialogues from a call center. Results show that integrating task-related information into the models improves summary accuracy, even with varying word error rates.
https://arxiv.org/abs/2409.10070
Depression is a globally prevalent mental disorder with potentially severe repercussions if not addressed, especially in individuals with recurrent episodes. Prior research has shown that early intervention has the potential to mitigate or alleviate symptoms of depression. However, implementing such interventions in a real-world setting may pose considerable challenges. A promising strategy involves leveraging machine learning and artificial intelligence to autonomously detect depression indicators from diverse data sources. One of the most widely available and informative data sources is text, which can reveal a person's mood, thoughts, and feelings. In this context, virtual agents programmed to conduct interviews using clinically validated questionnaires, such as those found in the DAIC-WOZ dataset, offer a robust means for depression detection through linguistic analysis. Utilizing BERT-based models, which are powerful and versatile yet use fewer resources than contemporary large language models, to convert text into numerical representations significantly enhances the precision of depression diagnosis. These models adeptly capture complex semantic and syntactic nuances, improving the detection accuracy of depressive symptoms. Given the inherent limitations of these models concerning text length, our study proposes text summarization as a preprocessing technique to diminish the length and intricacies of input texts. Implementing this method within our uniquely developed framework for feature extraction and classification yielded an F1-score of 0.67 on the test set, surpassing all prior benchmarks, and 0.81 on the validation set, exceeding most previous results on the DAIC-WOZ dataset. Furthermore, we have devised a depression lexicon to assess summary quality and relevance. This lexicon constitutes a valuable asset for ongoing research in depression detection.
https://arxiv.org/abs/2409.08483
Knowledge conflict arises from discrepancies between information in the context of a large language model (LLM) and the knowledge stored in its parameters. This can hurt performance when using standard decoding techniques, which tend to ignore the context. Existing test-time contrastive methods seek to address this by comparing the LLM's output distribution with and without the context and adjusting the model according to the contrast between them. However, we find that these methods frequently misjudge the degree of conflict and struggle to handle instances that vary in their amount of conflict, with static methods over-adjusting when conflict is absent. We propose a fine-grained, instance-level approach called AdaCAD, which dynamically infers the weight of adjustment based on the degree of conflict, as measured by the Jensen-Shannon divergence between distributions representing contextual and parametric knowledge. Our experiments across four models on six diverse question-answering (QA) datasets and three summarization tasks demonstrate that our training-free adaptive method consistently outperforms other decoding methods on QA, with average accuracy gains of 14.21% (absolute) over a static contrastive baseline, and improves the factuality of summaries by 5.59 (AlignScore). Furthermore, our analysis shows that while decoding with contrastive baselines hurts performance when conflict is absent, AdaCAD mitigates these losses, making it more applicable to real-world datasets in which some examples have conflict and others do not.
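A minimal sketch of the adaptive adjustment described above: the Jensen-Shannon divergence between the context-conditioned and context-free next-token distributions serves as the per-instance weight of a contrastive correction. The exact scaling is an assumption based on the abstract, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two next-token distributions (per example)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: torch.sum(a * (torch.log(a + eps) - torch.log(b + eps)), dim=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adacad_adjusted_logits(logits_with_context, logits_without_context):
    """Contrastive correction whose strength reflects the measured degree of knowledge conflict:
    alpha is near 0 when context and parametric knowledge agree, larger when they clash."""
    p_ctx = F.softmax(logits_with_context, dim=-1)
    p_par = F.softmax(logits_without_context, dim=-1)
    alpha = js_divergence(p_ctx, p_par).unsqueeze(-1)
    return (1 + alpha) * logits_with_context - alpha * logits_without_context
```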
https://arxiv.org/abs/2409.07394
The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically fine-tuned models, with implications for model selection in applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.
https://arxiv.org/abs/2409.07314
The progress in text summarization techniques has been remarkable. However, the task of accurately extracting and summarizing necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We focus on the task of extracting research questions (RQs) from research papers and construct a new dataset consisting of machine learning papers, RQs extracted from these papers by GPT-4, and human evaluations of the extracted RQs from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarization and found that none of the functions showed sufficiently high correlations with human evaluations. We expect our dataset to provide a foundation for further research on developing better evaluation functions tailored to the RQ extraction task and to contribute to enhancing performance on the task. The dataset is available at this https URL.
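Meta-evaluation of this kind typically reports rank correlation between automatic scores and human judgments; the sketch below shows that computation on made-up numbers and is not data from the paper.

```python
from scipy.stats import pearsonr, spearmanr

# Made-up scores: human ratings of extracted RQs vs. an LLM-based evaluation function.
human_scores = [4, 2, 5, 3, 1, 4, 5, 2]
metric_scores = [0.71, 0.40, 0.66, 0.52, 0.35, 0.58, 0.69, 0.45]

rho, p_value = spearmanr(human_scores, metric_scores)
r, _ = pearsonr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f}), Pearson r = {r:.3f}")
```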
https://arxiv.org/abs/2409.06883
In the realm of Large Language Models (LLMs), the ability to process long contexts is increasingly crucial for tasks such as multi-round dialogues, code generation, and document summarization. This paper addresses the challenges of enhancing long-context performance, reducing computational complexity, and leveraging pretrained models, collectively termed the "impossible triangle." We introduce E2LLM (Encoder Elongated Large Language Models), a novel approach that effectively navigates this paradox. The method involves splitting long contexts into chunks, compressing each into an embedding vector via a pretrained text encoder, and utilizing an adapter to align these representations with a decoder-only LLM. Two training objectives, focusing on reconstruction of the encoder output and long-context instruction fine-tuning, are employed to facilitate the understanding of soft prompts by the LLM. Experimental results demonstrate that E2LLM achieves superior performance in long-context scenarios while balancing efficiency, performance, and compatibility with pretrained models. Our framework thus represents a significant advancement in the field, contributing to effective long-text modeling.
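A minimal sketch of the chunk-compress-align idea, assuming each chunk has already been encoded into a single vector by a frozen text encoder; the adapter architecture and dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

def chunk_tokens(token_ids, chunk_size=512):
    """Split a long token sequence into fixed-size chunks for the text encoder."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

class ChunkAdapter(nn.Module):
    """Maps one compressed vector per chunk (from a frozen text encoder) into the decoder
    LLM's embedding space, so each chunk becomes a single soft token in the prompt."""
    def __init__(self, enc_dim=384, dec_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(enc_dim, dec_dim), nn.GELU(), nn.Linear(dec_dim, dec_dim))

    def forward(self, chunk_vectors):          # (num_chunks, enc_dim)
        return self.proj(chunk_vectors)        # (num_chunks, dec_dim) soft-prompt embeddings

adapter = ChunkAdapter()
soft_prompt = adapter(torch.randn(12, 384))    # 12 chunks compressed to 12 soft tokens
print(soft_prompt.shape)                       # torch.Size([12, 4096])
```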
https://arxiv.org/abs/2409.06679
As the body of academic literature continues to grow, researchers face increasing difficulties in effectively searching for relevant resources. Existing databases and search engines often fall short of providing a comprehensive and contextually relevant collection of academic literature. To address this issue, we propose a novel framework that leverages Natural Language Processing (NLP) techniques. This framework automates the retrieval, summarization, and clustering of academic literature within a specific research domain. To demonstrate the effectiveness of our approach, we introduce CyLit, an NLP-powered repository specifically designed for the cyber risk literature. CyLit empowers researchers by providing access to context-specific resources and enabling the tracking of trends in the dynamic and rapidly evolving field of cyber risk. Through the automatic processing of large volumes of data, our NLP-powered solution significantly enhances the efficiency and specificity of academic literature searches. We compare the literature categorization results of CyLit to those presented in survey papers or generated by ChatGPT, highlighting the distinctive insights this tool provides into cyber risk research literature. Using NLP techniques, we aim to revolutionize the way researchers discover, analyze, and utilize academic resources, ultimately fostering advancements in various domains of knowledge.
https://arxiv.org/abs/2409.06226
LLM-powered personalization agent systems employ Large Language Models (LLMs) to predict users' behavior from their past activities. However, their effectiveness often hinges on the ability to effectively leverage extensive, long user historical data, which is difficult given the inherent noise and length of such data. Existing pretrained LLMs may generate summaries that are concise but lack the necessary context for downstream tasks, hindering their utility in personalization systems. To address these challenges, we introduce Reinforcement Learning from Prediction Feedback (RLPF). RLPF fine-tunes LLMs to generate concise, human-readable user summaries that are optimized for downstream task performance. By maximizing the usefulness of the generated summaries, RLPF effectively distills extensive user history data while preserving essential information for downstream tasks. Our empirical evaluation demonstrates significant improvements in both extrinsic downstream task utility and intrinsic summary quality, surpassing baseline methods by up to 22% on downstream task performance and achieving an up to 84.59% win rate on Factuality, Abstractiveness, and Readability. RLPF also achieves a remarkable 74% reduction in context length while improving performance on 16 out of 19 unseen tasks and/or datasets, showcasing its generalizability. This approach offers a promising solution for enhancing LLM personalization by effectively transforming long, noisy user histories into informative and human-readable representations.
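The feedback signal can be pictured as a reward that scores a generated summary by how well a frozen downstream predictor performs when given only that summary. The exact reward shaping below (binary task reward minus a small length penalty) is an assumption for illustration, not the paper's specification.

```python
def rlpf_reward(summary, future_activity, frozen_predictor, length_penalty=0.001):
    """Reward a generated user summary by whether a frozen downstream predictor, shown only
    the summary, predicts the user's future activity correctly, minus a mild length penalty."""
    prediction = frozen_predictor(summary)                 # e.g., an LLM prompted with the summary
    task_reward = 1.0 if prediction == future_activity else 0.0
    return task_reward - length_penalty * len(summary.split())
```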
https://arxiv.org/abs/2409.04421
This work proposes an efficient parallel algorithm for the non-monotone submodular maximization problem under a knapsack constraint over a ground set of size $n$. Our algorithm improves the best approximation factor of the existing parallel algorithm from $8+\epsilon$ to $7+\epsilon$ with $O(\log n)$ adaptive complexity. The key idea of our approach is a new alternate threshold algorithmic framework. This strategy alternately constructs two disjoint candidate solutions within a constant number of sequential rounds. The algorithm then boosts solution quality without sacrificing the adaptive complexity. Extensive experimental studies on three applications, Revenue Maximization, Image Summarization, and Maximum Weighted Cut, show that our algorithm not only significantly increases solution quality but also requires adaptivity comparable to state-of-the-art algorithms.
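For orientation, the sketch below shows the plain sequential density-greedy baseline for knapsack-constrained submodular maximization that threshold-style algorithms build on; it is not the paper's parallel alternate-threshold algorithm and offers no adaptivity guarantee.

```python
def greedy_knapsack_submodular(ground_set, f, cost, budget):
    """Sequential density-greedy baseline: repeatedly add the element with the best
    marginal-gain-to-cost ratio that still fits the budget."""
    selected, spent = set(), 0.0
    remaining = set(ground_set)
    while remaining:
        best, best_ratio = None, 0.0
        for e in remaining:
            if spent + cost[e] > budget:
                continue
            gain = f(selected | {e}) - f(selected)   # marginal gain; may be negative for non-monotone f
            if gain / cost[e] > best_ratio:
                best, best_ratio = e, gain / cost[e]
        if best is None:                              # nothing fits or no positive-density element left
            break
        selected.add(best)
        spent += cost[best]
        remaining.remove(best)
    return selected

# Toy example: modular (hence submodular) f, unit costs, budget 2.
f = lambda S: sum({"a": 3, "b": 2, "c": 1}[e] for e in S)
print(greedy_knapsack_submodular({"a", "b", "c"}, f, {"a": 1, "b": 1, "c": 1}, budget=2))  # {'a', 'b'}
```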
https://arxiv.org/abs/2409.04415
To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. LLM judges are typically evaluated by measuring the correlation with human judgments on generation tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that the judges used are mostly unable to improve task performance but are able to pick the better model. Our analysis uncovers a strong correlation between judgment performance and the candidate model's task performance. We observe that judges tend to choose the model of higher quality even if its answer is incorrect. Further, we show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance. In an ablation, we either swap or mask the candidate answers and observe that judges often keep the original judgment, providing evidence that judges incorporate writing style in their judgments. In summary, we find that regularities in the judgments are quantifiable using statistical measures and provide various angles on exploiting them.
https://arxiv.org/abs/2409.04168
In this paper, we investigate the presence of additive bias in Large Language Models (LLMs), drawing a parallel to the cognitive bias observed in humans where individuals tend to favor additive over subtractive changes. Using a series of controlled experiments, we tested various LLMs, including GPT-3.5 Turbo, Claude 3.5 Sonnet, Mistral, Math$\Sigma$tral, and Llama 3.1, on tasks designed to measure their propensity for additive versus subtractive modifications. Our findings demonstrate a significant preference for additive changes across all tested models. For example, in a palindrome creation task, Llama 3.1 favored adding letters 97.85% of the time over removing them. Similarly, in a Lego tower balancing task, GPT-3.5 Turbo chose to add a brick 76.38% of the time rather than remove one. In a text summarization task, Mistral 7B produced longer summaries in 59.40% to 75.10% of cases when asked to improve its own or others' writing. These results indicate that, similar to humans, LLMs exhibit a marked additive bias, which might have implications when LLMs are used on a large scale. Additive bias might increase resource use and environmental impact, leading to higher economic costs due to overconsumption and waste. This bias should be considered in the development and application of LLMs to ensure balanced and efficient problem-solving approaches.
https://arxiv.org/abs/2409.02569
Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability, and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations, and future improvements, providing researchers an extensive overview to advance abstractive summarization research. We provide comparison tables across the categorized techniques, offering insights into model complexity, scalability, and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.
https://arxiv.org/abs/2409.02413
The recent advances in large language models (LLMs) have significantly expanded their applications across various fields such as language generation, summarization, and complex question answering. However, their application to privacy compliance and technical privacy reviews remains under-explored, raising critical concerns about their ability to adhere to global privacy standards and protect sensitive user data. This paper seeks to address this gap by providing a comprehensive case study evaluating LLMs' performance in privacy-related tasks such as privacy information extraction (PIE), legal and regulatory key point detection (KPD), and question answering (QA) with respect to privacy policies and data protection regulations. We introduce a Privacy Technical Review (PTR) framework, highlighting its role in mitigating privacy risks during the software development life-cycle. Through an empirical assessment, we investigate the capacity of several prominent LLMs, including BERT, GPT-3.5, GPT-4, and custom models, in executing privacy compliance checks and technical privacy reviews. Our experiments benchmark the models across multiple dimensions, focusing on their precision, recall, and F1-scores in extracting privacy-sensitive information and detecting key regulatory compliance points. While LLMs show promise in automating privacy reviews and identifying regulatory discrepancies, significant gaps persist in their ability to fully comply with evolving legal standards. We provide actionable recommendations for enhancing LLMs' capabilities in privacy compliance, emphasizing the need for robust model improvements and better integration with legal and regulatory requirements. This study underscores the growing importance of developing privacy-aware LLMs that can both support businesses in compliance efforts and safeguard user privacy rights.
https://arxiv.org/abs/2409.02375
Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which can then be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question answering (QA), medicine QA, and commonsense QA, as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs on QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
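The retrieval step can be sketched as nearest-neighbor search in embedding space between the few-shot examples and a large web-crawled corpus; the function below assumes unit-normalized embeddings and is an illustration, not the authors' implementation.

```python
import numpy as np

def retrieve_similar(corpus_embeddings, fewshot_embeddings, top_k=100):
    """corpus_embeddings: (num_docs, dim); fewshot_embeddings: (num_fewshots, dim),
    both unit-normalized. Returns indices of the corpus documents whose best cosine
    similarity to any few-shot example is highest."""
    sims = corpus_embeddings @ fewshot_embeddings.T     # (num_docs, num_fewshots)
    best_per_doc = sims.max(axis=1)
    return np.argsort(-best_per_doc)[:top_k]

# The retrieved documents would then be rewritten by an instruction-tuned LLM into
# task-formatted training samples for fine-tuning.
```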
https://arxiv.org/abs/2409.02098
The effectiveness of long-context modeling is important for Large Language Models (LLMs) in various applications. Despite their potential, LLMs' efficacy in processing long context does not consistently meet expectations, posing significant challenges for efficient management of prolonged sequences in training. This difficulty is compounded by the scarcity of comprehensive and diverse training datasets suitable for long sequences, which stems from inherent length biases across different data sources, and the logistical complexities associated with massive data management for training in extended contexts. In this work, we introduce DataSculpt, a data construction framework designed to strategically augment the data architecture for extended-context training. Our thorough evaluations demonstrate DataSculpt's remarkable capacity to boost long-context training performance, achieving improvements including an 18.09% increase in retrieval augmentation, 21.23% in summarization, 21.27% in reading comprehension, and a 3.81% rise in code completion, all while preserving the models' overall proficiency with a 4.88% improvement.
https://arxiv.org/abs/2409.00997