Extracting semantic information from generated text is a useful tool for applications such as automated fact checking or retrieval augmented generation. Currently, this requires either separate models during inference, which increases computational cost, or destructive fine-tuning of the language model. Instead, we propose directly embedding information extraction capabilities into pre-trained language models using probing classifiers, enabling efficient simultaneous text generation and information extraction. For this, we introduce an approach called EMBER and show that it enables named entity recognition in decoder-only language models without fine-tuning them and while incurring minimal additional computational cost at inference time. Specifically, our experiments using GPT-2 show that EMBER maintains high token generation rates during streaming text generation, with only a negligible decrease in speed of around 1% compared to a 43.64% slowdown measured for a baseline using a separate NER model. Code and data are available at this https URL.
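The abstract gives no implementation details, but the core idea of a probing classifier is simple: a small classifier reads the frozen LM's hidden states as they are produced and predicts an NER tag per token, with no change to the LM itself. Below is a minimal stdlib-only sketch under assumed toy shapes; the probe here is a plain linear layer, which is an illustrative stand-in and not EMBER's actual architecture.

```python
# Sketch: a frozen LM emits one hidden-state vector per generated token;
# a small linear probe classifies each state into an NER tag. All shapes
# and weights here are toy stand-ins, not EMBER's real parameters.

def linear_probe(hidden, weights, bias):
    """Score each NER tag for one hidden-state vector."""
    return [
        sum(h * w for h, w in zip(hidden, row)) + b
        for row, b in zip(weights, bias)
    ]

def tag_stream(hidden_states, weights, bias, tags):
    """Classify hidden states as they arrive during streaming generation."""
    out = []
    for h in hidden_states:
        scores = linear_probe(h, weights, bias)
        out.append(tags[scores.index(max(scores))])
    return out

# Toy example: 2-dim hidden states, tags O / PER.
tags = ["O", "PER"]
weights = [[1.0, 0.0],   # row scoring "O"
           [0.0, 1.0]]   # row scoring "PER"
bias = [0.0, 0.0]
states = [[0.9, 0.1], [0.2, 0.8]]
print(tag_stream(states, weights, bias, tags))  # ['O', 'PER']
```

Because the probe only reuses hidden states the LM computes anyway, the extra cost per token is a single small matrix product, which is consistent with the near-zero slowdown the abstract reports.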
https://arxiv.org/abs/2403.11747
To meet the requirements of real-world applications, it is essential to control generations of large language models (LLMs). Prior research has tried to introduce reinforcement learning (RL) into controllable text generation, but most existing methods suffer from overfitting (fine-tuning-based methods) or semantic collapse (post-processing methods). Moreover, current RL methods are generally guided by coarse-grained (sentence/paragraph-level) feedback, which may lead to suboptimal performance owing to semantic twists or progressions within sentences. To tackle this, we propose a novel reinforcement learning algorithm named TOLE, which formulates TOken-LEvel rewards for controllable text generation and employs a "first-quantize-then-noise" paradigm to enhance the robustness of the RL algorithm. Furthermore, TOLE can be flexibly extended to multiple constraints with little computational expense. Experimental results show that our algorithm achieves superior performance on both single-attribute and multi-attribute control tasks. We have released our codes at this https URL
https://arxiv.org/abs/2403.11558
Automatic methods for evaluating machine-generated texts hold significant importance due to the expanding applications of generative systems. Conventional methods tend to grapple with a lack of explainability, issuing a solitary numerical score to signify the assessment outcome. Recent advancements have sought to mitigate this limitation by incorporating large language models (LLMs) to offer more detailed error analyses, yet their applicability remains constrained, particularly in industrial contexts where comprehensive error coverage and swift detection are paramount. To alleviate these challenges, we introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation. Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions: it performs efficient identification of errors in generated texts in the initial stage and subsequently delves into providing comprehensive diagnostic reports in the second stage. DEE is fine-tuned on our elaborately assembled dataset AntEval, which encompasses 15K examples from 4 real-world applications of Alipay that employ generative systems. The dataset covers newly emerged issues such as hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria. Experimental results affirm DEE's superiority over existing evaluation methods, with significant improvements in both human correlation and efficiency.
https://arxiv.org/abs/2403.11509
Conversational search provides a more convenient interface for users to search by allowing multi-turn interaction with the search engine. However, the effectiveness of the conversational dense retrieval methods is limited by the scarcity of training data required for their fine-tuning. Thus, generating more training conversational sessions with relevant labels could potentially improve search performance. Based on the promising capabilities of large language models (LLMs) on text generation, we propose ConvSDG, a simple yet effective framework to explore the feasibility of boosting conversational search by using LLM for session data generation. Within this framework, we design dialogue/session-level and query-level data generation with unsupervised and semi-supervised learning, according to the availability of relevance judgments. The generated data are used to fine-tune the conversational dense retriever. Extensive experiments on four widely used datasets demonstrate the effectiveness and broad applicability of our ConvSDG framework compared with several strong baselines.
https://arxiv.org/abs/2403.11335
Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). Over the past years, EBMs have attracted increasing interest not only from the core machine learning community, but also from application domains such as speech, vision, natural language processing (NLP) and so on, due to significant theoretical and algorithmic progress. The sequential nature of speech and language also presents special challenges and needs a different treatment from processing fixed-dimensional data (e.g., images). Therefore, the purpose of this monograph is to present a systematic introduction to energy-based models, including both algorithmic progress and applications in speech and language processing. First, the basics of EBMs are introduced, including classic models, recent models parameterized by neural networks, sampling methods, and various learning methods from the classic learning algorithms to the most advanced ones. Then, the application of EBMs in three different scenarios is presented, i.e., for modeling marginal, conditional and joint distributions, respectively. 1) EBMs for sequential data with applications in language modeling, where the main focus is on the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, and their applications in semi-supervised learning and calibrated natural language understanding.
https://arxiv.org/abs/2403.10961
While text summarization is a well-known NLP task, in this paper, we introduce a novel and useful variant of it called functionality extraction from Git README files. Though this task is a text2text generation at an abstract level, it involves its own peculiarities and challenges that make existing text2text generation systems not very useful. The motivation behind this task stems from a recent surge in research and development activities around the use of large language models for code-related tasks, such as code refactoring, code summarization, etc. We also release a human-annotated dataset called FuncRead, and develop a battery of models for the task. Our exhaustive experimentation shows that small fine-tuned models beat any baseline models that can be designed using popular black-box or white-box large language models (LLMs) such as ChatGPT and Bard. Our best fine-tuned 7 billion-parameter CodeLlama model exhibits 70% and 20% gains on the F1 score against ChatGPT and Bard, respectively.
https://arxiv.org/abs/2403.10205
The dynamic retrieval augmented generation (RAG) paradigm actively decides when and what to retrieve during the text generation process of Large Language Models (LLMs). There are two key elements of this paradigm: identifying the optimal moment to activate the retrieval module (deciding when to retrieve) and crafting the appropriate query once retrieval is triggered (determining what to retrieve). However, current dynamic RAG methods fall short in both aspects. Firstly, the strategies for deciding when to retrieve often rely on static rules. Secondly, the strategies for deciding what to retrieve typically limit themselves to the LLM's most recent sentence or the last few tokens, while the LLM's real-time information needs may span the entire context. To overcome these limitations, we introduce a new framework, DRAGIN, i.e., Dynamic Retrieval Augmented Generation based on the real-time Information Needs of LLMs. Our framework is specifically designed to decide when and what to retrieve based on the LLM's real-time information needs during the text generation process. We evaluate DRAGIN along with existing methods comprehensively over 4 knowledge-intensive generation datasets. Experimental results show that DRAGIN achieves superior performance on all tasks, demonstrating the effectiveness of our method. We have open-sourced all the code, data, and models in GitHub: this https URL
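The abstract does not spell out the trigger criterion here. As a rough illustration of "deciding when to retrieve" from the model's real-time state, the sketch below fires retrieval when the next-token distribution is uncertain, measured by Shannon entropy; the actual DRAGIN signal is more sophisticated, and the threshold is an assumed toy value.

```python
import math

# Sketch (not DRAGIN's exact criterion): trigger retrieval when the LM's
# next-token distribution is uncertain, measured by Shannon entropy.

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_retrieve(next_token_probs, threshold=1.0):
    """Retrieve when the model is unsure what to generate next."""
    return entropy(next_token_probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]   # peaked -> low entropy, keep generating
unsure    = [0.25, 0.25, 0.25, 0.25]   # flat   -> high entropy, go retrieve
print(should_retrieve(confident))  # False
print(should_retrieve(unsure))     # True
```

A static-rule baseline, by contrast, would retrieve every k tokens regardless of what the model currently knows, which is exactly the shortcoming the abstract criticizes.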
https://arxiv.org/abs/2403.10081
Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including text summarization and controlled text generation. However, their capacity to switch between styles via fine-tuning remains underexplored. This study concentrates on textual professionalism and introduces a novel methodology, named ProSwitch, which equips a language model with the ability to produce both professional and non-professional responses through knowledge-guided instruction tuning. ProSwitch unfolds across three phases: data preparation for gathering domain knowledge and training corpora; instruction tuning for optimizing language models with multiple levels of instruction formats; and comprehensive evaluation for assessing the professionalism discrimination and reference-based quality of generated text. Comparative analysis of ProSwitch against both general and specialized language models reveals that our approach outperforms baselines in switching between professional and non-professional text generation.
https://arxiv.org/abs/2403.09131
Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.
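The mechanism the abstract describes reduces, at its core, to retaining only the top-scoring tokens in the KV cache. The sketch below uses accumulated attention weight as the score, which is a simplification; Keyformer's actual contribution is a novel score function, and the cache entries here are toy placeholders.

```python
# Sketch: keep only the k highest-scoring "key" tokens in the KV cache.
# The score used here is plain accumulated attention weight, a stand-in
# for Keyformer's novel score function.

def prune_kv_cache(kv_cache, scores, k):
    """kv_cache: list of (key, value) per token; scores: one per token."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(keep)  # preserve original token order for positional coherence
    return [kv_cache[i] for i in keep]

cache = [("k0", "v0"), ("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
attn  = [0.05, 0.60, 0.05, 0.30]          # most attention mass on tokens 1 and 3
print(prune_kv_cache(cache, attn, k=2))   # [('k1', 'v1'), ('k3', 'v3')]
```

Since token generation is memory-bandwidth bound, shrinking the cache to the key tokens directly cuts the bytes moved per decoding step, which is where the reported latency and throughput gains come from.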
https://arxiv.org/abs/2403.09054
We explore a strategy to handle controversial topics in LLM-based chatbots based on Wikipedia's Neutral Point of View (NPOV) principle: acknowledge the absence of a single true answer and surface multiple perspectives. We frame this as retrieval augmented generation, where perspectives are retrieved from a knowledge base and the LLM is tasked with generating a fluent and faithful response from the given perspectives. As a starting point, we use a deterministic retrieval system and then focus on common LLM failure modes that arise during this approach to text generation, namely hallucination and coverage errors. We propose and evaluate three methods to detect such errors based on (1) word-overlap, (2) salience, and (3) LLM-based classifiers. Our results demonstrate that LLM-based classifiers, even when trained only on synthetic errors, achieve high error detection performance, with ROC AUC scores of 95.3% for hallucination and 90.5% for coverage error detection on unambiguous error cases. We show that when no training data is available, our other methods still yield good results on hallucination (84.0%) and coverage error (85.2%) detection.
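Of the three detectors, the word-overlap method (1) is simple enough to sketch directly: flag a generated sentence as a possible hallucination when few of its words appear in the retrieved perspectives. The tokenization and the 0.5 threshold below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of method (1), word-overlap hallucination detection: a sentence
# whose words are mostly absent from the retrieved perspectives is suspect.

def overlap_score(sentence, perspectives):
    """Fraction of the sentence's words that occur in the source perspectives."""
    sent = set(sentence.lower().split())
    src  = set(" ".join(perspectives).lower().split())
    return len(sent & src) / max(len(sent), 1)

def is_hallucination(sentence, perspectives, threshold=0.5):
    return overlap_score(sentence, perspectives) < threshold

perspectives = ["some experts argue the policy reduces costs"]
faithful = "experts argue the policy reduces costs"
invented = "the law was repealed in 1999"
print(is_hallucination(faithful, perspectives))  # False
print(is_hallucination(invented, perspectives))  # True
```

Coverage errors are the mirror image (source content missing from the response) and can be checked by overlapping in the opposite direction; the LLM-based classifiers the abstract favors replace this lexical test with a learned judgment.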
https://arxiv.org/abs/2403.08904
Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data. This enables future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods - dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for 2 agglutinative languages (isiXhosa and Finnish). We investigate pretrained solutions for T2X, which reveals that standard PLMs come up short. Fine-tuning machine translation models emerges as the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.
https://arxiv.org/abs/2403.07567
Although large language models (LLMs) have demonstrated impressive text generation capabilities, they are easily misled by the untruthful context provided by users or knowledge augmentation tools, thereby producing hallucinations. To prevent LLMs from being misled by untruthful information while still taking advantage of knowledge augmentation, we propose Truth-Aware Context Selection (TACS), a lightweight method to shield untruthful context from the inputs. TACS begins by performing truth detection on the input context, leveraging the parameterized knowledge within the LLM. Subsequently, it constructs a corresponding attention mask based on the truthfulness of each position, selecting the truthful context and discarding the untruthful context. Additionally, we introduce a new evaluation metric, Disturbance Adaption Rate, to further study the LLMs' ability to accept truthful information and resist untruthful information. Experimental results show that TACS can effectively filter information in context and significantly improve the overall quality of LLMs' responses when presented with misleading information.
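The mask-construction step can be illustrated in a few lines: given a per-position truthfulness score from some detector, keep positions above a threshold and mask the rest. The stub scores and 0.5 threshold below are assumptions for illustration; TACS's real detector probes the LLM's own internal representations rather than using an external score.

```python
# Sketch: build a binary attention mask from per-token truthfulness
# scores, keeping truthful positions and discarding untruthful ones.
# Scores and threshold are toy stand-ins for TACS's learned detector.

def truth_mask(truth_scores, threshold=0.5):
    """1 = attend to this position, 0 = mask it out."""
    return [1 if s >= threshold else 0 for s in truth_scores]

def apply_mask(tokens, mask):
    """Context the model effectively sees after masking."""
    return [t for t, m in zip(tokens, mask) if m]

tokens = ["Paris", "is", "the", "capital", "of", "Mars"]
scores = [0.9, 0.9, 0.9, 0.9, 0.9, 0.1]  # detector doubts "Mars"
mask = truth_mask(scores)
print(apply_mask(tokens, mask))  # ['Paris', 'is', 'the', 'capital', 'of']
```

Masking at the attention level, rather than editing the prompt text, is what keeps the method lightweight: the untruthful tokens stay in the input but contribute nothing to the model's attention.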
https://arxiv.org/abs/2403.07556
Event commonsense reasoning requires the ability to reason about the relationship between events, as well as infer implicit context underlying that relationship. However, data scarcity makes it challenging for language models to learn to generate commonsense inferences for contexts and questions involving interactions between complex events. To address this demand, we present COM2 (COMplex COMmonsense), a new dataset created by sampling multi-hop logical queries (e.g., the joint effect or cause of both event A and B, or the effect of the effect of event C) from an existing commonsense knowledge graph (CSKG), and verbalizing them using handcrafted rules and large language models into multiple-choice and text generation questions. Our experiments show that language models trained on COM2 exhibit significant improvements in complex reasoning ability, resulting in enhanced zero-shot performance in both in-domain and out-of-domain tasks for question answering and generative commonsense reasoning, without expensive human annotations.
https://arxiv.org/abs/2403.07398
Natural Language Processing (NLP) is an important branch of artificial intelligence that studies how to enable computers to understand, process, and generate human language. Text classification, which aims to assign text to predefined categories, is the most basic and classic task in NLP, and most NLP tasks can be regarded as classification tasks. In recent years, deep learning has achieved great success in many research fields and has become a standard technology in NLP, widely integrated into text classification tasks. Unlike numbers and images, text processing demands fine-grained handling of the input. Traditional text classification methods generally require preprocessing the input text data and obtaining good sample features through manual annotation before applying classical machine learning algorithms for classification. This paper therefore analyzes the application status of deep learning in three core NLP tasks: text representation, word order modeling, and knowledge representation. It explores the improvement and synergy achieved through natural language processing in the context of text classification, while also taking into account the challenges posed by adversarial techniques in text generation, text classification, and semantic parsing. An empirical study on text classification tasks demonstrates the effectiveness of interactive integration training, particularly in conjunction with TextCNN, highlighting the significance of these advancements for text classification augmentation and enhancement.
https://arxiv.org/abs/2403.09718
In this paper, we study the problem of Generalized Category Discovery (GCD), which aims to cluster unlabeled data from both known and unknown categories using the knowledge of labeled data from known categories. Current GCD methods rely only on visual cues, which neglects the multi-modality perceptive nature of human cognitive processes in discovering novel visual categories. To address this, we propose a two-phase TextGCD framework to accomplish multi-modality GCD by exploiting powerful Visual-Language Models. TextGCD mainly includes a retrieval-based text generation (RTG) phase and a cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon using category tags from diverse datasets and attributes from Large Language Models, generating descriptive texts for images in a retrieval manner. Second, CCT leverages disparities between textual and visual modalities to foster mutual learning, thereby enhancing visual GCD. In addition, we design an adaptive class aligning strategy to ensure the alignment of category perceptions between modalities as well as a soft-voting mechanism to integrate multi-modality cues. Experiments on eight datasets show the clear superiority of our approach over state-of-the-art methods. Notably, our approach outperforms the best competitor by 7.7% and 10.8% in All accuracy on ImageNet-1k and CUB, respectively.
https://arxiv.org/abs/2403.07369
This paper presents an exploration of Long Short-Term Memory (LSTM) networks in the realm of text generation, focusing on the utilization of historical datasets for Shakespeare and Nietzsche. LSTMs, known for their effectiveness in handling sequential data, are applied here to model complex language patterns and structures inherent in historical texts. The study demonstrates that LSTM-based models, when trained on historical datasets, can not only generate text that is linguistically rich and contextually relevant but also provide insights into the evolution of language patterns over time. The findings show models that are highly accurate and efficient in predicting text from the works of Nietzsche, with low loss values and a training time of 100 iterations: the model reaches an accuracy of 0.9521, indicating high accuracy, and a loss of 0.2518, indicating its effectiveness. On text from the works of Shakespeare, the model achieves an accuracy of 0.9125, indicating a low error rate, with the same training time of 100 iterations, mirroring the efficiency observed on the Nietzsche dataset. This efficiency demonstrates the effectiveness of the model design and training methodology, especially when handling complex literary texts. This research contributes to the field of natural language processing by showcasing the versatility of LSTM networks in text generation and offering a pathway for future explorations in historical linguistics and beyond.
https://arxiv.org/abs/2403.07087
We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback (RLHF), which is designed to enhance the alignment of large language models (LLMs) with human preferences. The framework addresses the limitations of current alignment approaches, which often struggle with the inconsistency and sparsity of human supervision signals, by integrating holistic rewards with aspect-specific rewards. This integration enables more precise and consistent guidance of language models towards desired outcomes, particularly in complex and open text generation tasks. By employing a methodology that filters and combines multiple rewards based on their consistency, the framework provides a reliable mechanism for improving model alignment. We validate our approach through applications in long-form question answering and machine translation tasks, employing gpt-3.5-turbo for pairwise comparisons, and demonstrate improvements over existing baselines. Our work underscores the effectiveness of hierarchical rewards modeling in refining LLM training processes for better human preference alignment. We release our code at this https URL.
https://arxiv.org/abs/2403.06754
Large language models (LLMs) have demonstrated remarkable capabilities across various NLP tasks. However, their computational costs are prohibitively high. To address this issue, previous research has attempted to distill the knowledge of LLMs into smaller models by generating annotated data. Nonetheless, these works have mainly focused on the direct use of LLMs for text generation and labeling, without fully exploring their potential to comprehend the target task and acquire valuable knowledge. In this paper, we propose EvoKD: Evolving Knowledge Distillation, which leverages the concept of active learning to interactively enhance the process of data generation using large language models, simultaneously improving the task capabilities of small domain model (student model). Different from previous work, we actively analyze the student model's weaknesses, and then synthesize labeled samples based on the analysis. In addition, we provide iterative feedback to the LLMs regarding the student model's performance to continuously construct diversified and challenging samples. Experiments and analysis on different NLP tasks, namely, text classification and named entity recognition show the effectiveness of EvoKD.
https://arxiv.org/abs/2403.06414
AI systems sometimes exhibit harmful unintended behaviors post-deployment. This is often despite extensive diagnostics and debugging by developers. Minimizing risks from models is challenging because the attack surface is so large. It is not tractable to exhaustively search for inputs that may cause a model to fail. Red-teaming and adversarial training (AT) are commonly used to make AI systems more robust. However, they have not been sufficient to avoid many real-world failure modes that differ from the ones adversarially trained on. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them. LAT leverages the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. We use LAT to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
https://arxiv.org/abs/2403.05030
The emergence of LLMs has ignited a fresh surge of breakthroughs in NLP applications, particularly in domains such as question-answering systems and text generation. As the need for longer context grows, a significant bottleneck in model deployment emerges due to the linear expansion of the Key-Value (KV) cache with the context length. Existing methods primarily rely on various hypotheses, such as sorting the KV cache based on attention scores for replacement or eviction, to compress the KV cache and improve model throughput. However, heuristics used by these strategies may wrongly evict essential KV cache entries, which can significantly degrade model performance. In this paper, we propose QAQ, a Quality Adaptive Quantization scheme for the KV cache. We theoretically demonstrate that the key cache and value cache exhibit distinct sensitivities to quantization, leading to the formulation of separate strategies for their non-uniform quantization. Through the integration of dedicated outlier handling, as well as an improved attention-aware approach, QAQ achieves up to 10x compression of the KV cache size with a negligible impact on model performance. QAQ significantly reduces the practical hurdles of deploying LLMs, opening up new possibilities for longer-context applications. The code is available at this http URL.
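The building block behind such a scheme is uniform affine quantization, applied at different bit-widths to the key and value caches according to their sensitivity. The sketch below shows the basic quantize/dequantize round trip at an assumed 4-bit setting; QAQ's actual per-cache bit allocation, outlier handling, and attention-aware refinements are not reproduced here.

```python
# Sketch: uniform affine quantization, the primitive that a scheme like
# QAQ applies at different bit-widths to key vs. value caches. The 4-bit
# choice and the toy cache values below are illustrative assumptions.

def quantize(values, bits):
    """Map floats onto (2**bits - 1) evenly spaced integer levels."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate floats from integer levels."""
    return [x * scale + lo for x in q]

key_cache = [0.12, -0.50, 0.33, 0.81]
q, scale, lo = quantize(key_cache, bits=4)   # e.g. keys kept at 4 bits
restored = dequantize(q, scale, lo)
err = max(abs(a - b) for a, b in zip(key_cache, restored))
print(err <= scale)  # True: reconstruction error bounded by one quant step
```

The memory saving is direct: storing 4-bit levels instead of 16-bit floats shrinks that cache portion 4x, and allocating fewer bits to the less sensitive cache is what pushes the overall ratio toward the reported 10x.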
https://arxiv.org/abs/2403.04643