Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it requires several rounds of pruning and re-training to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for efficiently pruning a multilingual ASR model, resulting in either sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
https://arxiv.org/abs/2309.13018
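A minimal sketch of the adaptive-masking idea in PyTorch, assuming a simple magnitude criterion: instead of freezing a sub-network after an initial pruning pass, the sparsity mask is periodically recomputed from the current weights during fine-tuning. The layer, loss, sparsity level, and re-masking interval are illustrative placeholders, not the paper's configuration.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the largest-magnitude weights at the given sparsity."""
    k = max(1, int(weight.numel() * (1.0 - sparsity)))        # number of weights to keep
    threshold = weight.abs().flatten().topk(k).values.min()
    return (weight.abs() >= threshold).float()

# Stand-in for one encoder layer of a multilingual ASR model.
layer = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-4)
sparsity, remask_every = 0.7, 100

masks = {name: magnitude_mask(p.data, sparsity)
         for name, p in layer.named_parameters() if p.dim() > 1}

for step in range(1000):
    x = torch.randn(8, 512)
    loss = layer(x).pow(2).mean()                              # placeholder training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step > 0 and step % remask_every == 0:                  # adapt the mask during training
        masks = {name: magnitude_mask(p.data, sparsity)
                 for name, p in layer.named_parameters() if p.dim() > 1}
    with torch.no_grad():                                      # enforce the current sparsity pattern
        for name, p in layer.named_parameters():
            if name in masks:
                p.mul_(masks[name])
```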
Nested Event Extraction (NEE) aims to extract complex event structures in which an event recursively contains other events as its arguments. Nested events involve a kind of Pivot Element (PE) that simultaneously acts as an argument of outer events and as a trigger of inner events, thus connecting them into nested structures. This special characteristic of PEs brings challenges to existing NEE methods, as they cannot cope well with the dual identities of PEs. Therefore, this paper proposes a new model, called PerNee, which extracts nested events mainly based on recognizing PEs. Specifically, PerNee first recognizes the triggers of both inner and outer events and further recognizes the PEs by classifying the relation type between trigger pairs. To obtain better representations of triggers and arguments and further improve NEE performance, PerNee incorporates the information of both event types and argument roles through prompt learning. Since existing NEE datasets (e.g., Genia11) are limited to specific domains and contain a narrow range of event types with nested structures, we systematically categorize nested events in the generic domain and construct a new NEE dataset, namely ACE2005-Nest. Experimental results demonstrate that PerNee consistently achieves state-of-the-art performance on ACE2005-Nest, Genia11 and Genia13.
https://arxiv.org/abs/2309.12960
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended high-quality texts that are non-trivial to distinguish from human-written texts. We refer to such LLM-generated texts as \emph{deepfake texts}. There are currently over 11K text generation models in the huggingface model repo. As such, users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and misinformation at scale. To mitigate this problem, a computational method to determine if a given text is a deepfake text or not is desired--i.e., a Turing Test (TT). In particular, in this work, we investigate the more general version of the problem, known as \emph{Authorship Attribution (AA)}, in a multi-class setting--i.e., not only determining if a given text is a deepfake text or not but also being able to pinpoint which LLM is the author. We propose \textbf{TopRoBERTa} to improve existing AA solutions by capturing more linguistic patterns in deepfake texts by including a Topological Data Analysis (TDA) layer in the RoBERTa model. We show the benefits of having a TDA layer when dealing with noisy, imbalanced, and heterogeneous datasets, by extracting TDA features from the reshaped $pooled\_output$ of RoBERTa as input. We use RoBERTa to capture contextual representations (i.e., semantic and syntactic linguistic features), while using TDA to capture the shape and structure of data (i.e., linguistic structures). Finally, \textbf{TopRoBERTa} outperforms the vanilla RoBERTa on 2 of 3 datasets, achieving up to a 7\% increase in Macro F1 score.
https://arxiv.org/abs/2309.12934
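A hedged sketch of augmenting RoBERTa's $pooled\_output$ with TDA features. The reshape into a 24x32 point cloud, the choice of persistence statistics, and the use of the ripser library are assumptions made for illustration; the paper's TDA layer may be constructed differently.

```python
import numpy as np
import torch
from ripser import ripser                      # pip install ripser
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def tda_features(pooled: np.ndarray, shape=(24, 32)) -> np.ndarray:
    """Reshape pooled_output into a point cloud and summarize its persistence diagrams."""
    cloud = pooled.reshape(shape)               # 24 "points" in 32 dimensions
    diagrams = ripser(cloud, maxdim=1)["dgms"]  # H0 and H1 persistence diagrams
    feats = []
    for dgm in diagrams:
        finite = dgm[np.isfinite(dgm[:, 1])]
        persistence = finite[:, 1] - finite[:, 0]
        feats += [float(persistence.sum()),
                  float(persistence.max()) if len(persistence) else 0.0,
                  float(len(persistence))]
    return np.asarray(feats, dtype=np.float32)

text = "A possibly machine-generated paragraph."
with torch.no_grad():
    out = encoder(**tokenizer(text, return_tensors="pt"))
pooled = out.pooler_output[0].numpy()           # RoBERTa pooled_output, shape (768,)
features = np.concatenate([pooled, tda_features(pooled)])  # input to a classification head
```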
Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
https://arxiv.org/abs/2309.12881
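A minimal zero-shot in-context prompt for utterance-level affect recognition, assuming the OpenAI chat API as the LLM and an IEMOCAP-style label set; the prompts, label sets, and models evaluated in the paper will differ.

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def recognize_affect(dialogue_history: str, utterance: str) -> str:
    """Zero-shot affect label for the current utterance, given the dialogue context."""
    prompt = (
        "You label the emotion of the last speaker in a conversation.\n\n"
        f"Dialogue so far:\n{dialogue_history}\n\n"
        f"Current utterance: {utterance}\n\n"
        "Answer with exactly one label from: neutral, happy, sad, angry, excited, frustrated."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(recognize_affect("A: How was the interview?\nB: They never called me back.",
                       "I really thought I had it this time."))
```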
This work aims to provide an overview of the open-source multilanguage tool called StyloMetrix. It offers stylometric text representations that cover various aspects of grammar, syntax and lexicon. StyloMetrix covers four languages: Polish as the primary language, English, Ukrainian and Russian. The normalized output of each feature can become a fruitful source for machine learning models and a valuable addition to the embeddings layer for any deep learning algorithm. We strive to provide a concise but exhaustive overview of the application of the StyloMetrix vectors, as well as to explain the sets of developed linguistic features. The experiments have shown promising results in supervised content classification with simple algorithms such as the Random Forest Classifier, Voting Classifier, Logistic Regression and others. The deep learning assessments have unveiled the usefulness of the StyloMetrix vectors in enhancing an embedding layer extracted from Transformer architectures. StyloMetrix has proven itself to be a formidable resource for machine learning and deep learning algorithms executing different classification tasks.
https://arxiv.org/abs/2309.12810
As a common approach to learning English, reading comprehension primarily entails reading articles and answering related questions. However, the complexity of designing effective exercises means that students typically encounter standardized questions, making it challenging to align exercises with individual learners' reading comprehension ability. Leveraging the advanced capabilities offered by large language models, exemplified by ChatGPT, this paper presents a novel personalized support system for reading comprehension, referred to as ChatPRCS, based on the Zone of Proximal Development theory. ChatPRCS employs methods including reading comprehension proficiency prediction, question generation, and automatic evaluation, among others, to enhance reading comprehension instruction. First, we develop a new algorithm that can predict learners' reading comprehension abilities using their historical data as the foundation for generating questions at an appropriate level of difficulty. Second, a series of new ChatGPT prompt patterns is proposed to address two key aspects of reading comprehension objectives: question generation and automated evaluation. These patterns further improve the quality of generated questions. Finally, by integrating personalized ability prediction and reading comprehension prompt patterns, ChatPRCS is systematically validated through experiments. Empirical results demonstrate that it provides learners with high-quality reading comprehension questions that are broadly aligned with expert-crafted questions at a statistical level.
https://arxiv.org/abs/2309.12808
Large Language Models (LLMs), acting as powerful reasoners and generators, exhibit extraordinary performance across various natural language tasks, such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) stands as a widely discussed category, necessitating seamless integration between LLMs and the retrieval of external knowledge. Existing methods employ an LLM to generate reasoning paths and plans and utilize information retrieval (IR) to iteratively retrieve related knowledge, but these approaches have inherent flaws. On one hand, the information retriever is hindered by the low quality of the queries generated by the LLM. On the other hand, the LLM is easily misguided by irrelevant knowledge returned by the retriever. These inaccuracies, accumulated over the iterative interaction between IR and the LLM, ultimately cripple effectiveness. To overcome the above barriers, in this paper we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), comprising an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest Reasoning operates by masking the previous reasoning paths and generated queries from the LLM, encouraging the LLM to generate its chain of thought from scratch in each iteration. This enables the LLM to break free of any previous misleading thoughts and queries. 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by the LLM. Our methods are evaluated on three highly recognized public multi-hop question answering datasets and outperform the state of the art on most metrics (achieving a 10%-12% gain in answer accuracy).
https://arxiv.org/abs/2309.12767
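A control-flow sketch of the Furthest Reasoning loop. All helper functions are hypothetical stand-ins so the loop runs end to end (a real system would back them with an LLM, the trained Plan Assessor, and an IR module); the point it illustrates is that previous reasoning paths and queries are masked out of each new prompt.

```python
# Hypothetical stand-ins, named only for illustration.
def llm_propose_plans(prompt, k=5):
    return [(f"reasoning {i}", f"query {i}") for i in range(k)]

def plan_assessor(question, knowledge, plans):
    return 0                                        # pick the first candidate plan

def retrieve(query):
    return [f"passage retrieved for '{query}'"]

def llm_answer(prompt):
    return "final answer"

def furthest_reasoning(question, max_hops=3):
    knowledge = []                                  # accumulated retrieved passages
    for _ in range(max_hops):
        # Previous reasoning paths and queries are deliberately excluded from the prompt,
        # so the LLM rebuilds its chain of thought from scratch at every hop.
        prompt = ("Question: " + question + "\nKnown facts:\n" + "\n".join(knowledge) +
                  "\nThink step by step and propose the next retrieval query.")
        plans = llm_propose_plans(prompt, k=5)      # candidate (reasoning, query) plans
        reasoning, query = plans[plan_assessor(question, knowledge, plans)]
        knowledge.extend(retrieve(query))
    return llm_answer("Question: " + question + "\nFacts:\n" + "\n".join(knowledge))

print(furthest_reasoning("Who directed the film adapted from the novel written by ...?"))
```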
Self-supervised representation learning (SSRL) has improved performance on downstream phoneme recognition relative to supervised models. Training SSRL models requires a large amount of pre-training data, which poses a challenge for low-resource languages. A common approach is transferring knowledge from other languages. Instead, we propose to use audio augmentation to pre-train SSRL models in a low-resource condition and evaluate phoneme recognition as the downstream task. We performed a systematic comparison of augmentation techniques, namely: pitch variation, noise addition, accented target-language speech and other-language speech. We found that combined augmentation (noise/pitch) was the best strategy, outperforming accent and language knowledge transfer. We compared performance with various quantities and types of pre-training data, and examined the scaling factor of augmented data needed to achieve performance equivalent to models pre-trained with target-domain speech. Our findings suggest that for resource-constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other-language speech.
https://arxiv.org/abs/2309.12763
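A minimal sketch of the combined noise/pitch augmentation using librosa and numpy; the pitch-shift range, noise level, and example audio are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
import librosa

def augment(wav: np.ndarray, sr: int) -> np.ndarray:
    """Apply the combined noise/pitch augmentation to one utterance."""
    n_steps = np.random.uniform(-2.0, 2.0)                      # pitch shift in semitones
    wav = librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
    noise_level = np.random.uniform(0.001, 0.01)                # additive white noise
    wav = wav + noise_level * np.random.randn(len(wav))
    return wav.astype(np.float32)

wav, sr = librosa.load(librosa.example("trumpet"), sr=16000)    # stand-in for target-language speech
augmented = augment(wav, sr)                                    # used to grow the SSRL pre-training set
```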
Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring semantic similarity, it is better to directly predict the similarity using a model fine-tuned for that task. Using a model fine-tuned on STS-B from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations of a robust semantic similarity measure than other approaches.
https://arxiv.org/abs/2309.12697
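A minimal sketch of the STSScore idea using a publicly available cross-encoder fine-tuned on STS-B (the specific checkpoint is an assumption for illustration, not necessarily the one used in the paper): the similarity is predicted directly by the fine-tuned model rather than derived from n-gram overlap or raw embeddings.

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder fine-tuned on STS-B works as the backbone; this checkpoint is illustrative.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

def sts_score(text_a: str, text_b: str) -> float:
    """Predicted semantic similarity, scaled to [0, 1] by the STS-B cross-encoder."""
    return float(model.predict([(text_a, text_b)])[0])

print(sts_score("The cat sat on the mat.", "A cat is sitting on a mat."))
```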
Mixup is an effective data augmentation method that generates new augmented samples by taking linear combinations of different original samples. However, if there are noises or aberrant features in the original samples, Mixup may propagate them to the augmented samples, leading to over-sensitivity of the model to these outliers. To solve this problem, this paper proposes a new Mixup method called AMPLIFY. This method uses the Attention mechanism of the Transformer itself to reduce the influence of noises and aberrant values in the original samples on the prediction results, without adding additional trainable parameters, and at very low computational cost, thereby avoiding the high resource consumption of common Mixup methods such as Sentence Mixup. The experimental results show that, at a smaller computational resource cost, AMPLIFY outperforms other Mixup methods in text classification tasks on 7 benchmark datasets, providing new ideas and new ways to further improve the performance of pre-trained models based on the Attention mechanism, such as BERT, ALBERT, RoBERTa, and GPT. Our code can be obtained at this https URL.
https://arxiv.org/abs/2309.12689
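For reference, a minimal sketch of the representation-level Mixup that AMPLIFY builds on: two samples' hidden states and labels are linearly interpolated with a Beta-sampled coefficient. AMPLIFY's attention-based reweighting of noisy features is not reproduced here; this only illustrates the baseline operation the abstract describes.

```python
import torch

def mixup(hidden_a: torch.Tensor, hidden_b: torch.Tensor,
          label_a: torch.Tensor, label_b: torch.Tensor, alpha: float = 0.2):
    """Linear interpolation of two samples' representations and labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed_hidden = lam * hidden_a + (1.0 - lam) * hidden_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_hidden, mixed_label

# Toy usage on sentence-level hidden states (hidden size 768) with one-hot labels.
h = torch.randn(2, 768)
y = torch.nn.functional.one_hot(torch.tensor([0, 1]), num_classes=3).float()
mixed_h, mixed_y = mixup(h[0], h[1], y[0], y[1])
print(mixed_h.shape, mixed_y)
```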
Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, and various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86%; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14%; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrate that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.
https://arxiv.org/abs/2309.12676
Answering numerical questions over hybrid content from given tables and text (TextTableQA) is a challenging task. Recently, Large Language Models (LLMs) have gained significant attention in the NLP community. With the emergence of large language models, In-Context Learning and Chain-of-Thought prompting have become two particularly popular research topics in this field. In this paper, we introduce a new prompting strategy, called the Hybrid prompt strategy and Retrieval of Thought, for TextTableQA. Through In-Context Learning, we prompt the model to develop retrieval thinking when dealing with hybrid data. Our method achieves superior performance compared to the fully-supervised SOTA on the MultiHiertt dataset in the few-shot setting.
https://arxiv.org/abs/2309.12669
Recent advancements in Natural Language Processing (NLP) have highlighted the potential of sentence embeddings in measuring semantic similarity. Yet their application in analyzing real-world dyadic interactions and predicting the affect of conversational participants remains largely uncharted. To bridge this gap, the present study utilizes verbal conversations within 50 married couples talking about conflicts and pleasant activities. The Transformer-based model all-MiniLM-L6-v2 was employed to obtain the embeddings of the utterances from each speaker. The overall similarity of the conversation was then quantified by the average cosine similarity between the embeddings of adjacent utterances. Results showed that semantic similarity had a positive association with wives' affect during conflict (but not pleasant) conversations. Moreover, this association was not observed for husbands' affect regardless of conversation type. Two validation checks further provided support for the validity of the similarity measure and showed that the observed patterns were not mere artifacts of the data. The present study underscores the potency of sentence embeddings in understanding the association between interpersonal dynamics and individual affect, paving the way for innovative applications in affective and relationship sciences.
https://arxiv.org/abs/2309.12646
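A minimal sketch of the similarity measure described above, using the named all-MiniLM-L6-v2 model from sentence-transformers: embed each utterance, take the cosine similarity of adjacent pairs, and average. The toy dialogue is illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def conversation_similarity(utterances: list[str]) -> float:
    """Mean cosine similarity between embeddings of adjacent utterances."""
    embeddings = model.encode(utterances, convert_to_tensor=True, normalize_embeddings=True)
    sims = [float(util.cos_sim(embeddings[i], embeddings[i + 1]))
            for i in range(len(embeddings) - 1)]
    return sum(sims) / len(sims)

dialogue = [
    "I felt like you weren't listening to me yesterday.",
    "I was listening, I was just exhausted after work.",
    "It would help if you told me that instead of going quiet.",
]
print(conversation_similarity(dialogue))
```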
Neural language models often fail to generate diverse and informative texts, limiting their applicability in real-world problems. While previous approaches have proposed to address these issues by identifying and penalizing undesirable behaviors (e.g., repetition, overuse of frequent words) from language models, we propose an alternative approach based on an observation: models primarily learn attributes within examples that are likely to cause degeneration problems. Based on this observation, we propose a new approach to prevent degeneration problems by training two models. Specifically, we first train a model that is designed to amplify undesirable patterns. We then enhance the diversity of the second model by focusing on patterns that the first model fails to learn. Extensive experiments on two tasks, namely language modeling and dialogue generation, demonstrate the effectiveness of our approach.
https://arxiv.org/abs/2309.12619
Language models (LMs) are no longer restricted to the ML community, and instruction-tuned LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that an understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a popular practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 ML models that cover crucial aspects of the models, such as their training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original papers. Further, we explore the capabilities of LMs in generating model cards by answering questions. Our initial experiments with ChatGPT-3.5, LLaMa, and Galactica showcase a significant gap in these LMs' understanding of research papers as well as in generating factual textual responses. We posit that our dataset can be used to train models to automate the generation of model cards from paper text and reduce human effort in the model card curation process. The complete dataset is available on this https URL
https://arxiv.org/abs/2309.12616
Text simplification is a common task where a text is adapted to make it easier to understand. Similarly, text elaboration can make a passage more sophisticated, offering a method to control the complexity of reading comprehension tests. However, text simplification and elaboration tasks can only alter the readability of a text in a relative sense. It is useful to directly modify the readability of any text to an absolute target readability level to cater to a diverse audience. Ideally, the readability of readability-controlled generated text should be independent of the source text. Therefore, we propose a novel readability-controlled text modification task. The task requires the generation of 8 versions at various target readability levels for each input text. We introduce novel readability-controlled text modification metrics. The baselines for this task use ChatGPT and Llama-2, with an extension approach introducing a two-step process (generating paraphrases by passing through the language model twice). The zero-shot approaches are able to push the readability of the paraphrases in the desired direction, but the final readability remains correlated with the original text's readability. We also find greater drops in semantic and lexical similarity between the source and target texts with greater shifts in readability.
https://arxiv.org/abs/2309.12551
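A hedged sketch of readability-controlled modification with a zero-shot prompt and the two-step (paraphrase-twice) extension, using Flesch-Kincaid grade level via textstat as the readability measure; the prompt wording, model choice, and readability metric are assumptions here, not the paper's own.

```python
import textstat
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def rewrite_to_grade(text: str, target_grade: int) -> str:
    """One zero-shot rewrite toward an absolute readability level (FKGL)."""
    prompt = (f"Rewrite the passage so that it reads at a US grade level of {target_grade} "
              f"(Flesch-Kincaid), preserving its meaning.\n\nPassage:\n{text}")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

source = "Photosynthesis converts light energy into chemical energy stored in glucose."
candidate = rewrite_to_grade(source, target_grade=5)
# Two-step extension: pass the first paraphrase through the language model again.
candidate = rewrite_to_grade(candidate, target_grade=5)
print(textstat.flesch_kincaid_grade(source), textstat.flesch_kincaid_grade(candidate))
```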
Conventional automatic evaluation metrics, such as BLEU and ROUGE, developed for natural language generation (NLG) tasks, are based on measuring the n-gram overlap between the generated and reference text. These simple metrics may be insufficient for more complex tasks, such as question generation (QG), which requires generating questions that are answerable by the reference answers. Developing a more sophisticated automatic evaluation metric thus remains an urgent problem in QG research. This work proposes a Prompting-based Metric on ANswerability (PMAN), a novel automatic evaluation metric that assesses whether the generated questions are answerable by the reference answers for QG tasks. Extensive experiments demonstrate that its evaluation results are reliable and align with human evaluations. We further apply our metric to evaluate the performance of QG models, showing that our metric complements conventional metrics. Our implementation of a ChatGPT-based QG model achieves state-of-the-art (SOTA) performance in generating answerable questions.
https://arxiv.org/abs/2309.12546
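A minimal sketch of a prompting-based answerability check in the spirit of PMAN, assuming an OpenAI chat model as the backbone; the exact prompt wording and scoring protocol in the paper may differ.

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def answerability(question: str, reference_answer: str) -> int:
    """1 if the model judges the question answerable by the reference answer, else 0."""
    prompt = (f"Question: {question}\n"
              f"Reference answer: {reference_answer}\n"
              "Can the question be answered by the reference answer? Reply Yes or No.")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip().lower().startswith("yes"))

print(answerability("What year did the Apollo 11 mission land on the Moon?", "1969"))
```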
Instruction-tuned Large Language Models (It-LLMs) have exhibited outstanding abilities to reason about the cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively. In fact, several multiple-choice question (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities. However, earlier works have demonstrated the presence of an inherent "order bias" in It-LLMs, posing challenges to appropriate evaluation. In this paper, we investigate It-LLMs' resilience to a series of probing tests using four MCQ benchmarks. Introducing adversarial examples, we show a significant performance gap, mainly when varying the order of the choices, which reveals a selection bias and brings reasoning abilities into discussion. Observing a correlation between first positions and model choices due to positional bias, we hypothesize the presence of structural heuristics in the decision-making process of It-LLMs, strengthened by the inclusion of significant examples in few-shot scenarios. Finally, using the Chain-of-Thought (CoT) technique, we elicit the models to reason, mitigating the bias and obtaining more robust models.
https://arxiv.org/abs/2309.12481
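A small sketch of the kind of order-permutation probe the paper applies: the same MCQ is presented under every ordering of its options, and we measure how often the model's selected answer (mapped back to option text) stays the same. The `ask_llm` callable is a hypothetical stand-in for an instruction-tuned model.

```python
import itertools

def probe_order_bias(ask_llm, question: str, options: list[str]) -> float:
    """Fraction of orderings for which the model's chosen answer text stays the same.

    `ask_llm(prompt) -> str` is a hypothetical callable returning the letter the model picks.
    """
    letters = ["A", "B", "C", "D"]
    picks = []
    for perm in itertools.permutations(options):
        prompt = question + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, perm)) + "\nAnswer:"
        letter = ask_llm(prompt).strip()[:1].upper()
        picks.append(perm[letters.index(letter)])       # map the letter back to the option text
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / len(picks)

# Toy stand-in for a model that always answers "A" (a purely positional heuristic):
consistency = probe_order_bias(lambda prompt: "A",
                               "Which planet is known as the Red Planet?",
                               ["Venus", "Mars", "Jupiter", "Saturn"])
print(consistency)   # 0.25 -> the choice tracks position, not content
```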
Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, will probably drive this patient-centered transformation in healthcare. Through the provision of various services, including diagnosis, personalized lifestyle recommendations, and mental health support, the objective is to substantially augment patient health outcomes, all the while mitigating the workload burden on healthcare providers. The life-critical nature of healthcare applications necessitates establishing a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for various generic large language models (LLMs) demonstrate a lack of comprehension regarding medical and health concepts and their significance in promoting patients' well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. Subsequently, we present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass an evaluation of language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we discuss the challenges associated with defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.
https://arxiv.org/abs/2309.12444
Large language models (LLMs) have demonstrated impressive capabilities in natural language generation. However, their output quality can be inconsistent, posing challenges for generating natural language from logical forms (LFs). This task requires the generated outputs to embody the exact semantics of LFs, without missing any LF semantics or creating any hallucinations. In this work, we tackle this issue by proposing a novel generate-and-rerank approach. Our approach involves initially generating a set of candidate outputs by prompting an LLM and subsequently reranking them using a task-specific reranker model. In addition, we curate a manually collected dataset to evaluate the alignment between different ranking metrics and human judgements. The chosen ranking metrics are utilized to enhance the training and evaluation of the reranker model. By conducting extensive experiments on three diverse datasets, we demonstrate that the candidates selected by our reranker outperform those selected by baseline methods in terms of semantic consistency and fluency, as measured by three comprehensive metrics. Our findings provide strong evidence for the effectiveness of our approach in improving the quality of generated outputs.
https://arxiv.org/abs/2309.12294
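A minimal sketch of the generate-and-rerank pipeline, assuming an OpenAI chat model for candidate generation and a public STS cross-encoder as a stand-in scorer; the paper trains a task-specific reranker and uses its own ranking metrics.

```python
from openai import OpenAI
from sentence_transformers import CrossEncoder

client = OpenAI()                                        # assumes OPENAI_API_KEY is set
# Stand-in reranker: a public STS cross-encoder; the paper trains a task-specific one.
reranker = CrossEncoder("cross-encoder/stsb-roberta-base")

def generate_and_rerank(logical_form: str, n_candidates: int = 5) -> str:
    """Sample several verbalizations of a logical form, then keep the best-scoring one."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Express this logical form in fluent natural language:\n{logical_form}"}],
        n=n_candidates,
        temperature=0.8,
    )
    candidates = [choice.message.content.strip() for choice in response.choices]
    # Score each candidate against the logical form and return the highest-scoring one.
    scores = reranker.predict([(logical_form, c) for c in candidates])
    return candidates[int(scores.argmax())]

print(generate_and_rerank("count(flights(from=Boston, to=Denver, day=Monday))"))
```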