Numerous recent works aim to enhance the efficacy of Large Language Models (LLMs) through strategic prompting. In particular, the Optimization by PROmpting (OPRO) approach provides state-of-the-art performance by leveraging LLMs as optimizers, where the optimization task is to find instructions that maximize task accuracy. In this paper, we revisit OPRO for automated prompting with relatively small-scale LLMs, such as the LLaMa-2 family and Mistral 7B. Our investigation reveals that OPRO shows limited effectiveness in small-scale LLMs, as their limited inference capabilities constrain optimization ability. We suggest that future automatic prompt engineering consider both model capabilities and computational costs. Additionally, for small-scale LLMs, we recommend direct instructions that clearly outline objectives and methodologies as robust prompt baselines, ensuring efficient and effective prompt engineering in ongoing research.
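As a toy illustration (not the paper's code), the OPRO-style loop the abstract refers to can be sketched as: an optimizer LLM proposes candidate instructions, each candidate is scored on a small task set, and the best-scoring candidates seed the next round. The proposer and scorer below are stand-ins for real LLM calls.

```python
def propose(history):
    """Stand-in for the optimizer LLM: mutate the best instruction so far."""
    best_instr, _ = max(history, key=lambda pair: pair[1])
    return [best_instr + " step by step", best_instr + " carefully"]

def score(instruction, tasks):
    """Stand-in for task accuracy: here, reward more explicit instructions."""
    return sum(1 for t in tasks if len(instruction) > t) / len(tasks)

def opro_loop(seed_instruction, tasks, iterations=3):
    # Keep a running history of (instruction, score) pairs, as OPRO keeps
    # a scored trajectory of past instructions in the optimizer prompt.
    history = [(seed_instruction, score(seed_instruction, tasks))]
    for _ in range(iterations):
        for candidate in propose(history):
            history.append((candidate, score(candidate, tasks)))
    return max(history, key=lambda pair: pair[1])[0]

best = opro_loop("Solve the problem", tasks=[10, 20, 30])
```

With a real LLM, `propose` would format the scored history into a meta-prompt and sample new instructions; the toy `score` merely makes the loop observable.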
https://arxiv.org/abs/2405.10276
Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports automatic results, accompanied by a detailed analysis.
https://arxiv.org/abs/2405.10251
Trigger points are a concept introduced by Mau, Lux, and Westheuser (2023) to study qualitative focus group interviews and understand polarisation in Germany. When people communicate, trigger points represent moments when individuals feel that their understanding of what is fair, normal, or appropriate in society is questioned. In the original studies, individuals react affectively to such triggers and show strong and negative emotional responses. In this paper, we introduce the first systematic study of the large-scale effect of individual words as trigger points by analysing a large amount of social media posts. We examine online deliberations on Reddit between 2020 and 2022 and collect >100 million posts from subreddits related to a set of words identified as trigger points in UK politics. We find that such trigger words affect user engagement and have noticeable consequences on animosity in online discussions. We share empirical evidence of trigger words causing animosity, and of how they provide incentives for hate speech, adversarial debates, and disagreements. Our work is the first to introduce trigger points to computational studies of online communication. Our findings are relevant to researchers interested in online harms who examine how citizens debate politics and society in light of affective polarisation.
https://arxiv.org/abs/2405.10213
In this paper, we introduce a novel psychological benchmark, CPsyExam, constructed from questions sourced from Chinese language examinations. CPsyExam is designed to prioritize psychological knowledge and case analysis separately, recognizing the significance of applying psychological knowledge to real-world scenarios. From the pool of 22k questions, we utilize 4k to create the benchmark, which offers balanced coverage of subjects and incorporates a diverse range of case analysis techniques. Furthermore, we evaluate a range of existing large language models~(LLMs), spanning open-source to API-based models. Our experiments and analysis demonstrate that CPsyExam serves as an effective benchmark for enhancing the understanding of psychology within LLMs and enables the comparison of LLMs across various granularities.
https://arxiv.org/abs/2405.10212
Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20-49. Although the generated speech is intelligible, it is still of lower quality than the model trained on studio-grade recordings. This is due to insufficient data preprocessing applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations, as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can improve by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their voice recordings. In addition to trimming out silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to filter recordings with an estimated MOS over 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
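The preprocessing described in the abstract (trim silent edges, then keep only clips whose estimated MOS exceeds 3.5) can be sketched as below. The clips and the MOS estimator are toy stand-ins; the paper uses pre-trained speech-enhancement and MOS-estimation models on real audio.

```python
def trim_silence(samples, threshold=0.01):
    """Drop near-zero samples from the start and end of a recording."""
    start = 0
    while start < len(samples) and abs(samples[start]) < threshold:
        start += 1
    end = len(samples)
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

def filter_by_mos(recordings, estimate_mos, threshold=3.5):
    """Keep recordings whose estimated MOS exceeds the quality threshold."""
    return [r for r in recordings if estimate_mos(r) > threshold]

# Toy "waveforms": leading/trailing zeros stand in for silence.
clips = [[0.0, 0.0, 0.3, -0.2, 0.0], [0.0, 0.05, 0.0]]
trimmed = [trim_silence(c) for c in clips]
# Toy MOS estimator (longer trimmed clip -> higher score), for illustration only.
kept = filter_by_mos(trimmed, estimate_mos=lambda c: 3.0 + 0.4 * len(c))
```

In practice the MOS estimator would be a pre-trained non-intrusive model scoring each enhanced recording; only the filtering logic carries over.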
https://arxiv.org/abs/2405.10211
Scientific document summarization has been a challenging task due to the long structure of the input text. The long input hinders the simultaneous effective modeling of both global high-order relations between sentences and local intra-sentence relations, which is the most critical step in extractive summarization. However, existing methods mostly focus on one type of relation, neglecting the simultaneous effective modeling of both relations, which can lead to insufficient learning of semantic representations. In this paper, we propose HAESum, a novel approach utilizing graph neural networks to locally and globally model documents based on their hierarchical discourse structure. First, intra-sentence relations are learned using a local heterogeneous graph. Subsequently, a novel hypergraph self-attention layer is introduced to further enhance the characterization of high-order inter-sentence relations. We validate our approach on two benchmark datasets, and the experimental results demonstrate the effectiveness of HAESum and the importance of considering hierarchical structures in modeling long scientific documents. Our code will be available at \url{this https URL}.
https://arxiv.org/abs/2405.10202
The rapid evolution of large language models (LLMs) has ushered in the need for comprehensive assessments of their performance across various dimensions. In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which aims to evaluate the capability of LLMs in long fiction comprehension and reasoning. We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. Additionally, we conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations. Through a series of experiments with various state-of-the-art LLMs, we demonstrate that these models face considerable challenges in effectively addressing questions related to literary fictions, with ChatGPT reaching only 57.08% under the zero-shot setting. The dataset will be publicly available at this https URL.
https://arxiv.org/abs/2405.10166
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers, to enhance their abilities to perform both general and special-purpose dialogue tasks. However, the ability to personalize generated utterances to speakers, whether conducted by humans or LLMs, has not been well studied. To bridge this gap, our study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker. To this end, we assemble a large dataset collection encompassing thousands of speakers and their utterances. We also develop and evaluate speaker verification models under various experimental setups. We further utilize the speaker verification models to evaluate the personalization abilities of LLM-based role-playing models. Comprehensive experiments suggest that the current role-playing models fail to accurately mimic speakers, primarily due to their inherent linguistic characteristics.
https://arxiv.org/abs/2405.10150
In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish. The PL-MTEB consists of 28 diverse NLP tasks from 5 task types. We adapted the tasks based on datasets previously used by the Polish NLP community. In addition, we created a new PLSC (Polish Library of Science Corpus) dataset consisting of titles and abstracts of scientific publications in Polish, which was used as the basis for two novel clustering tasks. We evaluated 15 publicly available models for text embedding, including Polish and multilingual ones, and collected detailed results for individual tasks and aggregated results for each task type and the entire benchmark. PL-MTEB comes with open-source code at this https URL.
https://arxiv.org/abs/2405.10138
Over the past century, the Turkish language has undergone substantial changes, primarily driven by governmental interventions. In this work, our goal is to investigate the evolution of the Turkish language since the establishment of Türkiye in 1923. Thus, we first introduce Turkronicles, a diachronic corpus for Turkish derived from the Official Gazette of Türkiye. Turkronicles contains 45,375 documents detailing governmental actions, making it a pivotal resource for analyzing the linguistic evolution influenced by state policies. In addition, we expand an existing diachronic Turkish corpus, which consists of the records of the Grand National Assembly of Türkiye, by covering additional years. Next, combining these two diachronic corpora, we seek answers to two main research questions: How has the Turkish vocabulary changed since the 1920s, and how have writing conventions evolved? Our analysis reveals that the vocabularies of two different time periods diverge more as the time between them increases, and newly coined Turkish words take the place of their old counterparts. We also observe changes in writing conventions. In particular, the use of the circumflex noticeably decreases, and words ending with the letters "-b" and "-d" are progressively replaced with "-p" and "-t", respectively. Overall, this study quantitatively highlights the dramatic changes in Turkish from various aspects of the language in a diachronic perspective.
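The final-consonant convention described in the abstract ("-b" → "-p", "-d" → "-t" in word-final position) can be illustrated with a tiny rule, shown below. The word list is a hand-picked toy example, not data drawn from the Turkronicles corpus; kitab→kitap and dörd→dört are attested modernizations, the rule itself is a simplification.

```python
def modernize_final_consonant(word):
    """Apply the word-final devoicing convention: -b -> -p, -d -> -t."""
    if word.endswith("b"):
        return word[:-1] + "p"
    if word.endswith("d"):
        return word[:-1] + "t"
    return word

# Older spellings alongside a word ("ev") that the rule leaves untouched.
words = ["kitab", "mektub", "dörd", "ev"]
modern = [modernize_final_consonant(w) for w in words]
```

A corpus study like the paper's would instead count the relative frequency of each spelling variant per year rather than rewrite words.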
https://arxiv.org/abs/2405.10133
Integrating multimodal knowledge into large language models (LLMs) represents a significant advancement in dialogue generation capabilities. However, the effective incorporation of such knowledge in zero-resource scenarios remains a substantial challenge due to the scarcity of diverse, high-quality dialogue datasets. To address this, we propose the Visual Implicit Knowledge Distillation Framework (VIKDF), an innovative approach aimed at enhancing LLMs for enriched dialogue generation in zero-resource contexts by leveraging implicit multimodal knowledge. VIKDF comprises two main stages: knowledge distillation, using an Implicit Query Transformer to extract and encode visual implicit knowledge from image-text pairs into knowledge vectors; and knowledge integration, employing a novel Bidirectional Variational Information Fusion technique to seamlessly integrate these distilled vectors into LLMs. This enables the LLMs to generate dialogues that are not only coherent and engaging but also exhibit a deep understanding of the context through implicit multimodal cues, effectively overcoming the limitations of zero-resource scenarios. Our extensive experimentation across two dialogue datasets shows that VIKDF outperforms existing state-of-the-art models in generating high-quality dialogues. The code will be publicly available following acceptance.
https://arxiv.org/abs/2405.10121
LLM watermarking, which embeds imperceptible yet algorithmically detectable signals in model outputs to identify LLM-generated text, has become crucial in mitigating the potential misuse of large language models. However, the abundance of LLM watermarking algorithms, their intricate mechanisms, and the complex evaluation procedures and perspectives pose challenges for researchers and the community to easily experiment with, understand, and assess the latest advancements. To address these issues, we introduce MarkLLM, an open-source toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework for implementing LLM watermarking algorithms, while providing user-friendly interfaces to ensure ease of access. Furthermore, it enhances understanding by supporting automatic visualization of the underlying mechanisms of these algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools spanning three perspectives, along with two types of automated evaluation pipelines. Through MarkLLM, we aim to support researchers while improving the comprehension and involvement of the general public in LLM watermarking technology, fostering consensus and driving further advancements in research and application. Our code is available at this https URL.
https://arxiv.org/abs/2405.10051
Classifying public tenders is a useful task both for companies that are invited to participate and for inspecting fraudulent activities. To facilitate the task for both participants and public administrations, the European Union presented a common taxonomy (\textit{Common Procurement Vocabulary}, CPV) which is mandatory for tenders of certain importance; however, the contracts in which a CPV label is mandatory are a minority compared to all Public Administration activities. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. First of all, some fine-grained classes have an insufficient (if any) number of observations in the training set, while other classes are far more frequent (even thousands of times more) than the average. To overcome those difficulties, we present a zero-shot approach, based on a pre-trained language model, that relies only on label descriptions and respects the label taxonomy. To train our proposed model, we used industrial data, which comes from \url{this http URL}, a service by \href{this https URL}{SpazioDati s.r.l}. that collects public contracts stipulated in Italy in the last 25 years. Results show that the proposed model achieves better performance in classifying low-frequency classes compared to three different baselines, and is also able to predict never-seen classes.
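A minimal sketch of classification from label descriptions only, which is what lets never-seen classes be predicted: score each label by the overlap between the tender text and the label's description, with no class-specific training. The scoring here is crude token overlap for illustration; the paper uses a pre-trained language model, and the CPV-style codes and descriptions below are illustrative.

```python
# Toy CPV-style labels with short textual descriptions (illustrative only).
LABEL_DESCRIPTIONS = {
    "45000000": "construction work for buildings and roads",
    "33600000": "pharmaceutical products and medicines",
}

def zero_shot_classify(text):
    """Pick the label whose description shares the most tokens with the text."""
    tokens = set(text.lower().split())

    def overlap(desc):
        return len(tokens & set(desc.split()))

    return max(LABEL_DESCRIPTIONS, key=lambda lbl: overlap(LABEL_DESCRIPTIONS[lbl]))

pred = zero_shot_classify("Tender for construction of new roads")
```

Because scoring depends only on the description, adding a new class is just adding a new entry to the dictionary, with no retraining.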
https://arxiv.org/abs/2405.09983
Toxicity mitigation consists in rephrasing text in order to remove offensive or harmful meaning. Neural natural language processing (NLP) models have been widely used to target and mitigate textual toxicity. However, existing methods fail to detoxify text while preserving the initial non-toxic meaning at the same time. In this work, we propose to apply counterfactual generation methods from the eXplainable AI (XAI) field to target and mitigate textual toxicity. In particular, we perform text detoxification by applying local feature importance and counterfactual generation methods to a toxicity classifier distinguishing between toxic and non-toxic texts. We carry out text detoxification through counterfactual generation on three datasets and compare our approach to three competitors. Automatic and human evaluations show that recently developed NLP counterfactual generators can mitigate toxicity accurately while better preserving the meaning of the initial text as compared to classical detoxification methods. Finally, we take a step back from using automated detoxification tools, and discuss how to manage the polysemous nature of toxicity and the risk of malicious use of detoxification tools. This work is the first to bridge the gap between counterfactual generation and text detoxification and paves the way towards more practical application of XAI methods.
https://arxiv.org/abs/2405.09948
Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is undesirable because pretraining consumes a large computation budget. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its accompanying tokenizer. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We applied TransMI to three recent strong mPLMs, and our experiments demonstrate that TransMI not only preserves their ability to handle non-transliterated data, but also enables the models to effectively process transliterated data: the results show a consistent improvement of 3% to 34%, varying across different models and tasks. We make our code and models publicly available at \url{this https URL}.
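The three TransMI stages can be sketched on a miniature vocabulary: (a) transliterate each subword into a common script, (b) merge new subwords into the vocabulary, and (c) initialize each new subword's embedding from its source subword (copy-initialization here is one simple choice, not necessarily the paper's exact scheme). The transliteration table and embeddings are made up for illustration.

```python
# Toy Cyrillic -> Latin transliteration table (illustrative only).
translit = {"привет": "privet", "мир": "mir"}

def transmi(vocab, embeddings):
    new_vocab = dict(vocab)
    new_embeddings = dict(embeddings)
    for subword, vec in embeddings.items():
        romanized = translit.get(subword, subword)  # (a) transliterate
        if romanized not in new_vocab:              # (b) merge only if new
            new_vocab[romanized] = len(new_vocab)
            new_embeddings[romanized] = list(vec)   # (c) initialize from source
    return new_vocab, new_embeddings

vocab = {"привет": 0, "мир": 1, "mir": 2}
emb = {"привет": [0.1, 0.2], "мир": [0.3, 0.4], "mir": [0.5, 0.6]}
new_vocab, new_emb = transmi(vocab, emb)
```

Note that "мир" transliterates to "mir", which already exists, so it is not merged; this is why the original model's behavior on non-transliterated data is preserved.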
https://arxiv.org/abs/2405.09913
While neural approaches using deep learning are the state-of-the-art for natural language processing (NLP) today, pre-neural algorithms and approaches still find a place in NLP textbooks and courses of recent years. In this paper, we compare two introductory NLP courses taught in Australia and India, and examine how Transformer and pre-neural approaches are balanced within the lecture plan and assessments of the courses. We also draw parallels with the objects-first and objects-later debate in CS1 education. We observe that pre-neural approaches add value to student learning by building an intuitive understanding of NLP problems, potential solutions and even Transformer-based models themselves. Despite pre-neural approaches not being state-of-the-art, the paper makes a case for their inclusion in NLP courses today.
https://arxiv.org/abs/2405.09854
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.
https://arxiv.org/abs/2405.09818
Traditional security mechanisms isolate resources from users who should not access them. We reflect the compositional nature of such security mechanisms back into the structure of LLMs to build a provably secure LLM, which we term SecureLLM. Other approaches to LLM safety attempt to protect against bad actors or bad outcomes, but can only do so to an extent, making them inappropriate for sensitive data. SecureLLM blends access security with fine-tuning methods. Each data silo has associated with it a separate fine-tuning, and a user has access only to the collection of fine-tunings that they have permission for. The model must then perform compositional tasks at the intersection of those data silos with the combination of those individual fine-tunings. While applicable to any task like document QA or making API calls, in this work we concern ourselves with models that learn the layouts of new SQL databases to provide natural-language-to-SQL translation capabilities. Existing fine-tuning composition methods fail in this challenging environment, as they are not well-equipped for handling compositional tasks. Compositionality remains a challenge for LLMs. We contribute both a difficult new compositional natural-language-to-SQL translation task and a new perspective on LLM security that allows models to be deployed to secure environments today.
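The access pattern the abstract describes (one fine-tuning per data silo, with a user only ever composing the fine-tunings they are permitted to use) can be sketched as below. Silo names, permissions, and fine-tuning identifiers are hypothetical.

```python
# Hypothetical per-user silo permissions and per-silo fine-tuning identifiers.
PERMISSIONS = {
    "alice": {"hr_db", "finance_db"},
    "bob": {"hr_db"},
}
FINE_TUNINGS = {"hr_db": "ft-hr", "finance_db": "ft-fin", "legal_db": "ft-legal"}

def compose_model(user, requested_silos):
    """Return the fine-tunings to load, refusing any silo the user lacks."""
    allowed = PERMISSIONS.get(user, set())
    denied = set(requested_silos) - allowed
    if denied:
        # Access control happens before any model is assembled, so data from
        # an unpermitted silo can never influence the user's model.
        raise PermissionError(f"{user} lacks access to: {sorted(denied)}")
    return sorted(FINE_TUNINGS[s] for s in requested_silos)
```

The hard part the paper studies is not this gating but what follows it: making the loaded fine-tunings compose well on tasks spanning multiple silos.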
https://arxiv.org/abs/2405.09805
Discourse relation classification is an especially difficult task without explicit context markers \cite{Prasad2008ThePD}. Current approaches to implicit relation prediction solely rely on two neighboring sentences being targeted, ignoring the broader context of their surrounding environments \cite{Atwell2021WhereAW}. In this research, we propose three new methods in which to incorporate context in the task of sentence relation prediction: (1) Direct Neighbors (DNs), (2) Expanded Window Neighbors (EWNs), and (3) Part-Smart Random Neighbors (PSRNs). Our findings indicate that the inclusion of context beyond one discourse unit is harmful in the task of discourse relation classification.
https://arxiv.org/abs/2405.09735
To understand the complexity of global events, one must navigate a web of interwoven sub-events, identifying those most impactful elements within the larger, abstract macro-event framework at play. This concept can be extended to the field of natural language processing (NLP) through the creation of structured event schemas, which can serve as representations of these abstract events. Central to our approach is the Schema Curation Interface 3.0 (SCI 3.0), a web application that facilitates real-time editing of event schema properties within a generated graph, e.g., adding, removing, or editing sub-events, entities, and relations directly through an interface.
https://arxiv.org/abs/2405.09733