We present MM-Narrator, a novel system that leverages GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focus on downstream fine-tuning with short video clips, MM-Narrator excels at generating precise audio descriptions for videos of extensive length, even beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, enabling accurate tracking and depiction of story-coherent, character-centric audio descriptions. Preserving MM-Narrator's training-free design, we further propose a complexity-based demonstration selection strategy that substantially enhances its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both existing fine-tuning-based and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons about and scores AD generation performance along various extendable dimensions.
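The abstract does not spell out the register-and-recall mechanism, but its general shape can be sketched as a small vector store over past AD sentences: each generated description is registered with an embedding, and the most relevant entries are recalled as context for the next segment. A minimal, hypothetical Python sketch, assuming a generic `embed_fn` (all names here are illustrative, not the authors' implementation):

```python
import numpy as np

class NarrationMemory:
    """Toy register-and-recall store: keeps past AD sentences with their
    embeddings and recalls the top-k most relevant ones as context."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn              # text -> 1-D numpy vector
        self.entries, self.vectors = [], []

    def register(self, ad_text):
        self.entries.append(ad_text)
        self.vectors.append(self.embed_fn(ad_text))

    def recall(self, query_text, k=5):
        if not self.entries:
            return []
        q = self.embed_fn(query_text)
        mat = np.stack(self.vectors)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
        return [self.entries[i] for i in np.argsort(-sims)[:k]]
```

In an autoregressive pipeline, each newly generated AD would be registered, and the recalled entries would be prepended to the prompt for the next clip.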
https://arxiv.org/abs/2311.17435
For sensible progress in natural language processing, it is important to be aware of the limitations of the evaluation metrics we use. In this work, we evaluate how robust metrics are to non-standardized dialects, i.e., spelling differences in language varieties that lack a standard orthography. To investigate this, we collect a dataset of human translations and human judgments of automatic machine translations from English into two Swiss German dialects. We further create a challenge set for dialect variation and benchmark existing metrics' performance. Our results show that existing metrics cannot reliably evaluate Swiss German text generation outputs, especially at the segment level. We propose initial design adaptations that increase robustness in the face of non-standardized dialects, although much room for further improvement remains. The dataset, code, and models are available here: this https URL
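To see why surface metrics struggle here, consider two plausible spellings of the same Swiss German sentence scored against a single reference: a character-level metric such as chrF (via sacrebleu) penalizes the alternative orthography even though the meaning is identical. The dialect spellings below are illustrative, not drawn from the paper's data:

```python
import sacrebleu

# Two valid spellings of the same Swiss German sentence (no standard
# orthography exists), scored against one reference spelling.
ref = ["Ich gang jetz hei."]
for hyp in ("Ich gang jetz hei.", "I gang jetzt hai."):
    print(hyp, "->", round(sacrebleu.sentence_chrf(hyp, ref).score, 1))
```

The second hypothesis is just as valid a rendering of the dialect, yet its score drops sharply, which is exactly the segment-level unreliability the paper measures.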
https://arxiv.org/abs/2311.16865
This paper introduces a novel approach to post-Optical Character Recognition correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap stems from the lack of large text corpora that provide OCR errors for training language-based POC models, which are demanding in terms of corpus size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. Such an engine generates highly realistic handwritten text in any quantity, which we use to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, using a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on the HWR200 and School_notebooks_RU datasets, as they pose significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers when evaluating student work, simply by comparing sentences before and after correction and displaying the differences. Our primary contribution lies in the innovative use of Bézier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by reporting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, on real open corpora of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advances in OCR and handwritten text analysis. Paper contributions can be found in this https URL
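The core primitive of such an engine, a parametric Bézier stroke, is easy to illustrate. A minimal sketch (not the paper's engine, which is necessarily more elaborate): sample a cubic Bézier curve and jitter its control points to vary the "handwriting":

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample n points along a cubic Bézier curve given four 2-D
    control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Perturbing the control points produces natural-looking variation,
# which is what lets the engine emit unlimited distinct samples.
rng = np.random.default_rng(0)
base = [(0, 0), (10, 25), (25, -15), (40, 5)]
stroke = cubic_bezier(*[np.asarray(p) + rng.normal(0, 1.5, 2) for p in base])
```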
https://arxiv.org/abs/2311.15896
Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this method LLMs for Vision because it employs LLMs for visual-language understanding, yet we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs by empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixtures-of-Multimodal Experts architecture in LLMs to invoke multimodal knowledge collaboration during generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on multimodal benchmarks.
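The soft Mixtures-of-Multimodal-Experts design is not specified in the abstract, but a generic soft-MoE block in PyTorch conveys the basic mechanism: a router assigns soft weights and every token's output is the weighted sum over all experts. Layer sizes and structure below are assumptions, not the MKS2 architecture:

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal soft mixture-of-experts block: the router produces soft
    gates and the output is the gate-weighted sum of all expert outputs."""

    def __init__(self, d_model, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                        # x: (batch, seq, d_model)
        gates = torch.softmax(self.router(x), dim=-1)             # (b, s, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (b, s, d, E)
        return (outs * gates.unsqueeze(2)).sum(dim=-1)            # (b, s, d)
```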
https://arxiv.org/abs/2311.15759
This study innovatively explores adaptive applications of large language models (LLMs) in urban renewal, aiming to improve their performance and text generation quality on knowledge question-answering (QA) tasks. Based on ChatGLM, we automatically generate QA datasets from urban renewal scientific literature corpora in a self-instruct manner and then jointly fine-tune the model using the Prefix and LoRA fine-tuning methods to create an LLM for urban renewal. By guiding the LLM to automatically generate QA data from prompt words and given text, datasets in the urban renewal field can be obtained quickly, providing data support for fine-tuning LLMs. The experimental results show that the joint fine-tuning method proposed in this study significantly improves the LLM's performance on QA tasks. Compared with LoRA fine-tuning alone, the method improves the Bleu and Rouge metrics on the test set by about 5%; compared with the model before fine-tuning, it improves them by about 15%-20%. This study demonstrates the effectiveness and superiority of joint Prefix and LoRA fine-tuning of ChatGLM for urban renewal knowledge QA tasks, and it provides a new approach for fine-tuning LLMs on urban renewal-related tasks.
https://arxiv.org/abs/2311.15490
Large language models (LLMs) have emerged as pivotal contributors to contemporary natural language processing and are increasingly being applied across a diverse range of industries. However, these large-scale probabilistic statistical models cannot currently ensure the requisite quality in professional content generation. They often produce hallucinated text, compromising their practical utility in professional contexts. To assess the real reliability of LLMs in text generation, numerous initiatives have developed benchmark evaluations for hallucination phenomena. Nevertheless, these benchmarks frequently rely on constrained generation techniques due to cost and time constraints, encompassing directed hallucination induction and strategies that deliberately alter authentic text to produce hallucinations. These approaches are not congruent with the unrestricted text generation demanded by real-world applications. Furthermore, a well-established Chinese-language dataset dedicated to the evaluation of hallucinations in text generation is presently lacking. Consequently, we have developed an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, designed to compile outputs produced by LLMs with minimal restrictions. Concurrently, we have established a comprehensive benchmark evaluation framework to aid subsequent researchers in undertaking scalable and reproducible experiments. We have also executed extensive experiments, evaluating prominent Chinese language models and the GPT series models to derive professional performance insights regarding hallucination challenges.
https://arxiv.org/abs/2311.15296
This document illustrates the use of pyrealb for generating two parallel texts (English and French) from a single source of data. The data selection and text organization processes are shared between the two languages; only language-dependent word and phrasing choices are distinct. The realized texts thus convey identical information in both languages, without the risk of anything being lost in translation. This is especially important in cases where strict and simultaneous bilingualism is required. We first present the types of applications targeted by this approach and show how the pyrealb English and French realizer can achieve this goal in a natural way. We describe an object-oriented organization that ensures convenient realization in both languages. To illustrate the process, different types of applications are then briefly sketched, with links to the source code. We conclude with a brief comparison of the generated text against the output of a GPT instance.
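A minimal sketch of the shared-data, per-language realization idea, assuming pyrealb's constructor-style API (loadEn/loadFr plus the S/NP/VP/... phrase builders); the lexicon entries used are illustrative:

```python
from pyrealb import *

def realize(lang, noun, verb):
    # Same data selection and sentence structure; only the loaded
    # language and the lexical choices differ.
    if lang == "en":
        loadEn()
        det = "the"
    else:
        loadFr()
        det = "le"
    return S(NP(D(det), N(noun)), VP(V(verb))).realize()

print(realize("en", "cat", "sleep"))     # e.g. "The cat sleeps."
print(realize("fr", "chat", "dormir"))   # e.g. "Le chat dort."
```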
https://arxiv.org/abs/2311.14808
As Large Language Models (LLMs) are deployed more widely, customization with respect to vocabulary, style, and character becomes more important. In this work we introduce model arithmetic, a novel inference framework for composing and biasing LLMs without the need for model (re)training or highly specific datasets. The framework also allows more precise control of generated text than direct prompting and prior controlled text generation (CTG) techniques. Using model arithmetic, we can express prior CTG techniques as simple formulas and naturally extend them to new, more effective formulations. Further, we show that speculative sampling, a technique for efficient LLM sampling, extends to our setting. This enables highly efficient text generation with multiple composed models, at only marginal overhead over a single model. Our empirical evaluation demonstrates that model arithmetic allows fine-grained control of generated text while outperforming the state of the art on the task of toxicity reduction.
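At its simplest, composing models at inference time means combining their next-token logits before sampling. The toy function below shows a weighted-sum composition; the actual framework supports much richer formulas (e.g., classifier terms and non-linear operators) than this sketch:

```python
import torch

def compose_and_sample(logit_fns, weights, input_ids):
    """Weighted-sum composition of several models' next-token logits.
    Each fn maps input_ids to a logits tensor over a shared vocabulary;
    e.g. weights (1.0, -0.4) bias generation away from the second model."""
    combined = sum(w * fn(input_ids) for fn, w in zip(logit_fns, weights))
    probs = torch.softmax(combined, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```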
https://arxiv.org/abs/2311.14479
Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing work, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
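For readers unfamiliar with DP-SGD, its core step, per-example gradient clipping followed by calibrated Gaussian noise, fits in a few lines of numpy. This is a didactic illustration, not the framework's implementation:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip=1.0, noise_mult=1.1):
    """One DP-SGD update: clip each example's gradient to L2 norm `clip`,
    average, and add Gaussian noise with std noise_mult * clip / batch."""
    batch = len(per_example_grads)
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    noise = np.random.normal(0.0, noise_mult * clip / batch,
                             size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```

It is precisely details like the clipping norm, noise multiplier, and sampling scheme that the framework aims to keep transparent across implementations.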
https://arxiv.org/abs/2311.14465
The latest advancements in large language models (LLMs) have revolutionized the field of natural language processing (NLP). Inspired by the success of LLMs in NLP tasks, some recent work has begun investigating the potential of applying LLMs to graph learning tasks. However, most existing work focuses on utilizing LLMs as powerful node feature augmenters, leaving the use of LLMs to enhance graph topological structure an understudied problem. In this work, we explore how to leverage the information retrieval and text generation capabilities of LLMs to refine and enhance the topological structure of text-attributed graphs (TAGs) in the node classification setting. First, we propose using LLMs to help remove unreliable edges and add reliable ones in the TAG: we let the LLM output the semantic similarity between node attributes through careful prompt design, and then perform edge deletion and edge addition based on this similarity. Second, we propose using pseudo-labels generated by the LLM to improve graph topology; that is, we introduce pseudo-label propagation as a regularization that guides the graph neural network (GNN) in learning proper edge weights. Finally, we incorporate these two LLM-based methods for graph topological refinement into the GNN training process and perform extensive experiments on four real-world datasets. The experimental results demonstrate the effectiveness of LLM-based graph topology refinement, achieving a 0.15%-2.47% performance gain on public benchmarks.
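The edge deletion/addition step can be pictured as thresholding an LLM-derived similarity over node pairs. A toy sketch in which `similarity(u, v)` stands in for the paper's prompted LLM scoring, node ids are assumed comparable, and both thresholds are assumptions:

```python
def refine_edges(edges, similarity, keep_thresh=0.3, add_thresh=0.9):
    """Drop edges whose endpoints the LLM scores as dissimilar; add
    edges between highly similar, currently unlinked node pairs."""
    nodes = {u for edge in edges for u in edge}
    kept = {(u, v) for u, v in edges if similarity(u, v) >= keep_thresh}
    added = {(u, v) for u in nodes for v in nodes
             if u < v and (u, v) not in kept
             and similarity(u, v) >= add_thresh}
    return kept | added
```

In the full method the refined topology then feeds GNN training, together with the pseudo-label propagation regularizer.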
https://arxiv.org/abs/2311.14324
Significant progress has been made in text generation by pre-trained language models (PLMs), yet distinguishing between human- and machine-generated text poses an escalating challenge. This paper offers an in-depth evaluation of three distinct methods used to address this task: traditional shallow learning, Language Model (LM) fine-tuning, and Multilingual Model fine-tuning. These approaches are rigorously tested on a wide range of machine-generated texts, providing a benchmark of their competence in distinguishing between human-authored and machine-authored linguistic constructs. The results reveal considerable differences in performance across methods, emphasizing the continued need for advancement in this crucial area of NLP. This study offers valuable insights and paves the way for future research aimed at creating robust and highly discriminative models.
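The traditional shallow-learning baseline in such comparisons is typically a bag-of-n-grams classifier. A self-contained scikit-learn sketch (the two-example training set is purely illustrative; a real study would use thousands of labeled texts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["the model was trained on a large corpus of web documents",
         "i genuinely loved the ending of that film, what a ride"]
labels = [1, 0]            # 1 = machine-generated, 0 = human-written

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
detector.fit(texts, labels)
print(detector.predict(["this text was produced by a language model"]))
```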
https://arxiv.org/abs/2311.12373
We study attribute control in language models through the method of Causal Average Treatment Effect (Causal ATE). Existing methods for the attribute control task in Language Models (LMs) check for the co-occurrence of words in a sentence with the attribute of interest and control for them. However, spurious correlations between words and the attribute in the training dataset can cause models to hallucinate the presence of the attribute when presented with the spurious correlate during inference. We show that the simple perturbation-based method of Causal ATE removes this unintended effect. Additionally, we offer a theoretical foundation for investigating Causal ATE in the classification task and prove that it reduces the number of false positives, thereby mitigating the issue of unintended bias. Specifically, we ground it in the problem of toxicity mitigation, where a significant challenge lies in the inadvertent bias that often emerges toward protected groups after detoxification. We show that this unintended bias can be resolved by the use of the Causal ATE metric.
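A perturbation-based ATE estimate can be read as the average change in an attribute classifier's probability when a candidate word is deleted. The sketch below is a simplified illustration of that idea, not the paper's exact estimator:

```python
def causal_ate(word, sentences, attr_prob):
    """Average treatment effect of `word` on the attribute: how much
    P(attribute | text) drops when the word is removed. `attr_prob`
    is any attribute classifier returning a probability."""
    effects = []
    for s in sentences:
        tokens = s.split()
        if word not in tokens:
            continue
        without = " ".join(t for t in tokens if t != word)
        effects.append(attr_prob(s) - attr_prob(without))
    return sum(effects) / len(effects) if effects else 0.0
```

A spuriously correlated word (e.g., a protected-group mention in toxicity data) shows a near-zero ATE despite high co-occurrence, which is why controlling on ATE rather than co-occurrence reduces false positives.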
https://arxiv.org/abs/2311.11229
The launch of ChatGPT has garnered global attention, marking a significant milestone in the field of generative artificial intelligence. While generative AI has existed for the past decade, the introduction of ChatGPT has ignited a new wave of research and innovation in the AI domain. This surge in interest has led to the development and release of numerous cutting-edge tools, such as Bard, Stable Diffusion, DALL-E, Make-A-Video, Runway ML, and Jukebox, among others. These tools exhibit remarkable capabilities on tasks ranging from text generation and music composition to image creation, video production, code generation, and even scientific work. They are built upon various state-of-the-art models, including Stable Diffusion, transformer models such as GPT-3 (and more recently GPT-4), variational autoencoders, and generative adversarial networks. This advancement in generative AI presents a wealth of exciting opportunities and, simultaneously, unprecedented challenges. Throughout this paper, we explore these state-of-the-art models, the diverse array of tasks they can accomplish, the challenges they pose, and the promising future of generative artificial intelligence.
https://arxiv.org/abs/2311.10242
Automatic hate speech detection using deep neural models is hampered by the scarcity of labeled datasets, leading to poor generalization. To mitigate this problem, generative AI has been utilized to generate large amounts of synthetic hate speech sequences from available labeled examples, leveraging the generated data to fine-tune large pre-trained language models (LLMs). In this chapter, we provide a review of relevant methods, experimental setups, and evaluations of this approach. In addition to general LLMs, such as BERT, RoBERTa, and ALBERT, we apply and evaluate the impact of train set augmentation with generated data using LLMs that have already been adapted for hate detection, including RoBERTa-Toxicity, HateBERT, HateXplain, ToxDect, and ToxiGen. An empirical study corroborates our previous findings, showing that this approach improves hate speech generalization, boosting recall performance across data distributions. In addition, we explore and compare the performance of the fine-tuned LLMs with zero-shot hate detection using a GPT-3.5 model. Our results demonstrate that while better generalization is achieved using the GPT-3.5 model, it achieves mediocre recall and low precision on most datasets. It remains an open question whether the sensitivity of GPT-3.5 and subsequent models can be improved using similar text generation techniques.
https://arxiv.org/abs/2311.09993
Text watermarking has emerged as an important technique for detecting machine-generated text. However, existing methods can severely degrade text quality due to arbitrary vocabulary partitioning, which disrupts the language model's expressiveness and impedes textual coherence. To mitigate this, we introduce XMark, a novel approach that capitalizes on text redundancy within the lexical space. Specifically, XMark incorporates a mutually exclusive rule for synonyms during the language model decoding process, thereby integrating prior knowledge into vocabulary partitioning and preserving the capabilities of language generation. We present theoretical analyses and empirical evidence demonstrating that XMark substantially enhances text generation fluency while maintaining watermark detectability. Furthermore, we investigate watermarking's impact on the emergent abilities of large language models, including zero-shot and few-shot knowledge recall, logical reasoning, and instruction following. Our comprehensive experiments confirm that XMark consistently outperforms existing methods in retaining these crucial capabilities of LLMs.
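One reading of the mutually exclusive rule is that each synonym set is deliberately split across the green and red lists, so biasing generation toward green tokens never removes every way of expressing a meaning. A toy partition under that assumption (not necessarily XMark's exact procedure):

```python
import random

def watermark_partition(vocab, synonym_sets, green_frac=0.5, seed=0):
    """Split each synonym set across the green/red lists, then split the
    remaining vocabulary randomly, as plain watermarking schemes do."""
    rng = random.Random(seed)
    green, red = set(), set()
    for group in synonym_sets:
        group = list(group)
        rng.shuffle(group)
        half = max(1, int(len(group) * green_frac))
        green.update(group[:half])       # at least one synonym stays green
        red.update(group[half:])
    rest = [w for w in vocab if w not in green and w not in red]
    rng.shuffle(rest)
    cut = int(len(rest) * green_frac)
    green.update(rest[:cut])
    red.update(rest[cut:])
    return green, red
```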
https://arxiv.org/abs/2311.09832
Table-to-text generation has traditionally been approached as a linear language-to-text problem. However, visually represented tables are rich in visual information and serve as a concise, effective form for representing data and its relationships. With text-based approaches, this information is either lost after linearization or represented in a space-inefficient manner. This inefficiency has remained a constant challenge for text-based approaches, making them struggle with large tables. In this paper, we demonstrate that image representations of tables are more space-efficient than typical textual linearizations, and that multimodal approaches are competitive in table-to-text tasks. We present PixT3, a multimodal table-to-text model that outperforms the state of the art (SotA) on the ToTTo benchmark in a pure table-to-text setting while remaining competitive in controlled table-to-text scenarios. It also generalizes better to unseen datasets, outperforming the ToTTo SotA in all generation settings. Additionally, we introduce a new intermediate training curriculum to reinforce table structural awareness, leading to improved generation and overall faithfulness of the models.
https://arxiv.org/abs/2311.09808
In the evaluation of medical text generation, it is essential to scrutinize each piece of information and ensure the utmost accuracy of the evaluation. Existing evaluation metrics either focus on coarse-level evaluation, assigning one score to the whole generated output, or rely on evaluation models trained on general-domain data, resulting in inaccuracies when adapted to the medical domain. To address these issues, we propose a set of factuality-centric evaluation aspects and design corresponding GPT-4-based metrics for medical text generation. We systematically compare these metrics with existing ones on clinical note generation and medical report summarization tasks, revealing low inter-metric correlation. A comprehensive human evaluation confirms that the proposed GPT-4-based metrics exhibit substantially higher agreement with human judgments than existing evaluation metrics. Our study contributes to the understanding of medical text generation evaluation and offers a more reliable alternative to existing metrics.
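An aspect-specific GPT-4 metric of this kind generally amounts to a structured prompt plus score parsing. The prompt and the `call_llm` client below are hypothetical placeholders, not the paper's actual prompts:

```python
FACTUALITY_PROMPT = """You are evaluating a generated clinical note.

Source record:
{source}

Generated note:
{candidate}

Check every statement in the note against the source record, then output
a single factuality score from 1 (mostly unsupported) to 5 (fully
supported) on the last line as: Score: <n>"""

def factuality_score(source, candidate, call_llm):
    """`call_llm` is whatever GPT-4 client wrapper is in use; it takes a
    prompt string and returns the model's text reply."""
    reply = call_llm(FACTUALITY_PROMPT.format(source=source,
                                              candidate=candidate))
    return int(reply.rsplit("Score:", 1)[-1].strip().split()[0].rstrip("."))
```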
https://arxiv.org/abs/2311.09581
Neural knowledge-to-text generation models often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the given facts, or describe facts not present in the input. To reduce hallucinations, we propose a novel decoding method, TWEAK (Think While Effectively Articulating Knowledge). TWEAK treats the generated sequence at each decoding step, together with its possible future continuations, as hypotheses, and ranks each generation candidate by how well the corresponding hypotheses support the input facts, using a Hypothesis Verification Model (HVM). We first demonstrate the effectiveness of TWEAK by using a Natural Language Inference (NLI) model as the HVM and report improved faithfulness with minimal impact on quality. We then replace the NLI model with a task-specific HVM trained on a first-of-its-kind dataset, FATE (Fact-Aligned Textual Entailment), which pairs input facts with their faithful and hallucinated descriptions, with the hallucinated spans marked. The new HVM further improves faithfulness and quality, and runs faster. Overall, the best TWEAK variants improve faithfulness, as measured by FactKB, by an average of 2.22/7.17 points on WebNLG and TekGen/GenWiki respectively, with only 0.14/0.32 points of quality degradation, as measured by BERTScore, on the same datasets. Since TWEAK is a decoding-only approach, it can be integrated with any neural generative model without retraining.
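The decoding-time ranking can be sketched as a rescoring function that mixes the generator's log-probability with the HVM's support score. A simplified illustration; the interpolation weight and candidate representation are assumptions:

```python
def tweak_rank(candidates, facts, hvm_score, alpha=0.5):
    """Rank decoding candidates by generator log-probability blended
    with hypothesis-verification support against the input facts.
    Each candidate is (hypothesis_text, logprob); hvm_score(facts, text)
    returns support in [0, 1], e.g. an NLI entailment probability."""
    def score(cand):
        text, logprob = cand
        return alpha * logprob + (1 - alpha) * hvm_score(facts, text)
    return sorted(candidates, key=score, reverse=True)
```

Because only the ranking of candidates changes, any beam-search generator can adopt this without retraining, which is the point of the decoding-only design.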
https://arxiv.org/abs/2311.09467
Using novel approaches to dataset development, the Biasly dataset captures the nuance and subtlety of misogyny in ways that are unique within the literature. Built in collaboration with multi-disciplinary experts and annotators themselves, the dataset contains annotations of movie subtitles, capturing colloquial expressions of misogyny in North American film. The dataset can be used for a range of NLP tasks, including classification, severity score regression, and text generation for rewrites. In this paper, we discuss the methodology used, analyze the annotations obtained, and provide baselines using common NLP algorithms in the context of misogyny detection and mitigation. We hope this work will promote AI for social good in NLP for bias detection, explanation, and removal.
https://arxiv.org/abs/2311.09443
Recent improvements in text generation have leveraged human feedback to improve the quality of generated output. However, human feedback is not always available, especially during inference. In this work, we propose FITO, an inference-time optimization method that uses fine-grained, actionable feedback, in the form of error type, error location, and severity level predicted by a learned error-pinpointing model, for iterative refinement. FITO starts with an initial output and then iteratively incorporates the feedback via a refinement model that generates an improved output conditioned on the feedback. Since refined samples are not guaranteed to improve consistently across iterations, we formulate iterative refinement as a local search problem and develop a simulated annealing based algorithm that balances exploration of the search space against optimization of output quality. We conduct experiments on three text generation tasks: machine translation, long-form question answering (QA), and topical summarization. With a single iteration of refinement, we observe gains of 0.8 and 0.7 MetricX on Chinese-English and English-German translation, and of 4.5 and 1.8 ROUGE-L on long-form QA and topical summarization, respectively. With our simulated annealing algorithm, we see further quality improvements, including up to 1.7 MetricX points over the baseline approach.
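The simulated annealing component follows the textbook recipe: always accept an improving refinement, accept a worse one with probability exp(delta/T), and cool the temperature. A minimal sketch with hypothetical `refine` (refinement model plus feedback) and `quality` callables:

```python
import math
import random

def anneal_refine(initial, refine, quality, steps=20, t0=1.0, cooling=0.85):
    """Local search over iterative refinements: occasionally accepting a
    worse candidate keeps the search from locking onto one trajectory."""
    current = best = initial
    temp = t0
    for _ in range(steps):
        candidate = refine(current)
        delta = quality(candidate) - quality(current)
        if delta >= 0 or random.random() < math.exp(delta / temp):
            current = candidate
            if quality(current) > quality(best):
                best = current
        temp *= cooling
    return best
```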
https://arxiv.org/abs/2311.09336