Dialogue data in real scenarios tends to be sparsely available, leaving data-starved end-to-end dialogue systems inadequately trained. We find that data utilization efficiency in low-resource scenarios can be enhanced by mining the alignment information between uncertain utterances and deterministic dialogue states. We therefore apply dual learning to task-oriented dialogues to exploit the correlation of heterogeneous data. In addition, the one-to-one duality is converted into a multijugate duality to reduce the influence of spurious correlations in dual training and improve generalization. Because it introduces no additional parameters, our method can be implemented in arbitrary networks. Extensive empirical analyses demonstrate that the proposed method improves the effectiveness of end-to-end task-oriented dialogue systems on multiple benchmarks and obtains state-of-the-art results in low-resource scenarios.
https://arxiv.org/abs/2305.16106
Longitudinal Dialogues (LD) are the most challenging type of conversation for human-machine dialogue systems. LDs include the recollections of events, personal thoughts, and emotions specific to each individual in a sparse sequence of dialogue sessions. Dialogue systems designed for LDs should uniquely interact with the users over multiple sessions and long periods of time (e.g. weeks), and engage them in personal dialogues to elaborate on their feelings, thoughts, and real-life events. In this paper, we study the task of response generation in LDs. We evaluate whether general-purpose Pre-trained Language Models (PLM) are appropriate for this purpose. We fine-tune two PLMs, GePpeTto (GPT-2) and iT5, using a dataset of LDs. We experiment with different representations of the personal knowledge extracted from LDs for grounded response generation, including the graph representation of the mentioned events and participants. We evaluate the performance of the models via automatic metrics and the contribution of the knowledge via the Integrated Gradients technique. We categorize the natural language generation errors via human evaluations of contextualization, appropriateness and engagement of the user.
https://arxiv.org/abs/2305.15908
Efficient utilisation of both intra- and extra-textual context remains one of the critical gaps between machine and human translation. Existing research has primarily focused on providing individual, well-defined types of context in translation, such as the surrounding text or discrete external variables like the speaker's gender. This work introduces MTCue, a novel neural machine translation (NMT) framework that interprets all context (including discrete variables) as text. MTCue learns an abstract representation of context, enabling transferability across different data settings and leveraging similar attributes in low-resource scenarios. With a focus on a dialogue domain with access to document and metadata context, we extensively evaluate MTCue in four language pairs in both translation directions. Our framework demonstrates significant improvements in translation quality over a parameter-matched non-contextual baseline, as measured by BLEU (+0.88) and Comet (+1.58). Moreover, MTCue significantly outperforms a "tagging" baseline at translating English text. Analysis reveals that the context encoder of MTCue learns a representation space that organises context based on specific attributes, such as formality, enabling effective zero-shot control. Pre-training on context embeddings also improves MTCue's few-shot performance compared to the "tagging" baseline. Finally, an ablation study conducted on model components and contextual variables further supports the robustness of MTCue for context-based NMT.
https://arxiv.org/abs/2305.15904
Recent years have seen increasing concerns about the private inference of NLP services and Transformer models. However, existing two-party privacy-preserving methods consider only NLU scenarios, while the private inference of text generation tasks such as translation, dialogue, and code completion remains unsolved. Moreover, when migrated to NLG models, existing privacy-preserving methods perform poorly in terms of inference speed and suffer from convergence problems during the training stage. To address these issues, we propose MERGE, a fast private text generation framework for Transformer-based language models. Specifically, MERGE reuses the output hidden state as the word embedding to bypass the embedding computation, and reorganizes the linear operations in the Transformer module to accelerate the forward procedure. Based on these two optimizations, extensive experiments show that MERGE can achieve a 26.5x speedup at sequence length 512 and reduce communication bytes by 80%, with up to a 10x speedup over existing state-of-the-art models.
https://arxiv.org/abs/2305.15769
Recent years have seen increasing concerns about the unsafe response generation of large-scale dialogue systems, where agents learn offensive or biased behaviors from real-world corpora. Some methods address this issue by detecting and replacing unsafe training examples in a pipeline style. Though effective, they suffer from a high annotation cost and adapt poorly to unseen scenarios as well as adversarial attacks. Moreover, failing to provide safe responses (e.g. simply replacing unsafe ones with templates) causes an information-missing problem in dialogues. To address these issues, we propose an unsupervised pseudo-label sampling method, TEMP, that can automatically assign potential safe responses. Specifically, our TEMP method groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy, inspired by the observation that unsafe samples in the clusters are usually few and are distributed in the tail. Extensive experiments in chitchat and task-oriented dialogues show that TEMP outperforms state-of-the-art models with weak supervision signals and obtains comparable results under unsupervised learning settings.
https://arxiv.org/abs/2305.15757
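The abstract above does not spell out how the "adaptively sharpened sampling strategy" is computed, so the following is only a minimal sketch of one plausible reading: candidates inside a cluster are sampled with a temperature-controlled softmax over their distance to the cluster centroid, so that a lower temperature pushes probability mass away from the tail where unsafe samples are said to sit. The function names, the distance-based scoring, and the fixed temperature are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sharpened_probs(distances, temperature):
    """Softmax over negative centroid distances (hypothetical scoring).

    A lower temperature concentrates probability on responses close to
    the centroid and suppresses the tail of the cluster.
    """
    logits = [-d / temperature for d in distances]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_pseudo_labels(responses, distances, k, temperature, seed=0):
    """Draw k pseudo-label responses (with replacement) from one cluster."""
    rng = random.Random(seed)
    probs = sharpened_probs(distances, temperature)
    return rng.choices(responses, weights=probs, k=k)


# Toy cluster: the third response lies far from the centroid (the "tail").
responses = ["Sure, happy to help.", "Let me check that.", "<tail response>"]
distances = [0.1, 0.5, 2.0]

print(sharpened_probs(distances, temperature=1.0))
print(sharpened_probs(distances, temperature=0.2))
print(sample_pseudo_labels(responses, distances, k=3, temperature=0.2))
```

In this sketch the temperature is a fixed argument; an adaptive variant would presumably set it per cluster, but the abstract gives no detail on how.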
Tasks involving text generation based on multiple input texts, such as multi-document summarization, long-form question answering and contemporary dialogue applications, challenge models for their ability to properly consolidate partly-overlapping multi-text information. However, these tasks entangle the consolidation phase with the often subjective and ill-defined content selection requirement, impeding proper assessment of models' consolidation capabilities. In this paper, we suggest revisiting the sentence union generation task as an effective well-defined testbed for assessing text consolidation capabilities, decoupling the consolidation challenge from subjective content selection. To support research on this task, we present refined annotation methodology and tools for crowdsourcing sentence union, create the largest union dataset to date and provide an analysis of its rich coverage of various consolidation aspects. We then propose a comprehensive evaluation protocol for union generation, including both human and automatic evaluation. Finally, as baselines, we evaluate state-of-the-art language models on the task, along with a detailed analysis of their capacity to address multi-text consolidation challenges and their limitations.
https://arxiv.org/abs/2305.15605
Multi-party dialogues are more difficult for models to understand than one-to-one two-party dialogues, since they involve multiple interlocutors, resulting in interweaving reply-to relations and information flows. To overcome these obstacles, an effective approach is to pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying. However, due to the lack of explicitly annotated discourse labels in multi-party dialogue corpora, previous works fail to scale up the pre-training process and leave the unlabeled multi-party conversational data unused. To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model using unsupervised latent variable inference methods. Experiments on multiple downstream tasks show that our pre-trained model outperforms strong baselines by large margins and achieves state-of-the-art (SOTA) results, demonstrating the effectiveness of our method. The official implementation of this paper is available at this https URL.
https://arxiv.org/abs/2305.15175
Conditional variational autoencoders (CVAEs) have recently been used for diverse response generation, introducing latent variables to represent the relationship between a dialog context and its potential responses. However, the diversity of the responses generated by a CVAE model is limited by the oversimplified assumption of an isotropic Gaussian prior. We propose Dior-CVAE, a hierarchical CVAE model with an informative prior produced by a diffusion model. Dior-CVAE derives a series of layer-wise latent variables using an attention mechanism and infuses them into the corresponding decoder layers. We propose memory dropout in the latent infusion to alleviate posterior collapse. The prior distribution of the latent variables is parameterized by a diffusion model to introduce a multimodal distribution. Overall, experiments on two popular open-domain dialog datasets indicate the advantages of our approach over previous Transformer-based variational dialog models in dialog response generation. We publicly release the code for reproducing Dior-CVAE and all baselines at this https URL.
https://arxiv.org/abs/2305.15025
Recently, there has been growing interest in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive: they not only need to optimize an excessive number of parameters, but also require another large-scale pre-training stage before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaptation of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and the LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. MMA is also equipped with a routing algorithm that helps LLMs shift automatically between single- and multi-modal instructions without compromising their natural language understanding ability. To validate MMA, we apply it to a recent LLM called LLaMA and call the resulting large vision-language instructed model LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and superior training efficiency of LaVIN over existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual expenditure of LaVIN is extremely low, e.g., only 1.4 training hours with 3.8M trainable parameters, further confirming the effectiveness of MMA. Our project is released at this https URL.
https://arxiv.org/abs/2305.15023
General chat models, like ChatGPT, have attained an impressive capability to resolve a wide range of NLP tasks by tuning Large Language Models (LLMs) with high-quality instruction data. However, collecting human-written high-quality data, especially multi-turn dialogues, is expensive and unattainable for most people. Although previous studies have used powerful LLMs to generate dialogues automatically, they all suffer from generating untruthful dialogues because of LLM hallucination. Therefore, we propose a method called RefGPT to generate large numbers of truthful and customized dialogues without worrying about factual errors caused by model hallucination. RefGPT addresses model hallucination in dialogue generation by restricting the LLMs to leverage the given reference instead of reciting their own knowledge to generate dialogues. Additionally, RefGPT adds detailed controls on every utterance to enable a high degree of customization, which previous studies have ignored. On the basis of RefGPT, we also propose two high-quality dialogue datasets generated by GPT-4, namely RefGPT-Fact and RefGPT-Code. RefGPT-Fact is a dataset of 100k multi-turn dialogues based on factual knowledge, and RefGPT-Code is a dataset of 76k multi-turn dialogues covering a wide range of coding scenarios. Our code and datasets are released in this https URL
https://arxiv.org/abs/2305.14994
We present Dolphin, a novel benchmark that addresses the need for an evaluation framework for the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including text summarization, machine translation, question answering, and dialogue generation, among others. Dolphin comprises a substantial corpus of 40 diverse and representative public datasets across 50 test splits, carefully curated to reflect real-world scenarios and the linguistic richness of Arabic. It sets a new standard for evaluating the performance and generalization capabilities of Arabic and multilingual models, promising to enable researchers to push the boundaries of current methodologies. We provide an extensive analysis of Dolphin, highlighting its diversity and identifying gaps in current Arabic NLG research. We also evaluate several Arabic and multilingual models on our benchmark, allowing us to set strong baselines against which researchers can compare.
https://arxiv.org/abs/2305.14989
This paper proposes a framework to address the issue of data scarcity in Document-Grounded Dialogue Systems (DGDS). Our model leverages high-resource languages to enhance the capability of dialogue generation in low-resource languages. Specifically, we present a novel pipeline, CLEM (Cross-Lingual Enhanced Model), which includes adversarially trained retrieval (Retriever and Re-ranker) and a FiD (fusion-in-decoder) generator. To further leverage high-resource languages, we also propose an innovative architecture that conducts alignment across different languages with translated training. Extensive experimental results demonstrate the effectiveness of our model, and we achieved 4th place in the DialDoc 2023 Competition. Therefore, CLEM can serve as a solution to resource scarcity in DGDS and provide useful guidance for multi-lingual alignment tasks.
https://arxiv.org/abs/2305.14949
The use of large language models (LLMs) in natural language processing (NLP) tasks is rapidly increasing, leading to changes in how researchers approach problems in the field. To fully utilize these models' abilities, a better understanding of their behavior under different input protocols is required. With LLMs, users can directly interact with the models through a text-based interface to define and solve various tasks. Hence, understanding the conversational abilities of these LLMs, which may not have been specifically trained for dialog modeling, is also important. This study examines different approaches for building dialog systems using LLMs by considering various aspects of the prompt. As part of prompt tuning, we experiment with various ways of providing instructions, exemplars, the current query, and additional context. The research also analyzes which representations of dialog history have the optimal usable-information density. Based on the findings, the paper suggests more compact ways of providing dialog history information while ensuring good performance and reducing the model's inference-API costs. The research contributes to a better understanding of how LLMs can be effectively used for building interactive systems.
https://arxiv.org/abs/2305.14919
Perceiving multi-modal information and conducting dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, due to the limited availability of multi-modal dialogue data, research on multi-modal dialogue pre-training remains scarce. Another intriguing challenge emerges from the encompassing nature of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future. Hence, it is essential for multi-modal dialogue models to possess sufficient flexibility to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It utilizes a combination of several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue data and extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously trained experts assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialog benchmarks.
https://arxiv.org/abs/2305.14839
Intent classification (IC) plays an important role in task-oriented dialogue systems as it identifies user intents from given utterances. However, models trained on limited annotations for IC often generalize poorly to unseen intent classes. We propose a novel pre-training method for text encoders that uses contrastive learning with intent pseudo-labels to produce embeddings that are well-suited for IC tasks. By applying this pre-training strategy, we also introduce the pre-trained intent-aware encoder (PIE). Specifically, we first train a tagger to identify key phrases within utterances that are crucial for interpreting intents. We then use these extracted phrases to create examples for pre-training a text encoder in a contrastive manner. As a result, our PIE model achieves up to 5.4% and 4.0% higher accuracy than the previous state-of-the-art pre-trained sentence encoder for the N-way zero- and one-shot settings on four IC datasets.
https://arxiv.org/abs/2305.14827
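The abstract above says only that the encoder is pre-trained "in a contrastive manner" with intent pseudo-labels; it does not name the objective. As a hedged illustration, the sketch below computes a standard InfoNCE-style loss for a single anchor, where the positive would be an embedding sharing the anchor's pseudo-label and the negatives would carry other pseudo-labels. The choice of InfoNCE, the cosine similarity, the temperature value, and the toy vectors are all assumptions and not the paper's stated method.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive pair.

    The positive is assumed to share the anchor's intent pseudo-label;
    the negatives are assumed to carry different pseudo-labels.
    """
    a = normalize(anchor)
    sims = [dot(a, normalize(positive))]
    sims += [dot(a, normalize(n)) for n in negatives]
    logits = [s / temperature for s in sims]
    top = max(logits)  # log-sum-exp with max subtraction for stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


# Toy 2-d embeddings: a well-aligned positive gives a much lower loss
# than a positive that points away from the anchor.
aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned, misaligned)
```

Minimizing a loss of this shape pulls same-pseudo-label embeddings together and pushes others apart, which is the general behavior the abstract attributes to its pre-training.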
We present metrics for evaluating dialog systems through a psychologically-grounded "human" lens: conversational agents express a diversity of both states (short-term factors like emotions) and traits (longer-term factors like personality) just as people do. These interpretable metrics consist of five measures from established psychology constructs that can be applied both across dialogs and on turns within dialogs: emotional entropy, linguistic style and emotion matching, as well as agreeableness and empathy. We compare these human metrics against 6 state-of-the-art automatic metrics (e.g. BARTScore and BLEURT) on 7 standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate the proposed human metrics offer novel information, are uncorrelated with automatic metrics, and lead to increased accuracy beyond existing automatic metrics for predicting crowd-sourced dialog judgements. The interpretability and unique signal of our proposed human-centered framework make it a valuable tool for evaluating and improving dialog systems.
https://arxiv.org/abs/2305.14757
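Of the five measures listed in the abstract above, emotional entropy is the easiest to illustrate. The abstract does not give its formula, so this sketch assumes it is the Shannon entropy of the distribution of emotion labels across an agent's turns, with the labels coming from some emotion classifier that is not shown here. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def emotional_entropy(emotion_labels):
    """Shannon entropy (in bits) of the empirical emotion-label distribution.

    Higher values mean the agent expresses a more varied mix of emotions;
    0.0 means every turn carries the same label (or there are no turns).
    """
    counts = Counter(emotion_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical per-turn labels for two agents.
varied_agent = ["joy", "joy", "sadness", "anger"]
flat_agent = ["neutral", "neutral", "neutral", "neutral"]

print(emotional_entropy(varied_agent))
print(emotional_entropy(flat_agent))
```

Computed per dialog or over turns within a dialog, a score like this could distinguish an agent with a flat emotional register from one with a more human-like range, which is the kind of signal the abstract describes.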
As product dialogue systems are constantly updated, we need to retrain the natural language understanding (NLU) model because new data from real users is merged into the existing data accumulated in previous updates. Within the newly added data, new intents emerge and may be semantically entangled with the existing intents, e.g. new intents that are semantically too specific or too generic are actually subsets or supersets of some existing intents in the semantic space, thus impairing the robustness of the NLU model. As the first attempt to solve this problem, we set up a new benchmark consisting of 4 Dialogue Version Control dataSets (DialogVCS). We formulate intent detection with imperfect data in the system update as a multi-label classification task with positive but unlabeled intents, which asks the models to recognize all the proper intents, including the ones with semantic entanglement, at inference time. We also propose comprehensive baseline models and conduct in-depth analyses for the benchmark, showing that the semantically entangled intents can be effectively recognized with an automatic workflow.
https://arxiv.org/abs/2305.14751
Editing real facial images is a crucial task in computer vision with significant demand in various real-world applications. While GAN-based methods have shown potential in manipulating images, especially when combined with CLIP, these methods are limited in their ability to reconstruct real images because GAN inversion is challenging. Despite the successful image reconstruction achieved by diffusion-based methods, there are still challenges in effectively manipulating fine-grained facial attributes with textual instructions. To address these issues and facilitate convenient manipulation of real facial images, we propose a novel approach that conducts text-driven image editing in the semantic latent space of a diffusion model. By aligning the temporal feature of the diffusion model with the semantic condition during the generative process, we introduce a stable manipulation strategy that performs precise zero-shot manipulation effectively. Furthermore, we develop an interactive system named ChatFace, which incorporates the zero-shot reasoning ability of large language models to perform efficient manipulations in the diffusion semantic latent space. This system enables users to perform complex multi-attribute manipulations through dialogue, opening up new possibilities for interactive image editing. Extensive experiments confirm that our approach outperforms previous methods and enables precise editing of real facial images, making it a promising candidate for real-world applications. Project page: this https URL
https://arxiv.org/abs/2305.14742
LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, there are many challenges in using them. Reference-free evaluators are more suitable for open-ended examples that admit responses with different semantics. But not all examples are open-ended. For closed-ended examples with a unique correct semantic response, reference-free evaluators will still rate a response as high quality even when it is inconsistent with the facts and with the semantics of the reference. In order to comprehensively evaluate the reliability of evaluators based on LLMs, we construct two adversarial meta-evaluation dialogue generation datasets, KdConv-ADV and DSTC7-ADV, based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging since they require evaluators to reasonably evaluate closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient. There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
https://arxiv.org/abs/2305.14658
Understanding the speaker's intended meaning often involves drawing commonsense inferences to reason about what is not stated explicitly. In multi-event sentences, it requires understanding the relationships between events based on contextual knowledge. We propose COMET-M (Multi-Event), an event-centric commonsense model capable of generating commonsense inferences for a target event within a complex sentence. COMET-M builds upon COMET (Bosselut et al., 2019), which excels at generating event-centric inferences for simple sentences, but struggles with the complexity of multi-event sentences prevalent in natural text. To overcome this limitation, we curate a multi-event inference dataset of 35K human-written inferences. We trained COMET-M on the human-written inferences and also created baselines using automatically labeled examples. Experimental results demonstrate the significant performance improvement of COMET-M over COMET in generating multi-event inferences. Moreover, COMET-M successfully produces distinct inferences for each target event, taking the complete context into consideration. COMET-M holds promise for downstream tasks involving natural text such as coreference resolution, dialogue, and story understanding.
https://arxiv.org/abs/2305.14617