Abstract
Language models have steadily increased in size over the past few years, and they achieve a high level of performance on various natural language processing (NLP) tasks such as question answering and summarization. Large language models (LLMs) used for generation can now output human-like text, so downstream tasks in the realm of dialog can harness their language-understanding capabilities. This paper explores one such task, dialog evaluation, concentrating on prompting with LLMs: BLOOM, OPT, GPT-3, Flan-T5, InstructDial, and TNLGv2. The paper shows that the choice of datasets used to train a model affects both how well it performs on a task and how the prompt should be structured: the more diverse and relevant the group of datasets a model is trained on, the better it performs at dialog evaluation. The paper also investigates how the number of examples in the prompt and the method of example selection affect the model's performance.
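As a rough illustration of the setup the abstract describes, the sketch below assembles a few-shot prompt for dialog evaluation: a fixed number of in-context examples (dialog context, candidate response, quality score) are placed before the target dialog to be scored. The function name, prompt wording, and 1–5 score scale are assumptions for illustration, not the paper's actual prompt format.

```python
# Illustrative sketch (not the paper's exact prompt): building a few-shot
# dialog-evaluation prompt with k in-context examples before the target.

def build_eval_prompt(examples, target_context, target_response, k=2):
    """Assemble a few-shot dialog-evaluation prompt.

    `examples` is a list of (context, response, score) tuples; only the
    first k are used. The 1-5 scale is a hypothetical choice.
    """
    parts = ["Rate the quality of each response on a scale of 1 to 5."]
    for context, response, score in examples[:k]:
        parts.append(f"Dialog: {context}\nResponse: {response}\nScore: {score}")
    # The target dialog ends with an open "Score:" slot for the LLM to fill.
    parts.append(f"Dialog: {target_context}\nResponse: {target_response}\nScore:")
    return "\n\n".join(parts)

prompt = build_eval_prompt(
    [("Hi, how are you?", "I'm doing well, thanks!", 5),
     ("What's the weather like?", "Bananas are yellow.", 1)],
    "Can you recommend a book?",
    "Sure, I really enjoyed 'Dune' recently.",
)
```

Varying `k` and the strategy for picking which tuples go into `examples` corresponds to the example-count and example-selection factors the paper studies.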
URL
https://arxiv.org/abs/2301.12004