Abstract
Existing methods for creating source-grounded information-seeking dialog datasets are often costly and hard to implement because they rely solely on human annotators. We propose combining large language model (LLM) prompting with human expertise for more efficient and reliable data generation. Instead of the labor-intensive Wizard-of-Oz (WOZ) method, in which two annotators generate a dialog from scratch by role-playing the agent and the user, we use LLM generation to simulate the two roles. Annotators then verify the output and augment it with attribution data. We demonstrate our method by constructing MISeD -- the Meeting Information Seeking Dialogs dataset -- the first information-seeking dialog dataset focused on meeting transcripts. Models finetuned on MISeD demonstrate superior performance on our test set, as well as on a novel fully manual WOZ test set and an existing query-based summarization benchmark, suggesting the utility of our approach.
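The core idea -- one LLM call playing the user and another playing the agent, with annotators verifying afterward -- can be sketched as a simple alternating loop. This is a minimal illustration, not the authors' actual pipeline; the prompts and the `call_llm` callable are assumptions, and a trivial stub stands in for a real LLM:

```python
def simulate_dialog(transcript, call_llm, num_turns=3):
    """Alternate user/agent LLM calls to build an information-seeking dialog."""
    dialog = []
    for _ in range(num_turns):
        # User role: ask the next question, grounded in the transcript
        # and the dialog so far.
        user_prompt = (
            "You are a user seeking information about this meeting.\n"
            f"Transcript:\n{transcript}\n"
            f"Dialog so far:\n{dialog}\n"
            "Ask your next question:"
        )
        question = call_llm(user_prompt)
        # Agent role: answer that question from the transcript.
        agent_prompt = (
            "You are an assistant answering questions from a meeting transcript.\n"
            f"Transcript:\n{transcript}\n"
            f"Question: {question}\n"
            "Answer:"
        )
        answer = call_llm(agent_prompt)
        dialog.append({"user": question, "agent": answer})
    # Human annotators would then verify these turns and add attribution data.
    return dialog

# Illustration with a deterministic stub in place of a real LLM:
stub = lambda prompt: "…"
dialog = simulate_dialog("Alice: Let's ship the release on Friday.", stub, num_turns=2)
```

The human effort thus shifts from authoring dialogs turn by turn (as in WOZ) to the cheaper task of verifying and annotating generated ones.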
URL
https://arxiv.org/abs/2405.01121