Abstract
Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we explore to what extent high-quality training data is actually required for Extractive QA, and investigate the possibility of unsupervised Extractive QA. We approach this problem by first learning to generate context, question, and answer triples in an unsupervised manner, which we then use to synthesize Extractive QA training data automatically. To generate such triples, we first sample random context paragraphs from a large corpus of documents, and then sample random noun phrases or named entity mentions from these paragraphs as answers. Next, we convert answers in context to "fill-in-the-blank" cloze questions, and finally translate them into natural questions. We propose and compare various unsupervised ways to perform cloze-to-natural question translation, including training an unsupervised NMT model on non-aligned corpora of natural questions and cloze questions, as well as a rule-based approach. We find that modern QA models can learn to answer human questions surprisingly well using only synthetic training data. We demonstrate that, without using the SQuAD training data at all, our approach achieves 56.4 F1 on SQuAD v1 (64.5 F1 when the answer is a named entity mention), outperforming early supervised models.
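The synthesis pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the candidate answers are listed by hand (the paper uses noun-phrase chunking and NER models to find them), and `cloze_to_question` is a deliberately crude stand-in for the paper's rule-based and NMT-based cloze-to-natural-question translation.

```python
import random

def make_cloze(paragraph: str, answer: str, blank: str = "___") -> str:
    """Turn an answer-in-context pair into a fill-in-the-blank cloze question."""
    assert answer in paragraph
    # Replace the first occurrence of the answer span with a blank.
    return paragraph.replace(answer, blank, 1)

def cloze_to_question(cloze: str, answer_type: str) -> str:
    # Crude stand-in for the paper's cloze-to-natural-question translation:
    # substitute the blank with a wh-word chosen by answer type.
    wh = {"PERSON": "who", "DATE": "when", "LOCATION": "where"}
    return cloze.replace("___", wh.get(answer_type, "what")).rstrip(".") + "?"

def generate_example(paragraph: str, candidates: list[tuple[str, str]]) -> dict:
    """Sample an answer span and build a (context, question, answer) triple."""
    answer, answer_type = random.choice(candidates)
    cloze = make_cloze(paragraph, answer)
    return {
        "context": paragraph,
        "question": cloze_to_question(cloze, answer_type),
        "answer": answer,
    }

context = ("Alan Turing published his seminal paper on computable numbers "
           "in 1936 while at King's College, Cambridge.")
# In the paper these spans come from NER/chunking; listed by hand here.
candidates = [("Alan Turing", "PERSON"), ("1936", "DATE"),
              ("King's College", "LOCATION")]

example = generate_example(context, candidates)
print(example["question"])
```

The resulting (context, question, answer) triples can be written out in SQuAD format and fed directly to any extractive QA model as synthetic training data.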
URL
https://arxiv.org/abs/1906.04980