XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages


Abstract

The lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low-resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only, where English reference articles are summarized to generate English Wikipedia pages. For low-resource languages, however, reference articles are scarce, which makes monolingual summarization ineffective. Hence, in this work, we propose XWikiGen, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five domains and eight languages. We use this dataset to train a two-stage system whose input is a set of citations and a section title and whose output is a section-specific summary in the LR language. The proposed system first applies neural unsupervised extractive summarization to coarsely identify salient information, then uses a neural abstractive model to generate the section-specific text. Extensive experiments show that, on average, multi-domain training performs better than the multi-lingual setup.
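The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the extractive or abstractive models, so stage 1 below swaps in a simple bag-of-words cosine similarity against the section title as a stand-in for the neural unsupervised extractive step, and stage 2 only assembles the text that a hypothetical abstractive generator would consume. All function names, the `top_k` parameter, and the input string format are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive lowercase word tokenizer (a simplification, not a multilingual one)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_salient(citations, section_title, top_k=2):
    """Stage 1 (stand-in): keep the top_k citation sentences closest to the title.

    Assumption: lexical similarity replaces the neural unsupervised
    extractive summarizer mentioned in the abstract.
    """
    query = Counter(tokenize(section_title))
    sentences = []
    for doc in citations:
        # Split on whitespace that follows sentence-ending punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence:
                sentences.append(sentence)
    scored = [
        (cosine(Counter(tokenize(s)), query), i, s)
        for i, s in enumerate(sentences)
    ]
    # Highest score first; ties broken by original position.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:top_k]
    # Restore the original sentence order for readability.
    top.sort(key=lambda item: item[1])
    return [s for _, _, s in top]


def build_generator_input(section_title, salient, target_lang):
    """Stage 2 input (hypothetical format): language tag, title, salient text.

    A real system would pass this string to a trained abstractive model.
    """
    return f"<{target_lang}> {section_title} || " + " ".join(salient)


if __name__ == "__main__":
    citations = [
        "She was born in Pune. She studied physics.",
        "Her early life was spent in Pune. She won an award in 2010.",
    ]
    salient = extract_salient(citations, "Early life", top_k=1)
    print(build_generator_input("Early life", salient, "hi"))
```

In this sketch the section title acts as the query that makes the extracted content section-specific, mirroring the abstract's description of the input as a set of citations plus a section title. The cross-lingual aspect (citations in several languages, output in one LR language) is left entirely to the assumed downstream generator.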

URL

https://arxiv.org/abs/2303.12308

PDF

https://arxiv.org/pdf/2303.12308.pdf