Cross-lingual Data Augmentation for Document-grounded Dialog Systems in Low Resource Languages

Abstract

This paper proposes a framework to address data scarcity in Document-Grounded Dialogue Systems (DGDS). Our model leverages high-resource languages to improve dialogue generation in low-resource languages. Specifically, we present a novel pipeline, CLEM (Cross-Lingual Enhanced Model), which comprises adversarially trained retrieval components (a retriever and a re-ranker) and a FiD (fusion-in-decoder) generator. To further exploit high-resource languages, we also propose an architecture that aligns different languages through translated training. Extensive experimental results demonstrate the effectiveness of our model, which achieved 4th place in the DialDoc 2023 Competition. CLEM can therefore serve as a solution to resource scarcity in DGDS and provide useful guidance for multilingual alignment tasks.
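
The three-stage shape named in the abstract (retriever, re-ranker, fusion-in-decoder generator) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the token-overlap scores, and the "question: ... context: ..." input template are all assumptions made for illustration, and the adversarial training and cross-lingual alignment are not modeled. The only property it tries to show is that a FiD-style generator receives one (question, passage) input per retained passage, rather than a single concatenated string.

```python
from collections import Counter


def overlap_score(query, passage):
    """Toy relevance score (assumption): number of shared lowercase tokens."""
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values())


def retrieve(query, passages, k):
    """Stage 1: keep the k passages with the highest toy score."""
    return sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)[:k]


def rerank(query, candidates, k):
    """Stage 2: re-score the candidates with a second, length-normalised score."""
    def score(p):
        return overlap_score(query, p) / (len(p.split()) or 1)
    return sorted(candidates, key=score, reverse=True)[:k]


def fid_inputs(query, passages):
    """Stage 3 input: one string per passage. In FiD each string is encoded
    separately and the decoder attends over all encoded passages jointly."""
    return [f"question: {query} context: {p}" for p in passages]


if __name__ == "__main__":
    docs = [
        "how to renew a passport online",
        "the weather is nice today",
        "passport renewal fees and forms to renew",
    ]
    query = "renew passport"
    candidates = retrieve(query, docs, k=2)
    best = rerank(query, candidates, k=1)
    for line in fid_inputs(query, best):
        print(line)
```

In a real system the two scoring functions would be learned models and the strings returned by `fid_inputs` would be tokenized and passed to an encoder-decoder model; the sketch only fixes the data flow between the stages.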

URL

https://arxiv.org/abs/2305.14949

PDF

https://arxiv.org/pdf/2305.14949.pdf