Abstract
Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive, in contrast to its sentence-level counterpart. However, due to its longer inputs and the limited availability of training data, DocNMT often faces the challenge of data sparsity. To overcome this issue, we propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norms of hidden states and training gradients. We conduct comprehensive experiments on three widely used DocNMT benchmarks. Our empirical results show that the proposed IADA outperforms strong DocNMT baselines as well as several data augmentation approaches, with statistical significance on both sentence-level and document-level BLEU.
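The abstract describes estimating token importance from the norms of hidden states and training gradients, then augmenting the data accordingly. The paper's exact formulation is not given here; the following is a minimal hypothetical sketch in which importance is the sum of the normalized hidden-state norm and gradient norm per token, and the least important tokens are masked to create an augmented training example. All function names, the combination rule, and the masking strategy are illustrative assumptions, not the authors' actual algorithm.

```python
import numpy as np

def token_importance(hidden_states, gradients):
    """Combine per-token hidden-state and gradient norms into one score.

    hidden_states, gradients: arrays of shape (seq_len, d_model).
    The additive combination of normalized norms is an assumption.
    """
    h_norm = np.linalg.norm(hidden_states, axis=-1)
    g_norm = np.linalg.norm(gradients, axis=-1)
    return h_norm / h_norm.sum() + g_norm / g_norm.sum()

def augment(tokens, importance, mask_ratio=0.3, mask_token="<mask>"):
    """Mask the least important tokens, preserving the important ones.

    This keeps tokens the model relies on (high norm) intact while
    perturbing low-importance positions to create a new training example.
    """
    k = max(1, int(len(tokens) * mask_ratio))
    low_idx = np.argsort(importance)[:k]  # indices of least important tokens
    out = list(tokens)
    for i in low_idx:
        out[i] = mask_token
    return out
```

For example, with three tokens whose hidden-state and gradient norms both increase left to right, the first (lowest-scoring) token would be masked, yielding an augmented sentence that still retains the most informative tokens.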
URL
https://arxiv.org/abs/2401.15360