Despite the growing variety of languages supported by existing multilingual neural machine translation (MNMT) models, most of the world's languages are still being left behind. We aim to extend large-scale MNMT models to a new language, allowing for translation between the newly added and all of the already supported languages, in a challenging scenario: using only a parallel corpus between the new language and English. Previous approaches, such as continued training on parallel data including the new language, suffer from catastrophic forgetting (i.e., performance on other languages is reduced). Our novel approach, Imit-MNMT, treats the task as an imitation learning process that mimics the behavior of an expert, a technique widely used in computer vision but not well explored in NLP. More specifically, we construct a pseudo multi-parallel corpus of the new and the original languages by pivoting through English, and imitate the output distribution of the original MNMT model. Extensive experiments show that our approach significantly improves translation performance between the new and the original languages without severe catastrophic forgetting. We also demonstrate that our approach is capable of alleviating the copy and off-target problems, two common issues in current large-scale MNMT models.
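A rough sketch of the kind of imitation objective described above, in which the extended model is trained to match the output distribution of the frozen original MNMT model on the pivoted pseudo-parallel data; the temperature and shapes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def imitation_loss(student_logits, expert_logits, temperature=1.0):
    # KL divergence from the frozen expert's output distribution to the
    # student's; minimising it makes the student imitate the expert.
    expert_probs = F.softmax(expert_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_logp, expert_probs, reduction="batchmean")

# Toy check: identical logits give a loss of (near) zero.
logits = torch.randn(2, 5, 100)            # (batch, length, vocab)
print(imitation_loss(logits, logits))
```

In such a setup the imitation term would typically be mixed with the supervised loss on the new-language/English pairs, which is what counteracts catastrophic forgetting.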
https://arxiv.org/abs/2311.08538
Minimum Bayes Risk (MBR) decoding can significantly improve the translation performance of Multilingual Large Language Models (MLLMs). However, MBR decoding is computationally expensive. In this paper, we show how a recently developed Reinforcement Learning (RL) technique, Direct Preference Optimization (DPO), can be used to fine-tune MLLMs so that we get the gains of MBR without the additional computation at inference. Our fine-tuned models significantly improve performance on multiple NMT test sets compared to base MLLMs without preference optimization. Our method boosts the translation performance of MLLMs using relatively small monolingual fine-tuning sets.
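A minimal sketch of how MBR-ranked candidates can be turned into preference pairs for DPO, under the setup the abstract describes; the utility function below is a toy stand-in (a neural metric such as COMET would be used in practice):

```python
def mbr_rank(candidates, utility):
    # Score each candidate by its average utility against the others.
    scores = []
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(utility(hyp, ref) for ref in others) / max(len(others), 1))
    return sorted(zip(candidates, scores), key=lambda x: -x[1])

def make_dpo_pair(samples, utility):
    # Best MBR candidate becomes "chosen", worst becomes "rejected".
    ranked = mbr_rank(samples, utility)
    return {"chosen": ranked[0][0], "rejected": ranked[-1][0]}

def overlap(a, b):                      # toy utility: word-set overlap
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

print(make_dpo_pair(["the cat sat", "a cat sat", "dogs ran off"], overlap))
```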
https://arxiv.org/abs/2311.08380
We propose the on-the-fly ensembling of a machine translation model with an LLM prompted on the same task and input. We perform experiments on 4 language pairs (both directions) with varying data amounts. We find that a slightly weaker-at-translation LLM can improve the translations of an NMT model, and that ensembling with an LLM can produce better translations than ensembling two stronger MT models. We combine our method with various techniques from LLM prompting, such as in-context learning and translation context.
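One plausible reading of on-the-fly ensembling, sketched below: interpolate the two models' next-token distributions at every decoding step. The interpolation weight and the assumption of a shared vocabulary are illustrative, not necessarily the paper's exact mechanism:

```python
import torch

def ensemble_step(nmt_logits, llm_logits, lam=0.5):
    # Assumes both models score the same target vocabulary.
    p_nmt = torch.softmax(nmt_logits, dim=-1)
    p_llm = torch.softmax(llm_logits, dim=-1)
    return lam * p_nmt + (1 - lam) * p_llm

next_token = torch.argmax(ensemble_step(torch.randn(32000), torch.randn(32000)))
```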
https://arxiv.org/abs/2311.08306
Compositional generalisation (CG), in NLP and in machine learning more generally, has been assessed mostly using artificial datasets. It is important to develop benchmarks that assess CG in real-world natural language tasks as well, in order to understand the abilities and limitations of systems deployed in the wild. To this end, our GenBench Collaborative Benchmarking Task submission utilises the distribution-based compositionality assessment (DBCA) framework to split the Europarl translation corpus into a training and a test set in such a way that the test set requires compositional generalisation capacity. Specifically, the training and test sets have divergent distributions of dependency relations, testing NMT systems' capability of translating dependencies that they have not been trained on. This is a fully automated procedure for creating natural language compositionality benchmarks, making it simple and inexpensive to apply to other datasets and languages. The code and data for the experiments are available at this https URL.
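A toy illustration of the kind of distributional divergence the DBCA split optimises, using a simple Chernoff-style coefficient over dependency-relation counts; the exact measure and split procedure used by the framework may differ:

```python
from collections import Counter

def chernoff_divergence(dist_a, dist_b, alpha=0.5):
    # 0.0 for identical distributions, 1.0 for disjoint support.
    keys = set(dist_a) | set(dist_b)
    ta, tb = sum(dist_a.values()) or 1, sum(dist_b.values()) or 1
    coeff = sum((dist_a.get(k, 0) / ta) ** alpha *
                (dist_b.get(k, 0) / tb) ** (1 - alpha) for k in keys)
    return 1.0 - coeff

train_rels = Counter({("nsubj", "dog"): 40, ("obj", "ball"): 35})
test_rels = Counter({("nsubj", "ball"): 30, ("obj", "dog"): 25})
print(chernoff_divergence(train_rels, test_rels))   # high: divergent compounds
```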
https://arxiv.org/abs/2311.08249
Neural Machine Translation (NMT) models, though state-of-the-art for translation, often reflect social biases, particularly gender bias. Existing evaluation benchmarks primarily focus on English as the source language of translation. For source languages other than English, studies often employ gender-neutral sentences for bias evaluation, whereas real-world sentences frequently contain gender information in different forms. Therefore, it makes more sense to evaluate for bias using such source sentences to determine if NMT models can discern gender from the grammatical gender cues rather than relying on biased associations. To illustrate this, we create two gender-specific sentence sets in Hindi to automatically evaluate gender bias in various Hindi-English (HI-EN) NMT systems. We emphasise the significance of tailoring bias evaluation test sets to account for grammatical gender markers in the source language.
https://arxiv.org/abs/2312.03710
When training a neural network, it will quickly memorise some source-target mappings from your dataset but never learn some others. Yet, memorisation is not easily expressed as a binary feature that is good or bad: individual datapoints lie on a memorisation-generalisation continuum. What determines a datapoint's position on that spectrum, and how does that spectrum influence neural models' performance? We address these two questions for neural machine translation (NMT) models. We use the counterfactual memorisation metric to (1) build a resource that places 5M NMT datapoints on a memorisation-generalisation map, (2) illustrate how the datapoints' surface-level characteristics and a model's per-datum training signals are predictive of memorisation in NMT, and (3) describe the influence that subsets of that map have on NMT systems' performance.
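Counterfactual memorisation, sketched schematically: a datapoint's score is the gap between a model's per-datum performance when the datapoint was in the training subset and when it was held out, averaged over models trained on random subsets. This generic sketch captures the metric's shape, not the authors' code:

```python
def counterfactual_memorisation(per_model_perf, trained_on):
    # per_model_perf: per-datum scores (e.g., target log-likelihood), one per
    # trained model; trained_on: whether that model saw the datapoint.
    seen = [p for p, t in zip(per_model_perf, trained_on) if t]
    unseen = [p for p, t in zip(per_model_perf, trained_on) if not t]
    return sum(seen) / len(seen) - sum(unseen) / len(unseen)

print(counterfactual_memorisation([0.9, 0.8, 0.3, 0.2],
                                  [True, True, False, False]))   # 0.6
```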
https://arxiv.org/abs/2311.05379
Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen big improvements in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics for filtering out bad-quality sentence pairs from the training data of neural machine translation (NMT) systems. While most corpus filtering methods are focused on detecting noisy examples in collections of texts, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between the two approaches.
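The filtering step itself is simple; a minimal sketch, with a toy length-ratio heuristic standing in for a real neural QE model:

```python
def qe_filter(pairs, qe_score, keep_ratio=0.5):
    # Keep the highest-scoring fraction of (source, target) training pairs.
    ranked = sorted(pairs, key=lambda p: qe_score(*p), reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]

toy_qe = lambda src, tgt: 1 - abs(len(src) - len(tgt)) / max(len(src), 1)
print(qe_filter([("hello there", "hola"), ("good day", "buen dia")], toy_qe))
```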
https://arxiv.org/abs/2311.05350
Neural Machine Translation (NMT) models are state-of-the-art for machine translation. However, these models are known to have various social biases, especially gender bias. Most of the work on evaluating gender bias in NMT has focused primarily on English as the source language. For source languages other than English, most studies use gender-neutral sentences to evaluate gender bias. However, in practice, many sentences that we encounter do have gender information. Therefore, it makes more sense to evaluate for bias using such sentences. This allows us to determine if NMT models can identify the correct gender based on the grammatical gender cues in the source sentence rather than relying on biased correlations with, say, occupation terms. To demonstrate our point, in this work, we use Hindi as the source language and construct two sets of gender-specific sentences, OTSC-Hindi and WinoMT-Hindi, which we use to automatically evaluate different Hindi-English (HI-EN) NMT systems for gender bias. Our work highlights the importance of considering the nature of language when designing such extrinsic bias evaluation datasets.
https://arxiv.org/abs/2311.03767
Contemporary translation engines built upon the encoder-decoder framework have reached a high level of development, while the emergence of Large Language Models (LLMs) has disrupted their position by offering the potential for superior translation quality. Therefore, it is crucial to understand in which scenarios LLMs outperform traditional NMT systems and how to leverage their strengths. In this paper, we first conduct a comprehensive analysis to assess the strengths and limitations of various commercial NMT systems and MT-oriented LLMs. Our findings indicate that neither NMT nor MT-oriented LLMs alone can effectively address all translation issues, but MT-oriented LLMs can serve as a promising complement to NMT systems. Building upon these insights, we explore hybrid methods and propose Cooperative Decoding (CoDec), which treats NMT systems as a pretranslation model and MT-oriented LLMs as a supplemental solution to handle complex scenarios beyond the capability of NMT alone. The results on the WMT22 test sets and a newly collected test set, WebCrawl, demonstrate the effectiveness and efficiency of CoDec, highlighting its potential as a robust solution for combining NMT systems with MT-oriented LLMs in machine translation.
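One way to read the NMT-as-pretranslation, LLM-as-supplement division of labour, as a heavily hedged sketch; the trigger condition and the prompt are assumptions, not the paper's exact mechanism:

```python
def codec_translate(src, nmt_translate, llm_complete, confidence, tau=0.6):
    draft = nmt_translate(src)                   # cheap pretranslation
    if confidence(src, draft) >= tau:
        return draft                             # NMT alone suffices
    prompt = (f"Source: {src}\nDraft translation: {draft}\n"
              "Improve the draft where needed; output only the translation.")
    return llm_complete(prompt)                  # LLM handles the hard case
```

Because the LLM is invoked only on the fraction of inputs the NMT system handles poorly, a scheme of this shape retains most of the NMT system's speed.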
https://arxiv.org/abs/2311.02851
Neural machine translation (NMT) for low-resource local languages in Indonesia faces significant challenges, including the need for a representative benchmark and limited data availability. This work addresses these challenges by comprehensively analyzing the training of NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese. Our study encompasses various training approaches, paradigms, and data sizes, and includes a preliminary study on using large language models to generate synthetic parallel data for low-resource languages. We reveal specific trends and insights into practical strategies for low-resource language translation. Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performance, rivaling the translation quality of zero-shot gpt-3.5-turbo. These findings significantly advance NMT for low-resource languages, offering valuable guidance for researchers in similar contexts.
https://arxiv.org/abs/2311.00998
Continuous-output neural machine translation (CoNMT) replaces the discrete next-word prediction problem with an embedding prediction. The semantic structure of the target embedding space (i.e., closeness of related words) is intuitively believed to be crucial. We challenge this assumption and show that completely random output embeddings can outperform laboriously pretrained ones, especially on larger datasets. Further investigation shows this surprising effect is strongest for rare words, due to the geometry of their embeddings. We shed further light on this finding by designing a mixed strategy that combines random and pre-trained embeddings for different tokens.
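A schematic of continuous-output NMT with random target embeddings as the abstract describes: the decoder predicts one vector per position, trained toward the target token's fixed, randomly initialised embedding, and decoding is a nearest-embedding lookup. The shapes and the cosine loss are illustrative choices:

```python
import torch
import torch.nn.functional as F

vocab, dim = 1000, 64
target_emb = F.normalize(torch.randn(vocab, dim), dim=-1)   # frozen, random

def conmt_loss(pred, target_ids):
    gold = target_emb[target_ids]                           # (batch, len, dim)
    return (1 - F.cosine_similarity(pred, gold, dim=-1)).mean()

def decode(pred):
    sims = F.normalize(pred, dim=-1) @ target_emb.T         # (batch, len, vocab)
    return sims.argmax(dim=-1)                              # nearest embedding

pred = torch.randn(2, 5, dim)
print(conmt_loss(pred, torch.randint(vocab, (2, 5))), decode(pred).shape)
```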
https://arxiv.org/abs/2310.20620
Neural Machine Translation (NMT) has become a significant technology in natural language processing through extensive research and development. However, the scarcity of high-quality bilingual parallel data still poses a major challenge to improving NMT performance. Recent studies explore the use of contextual information from pre-trained language models (PLMs) to address this problem. Yet, the issue of incompatibility between PLMs and NMT models remains unresolved. This study proposes a PLM-integrated NMT (PiNMT) model to overcome the identified problems. The PiNMT model consists of three critical components (PLM Multi Layer Converter, Embedding Fusion, and Cosine Alignment), each playing a vital role in providing effective PLM information to NMT. Furthermore, two training strategies, Separate Learning Rates and Dual Step Training, are also introduced in this paper. By implementing the proposed PiNMT model and training strategies, we achieve state-of-the-art performance on the IWSLT'14 En$\leftrightarrow$De dataset. This study's outcomes are noteworthy as they demonstrate a novel approach for efficiently integrating PLMs with NMT to overcome incompatibility and enhance performance.
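Of the three components, Cosine Alignment is the easiest to sketch; a plausible form, assuming it pulls NMT encoder states toward the converted PLM representations (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(nmt_states, plm_states):
    # Both: (batch, len, dim); 1 - cosine similarity, averaged over tokens.
    return (1 - F.cosine_similarity(nmt_states, plm_states, dim=-1)).mean()

print(cosine_alignment_loss(torch.randn(2, 7, 512), torch.randn(2, 7, 512)))
```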
https://arxiv.org/abs/2310.19680
Community Question-Answering (CQA) portals serve as a valuable tool for helping users within an organization. However, making them accessible to non-English-speaking users continues to be a challenge. Translating questions can broaden the community's reach, benefiting individuals with similar inquiries in various languages. Translating questions using Neural Machine Translation (NMT) poses more challenges, especially in noisy environments where the grammatical correctness of the questions is not monitored. These questions may be phrased as statements by non-native speakers, with incorrect subject-verb order and sometimes even missing question marks. Creating a synthetic parallel corpus from such data is also difficult due to its noisy nature. To address this issue, we propose a training methodology that fine-tunes the NMT system using only source-side data. Our approach balances adequacy and fluency by utilizing a loss function that combines BERTScore and Masked Language Model (MLM) Score. Our method surpasses the conventional Maximum Likelihood Estimation (MLE) based fine-tuning approach, which relies on synthetic target data, achieving a 1.9 BLEU score improvement. Our model exhibits robustness when we add noise to our baseline, still achieving a 1.1 BLEU improvement and large improvements on the TER and BLEURT metrics. Our proposed methodology is model-agnostic and is only necessary during the training phase. We make the code and datasets publicly available at \url{this https URL} to facilitate further research.
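The combined objective, schematically: adequacy via BERTScore between the source and the model's output, fluency via an MLM score of the output. The weighting and the scoring callables below are placeholders, not the paper's exact definitions:

```python
def combined_loss(src, hyp, bert_score, mlm_score, beta=0.5):
    adequacy = 1 - bert_score(src, hyp)   # higher BERTScore = better adequacy
    fluency = -mlm_score(hyp)             # higher MLM score = more fluent
    return beta * adequacy + (1 - beta) * fluency
```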
https://arxiv.org/abs/2310.15259
Large Language Models (LLMs) have demonstrated considerable success in various Natural Language Processing tasks, but they have yet to attain state-of-the-art performance in Neural Machine Translation (NMT). Nevertheless, their significant performance in tasks demanding a broad understanding and contextual processing shows their potential for translation. To exploit these abilities, we investigate using LLMs for MT and explore recent parameter-efficient fine-tuning techniques. Surprisingly, our initial experiments find that fine-tuning for translation purposes even leads to performance degradation. To overcome this, we propose an alternative approach: adapting LLMs as Automatic Post-Editors (APE) rather than direct translators. Building on LLMs' exceptional ability to process and generate lengthy sequences, we also propose extending our approach to document-level translation. We show that leveraging Low-Rank-Adapter fine-tuning for APE can yield significant improvements across both sentence- and document-level metrics while generalizing to out-of-domain data. Most notably, we achieve a state-of-the-art accuracy rate of 89\% on the ContraPro test set, which specifically assesses the model's ability to resolve pronoun ambiguities when translating from English to German. Lastly, we investigate a practical scenario involving manual post-editing for document-level translation, where reference context is made available. Here, we demonstrate that leveraging human corrections can significantly reduce the number of edits required for subsequent translations.\footnote{An interactive demo for integrating manual feedback can be found \href{this https URL}{here}.}
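A hedged sketch of Low-Rank-Adapter fine-tuning for the APE setup: the model sees the source plus a draft MT output and is trained to emit the post-edited target. The peft configuration values are illustrative defaults, and gpt2 is only a small runnable stand-in for the actual LLM:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the LLM
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()    # only the low-rank adapters are trained

def ape_prompt(src, draft):
    # Fine-tuning maps source + MT draft to the post-edited translation.
    return f"Source: {src}\nMT draft: {draft}\nPost-edited translation:"
```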
https://arxiv.org/abs/2310.14855
Lexical ambiguity is a significant and pervasive challenge in Neural Machine Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to handle polysemous words (Campolungo et al., 2022). The same holds for the NMT pretraining paradigm of denoising synthetic "code-switched" text (Pan et al., 2021; Iyer et al., 2023), where word senses are ignored in the noising stage, leading to harmful sense biases in the pretraining data that are subsequently inherited by the resulting models. In this work, we introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT), an end-to-end approach for pretraining multilingual NMT models that leverages word sense-specific information from Knowledge Bases. Our experiments show significant improvements in overall translation quality. We then show that our approach scales robustly to various challenging data and resource-scarce scenarios and, finally, report fine-grained accuracy improvements on the DiBiMT disambiguation benchmark. Our studies yield interesting and novel insights into the merits and challenges of integrating word sense information and structured knowledge into multilingual pretraining for NMT.
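A toy version of the sense-aware noising this implies: instead of swapping a word for an arbitrary translation during code-switched noising, pick the translation of its disambiguated sense. The tiny inline dictionary stands in for a real knowledge base:

```python
SENSE_KB = {  # (lemma, sense) -> translations; purely illustrative entries
    ("bank", "finance"): {"de": "Bank"},
    ("bank", "river"): {"de": "Ufer"},
}

def sense_aware_switch(token, sense, lang):
    # Keep the original token when the KB has no entry for this sense.
    return SENSE_KB.get((token, sense), {}).get(lang, token)

print(sense_aware_switch("bank", "river", "de"))   # -> "Ufer", not "Bank"
```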
https://arxiv.org/abs/2310.14050
Back translation (BT) is one of the most significant technologies in the NMT research field. Existing attempts at BT share a common characteristic: they employ either beam search or random sampling to generate synthetic data with a backward model, but little work studies the role of synthetic data in the performance of BT. This motivates us to ask a fundamental question: what kind of synthetic data contributes to BT performance? Through both theoretical and empirical studies, we identify two key factors of synthetic data that control back-translation NMT performance: quality and importance. Furthermore, based on our findings, we propose a simple yet effective method for generating synthetic data that better trades off both factors so as to yield better performance for BT. We run extensive experiments on the WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. By employing our proposed method to generate synthetic data, our BT model significantly outperforms the standard BT baselines (i.e., beam- and sampling-based methods for data generation), which proves the effectiveness of our proposed method.
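A hedged sketch of the quality/importance trade-off when selecting among candidate back-translations; both scoring callables are stand-ins, not the paper's definitions:

```python
def select_synthetic(candidates, quality, importance, gamma=0.5):
    # quality: e.g., backward-model probability of the candidate;
    # importance: e.g., how rare/informative its tokens are for training.
    return max(candidates, key=lambda c: gamma * quality(c)
                                         + (1 - gamma) * importance(c))
```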
https://arxiv.org/abs/2310.13675
This project introduces an advanced English-to-Arabic translator surpassing conventional tools. Leveraging the Helsinki transformer (MarianMT), our approach involves fine-tuning on a self-scraped, purely literary Arabic dataset. Evaluations against Google Translate show consistent outperformance in qualitative assessments. Notably, it excels in cultural sensitivity and context accuracy. This research underscores the Helsinki transformer's superiority for English-to-Arabic translation using a Fusha dataset.
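For reference, the base setup is readily reproducible with the public MarianMT checkpoint; a minimal sketch (the fine-tuning data and hyperparameters are the project's own and are not shown here):

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-ar"          # public Helsinki EN->AR model
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer(["The night was silent."], return_tensors="pt", padding=True)
out = model.generate(**batch)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```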
https://arxiv.org/abs/2310.13613
Transformer models have demonstrated remarkable performance in neural machine translation (NMT). However, their vulnerability to noisy input poses a significant challenge in practical implementation, where generating clean output from noisy input is crucial. The MTNT dataset \cite{MTNT} is widely used as a benchmark for evaluating the robustness of NMT models against noisy input. Nevertheless, its utility is limited due to the presence of noise in both the source and target sentences. To address this limitation, we focus on cleaning the noise from the target sentences in MTNT, making it more suitable as a benchmark for noise evaluation. Leveraging the capabilities of large language models (LLMs), we observe their impressive abilities in noise removal. For example, they can remove emojis while considering their semantic meaning. Additionally, we show that LLMs can effectively rephrase slang, jargon, and profanities. The resulting datasets, called C-MTNT, exhibit significantly less noise in the target sentences while preserving the semantic integrity of the original sentences. Our human and GPT-4 evaluations also lead to a consistent conclusion that LLMs perform well on this task. Lastly, experiments on C-MTNT showcase its effectiveness in evaluating the robustness of NMT models, highlighting the potential of advanced language models for data cleaning and emphasizing C-MTNT as a valuable resource.
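An illustrative version of the target-side cleaning step (the actual prompts used to build C-MTNT are not given here; the call follows the OpenAI chat-completions API):

```python
from openai import OpenAI

client = OpenAI()

def clean_target(sentence: str) -> str:
    prompt = (
        "Rewrite the following sentence in clean, standard language. "
        "Remove emojis (preserving their meaning where it matters) and "
        "rephrase slang, jargon, and profanity. Keep the original meaning.\n\n"
        f"Sentence: {sentence}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```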
https://arxiv.org/abs/2310.13469
k-nearest-neighbor machine translation (kNN-MT) boosts the translation quality of a pre-trained neural machine translation (NMT) model by utilizing translation examples during decoding. Translation examples are stored in a vector database, called a datastore, which contains one entry for each target token from the parallel data it is made from. Due to its size, it is computationally expensive both to construct the datastore and to retrieve examples from it. In this paper, we present knn-seq, an efficient and extensible kNN-MT framework for researchers and developers that is carefully designed to run efficiently, even with a billion-scale large datastore. knn-seq is developed as a plug-in for fairseq and makes it easy to switch models and kNN indexes. Experimental results show that our implemented kNN-MT achieves a gain comparable to the original kNN-MT, and that billion-scale datastore construction took 2.21 hours for the WMT'19 German-to-English translation task. We publish knn-seq as an MIT-licensed open-source project; the code is available at this https URL. The demo video is available at this https URL.
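The core kNN-MT computation that any such framework implements, sketched: turn the distances of the retrieved datastore entries into a token distribution and interpolate it with the NMT model's. The temperature and interpolation weight are illustrative:

```python
import torch

def knn_mt_probs(nmt_probs, knn_dists, knn_token_ids, lam=0.5, temp=10.0):
    weights = torch.softmax(-knn_dists / temp, dim=-1)       # (k,)
    p_knn = torch.zeros_like(nmt_probs)                      # (vocab,)
    p_knn.scatter_add_(0, knn_token_ids, weights)            # sum per token id
    return lam * p_knn + (1 - lam) * nmt_probs

probs = knn_mt_probs(torch.full((100,), 0.01),
                     torch.tensor([1.0, 4.0, 9.0]),          # L2 distances
                     torch.tensor([7, 7, 42]))               # neighbor tokens
```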
https://arxiv.org/abs/2310.12352
Direct neural machine translation (direct NMT) is a type of NMT system that translates text between two non-English languages. Direct NMT systems often face limitations due to the scarcity of parallel data between non-English language pairs. Several approaches have been proposed to address this limitation, such as multilingual NMT and pivot NMT (translation between two languages via English). Task-level Mixture-of-Experts (Task-level MoE) models, an inference-efficient variation of Transformer-based models, have shown promising NMT performance for a large number of language pairs. In Task-level MoE, different language groups can use different routing strategies to optimize cross-lingual learning and inference speed. In this work, we examine Task-level MoE's applicability to direct NMT and propose a series of high-performing training and evaluation configurations, through which Task-level MoE-based direct NMT systems outperform bilingual and pivot-based models for a large number of low- and high-resource direct pairs and translation directions. Our Task-level MoE with 16 experts outperforms bilingual and pivot NMT models for 7 language pairs, while pivot-based models still perform better in 9 pairs and directions.
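Task-level routing, schematically: unlike token-level MoE, every token of a given task (here, a language pair) is routed to the same expert subset, so the routing decision reduces to a cheap lookup at inference time. The grouping below is an illustrative strategy, not the paper's configuration:

```python
ROUTING_TABLE = {                 # task -> expert ids (hypothetical groups)
    ("en", "de"): [0, 1],
    ("hi", "zh"): [2, 3],
}

def route(task, experts, x):
    ids = ROUTING_TABLE.get(task, [0])       # default expert as fallback
    return sum(experts[i](x) for i in ids) / len(ids)

experts = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0)]
print(route(("en", "de"), experts, 10.0))    # (10 + 20) / 2 = 15.0
```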
https://arxiv.org/abs/2310.12236