adaptNMT streamlines all processes involved in the development and deployment of RNN and Transformer neural translation models. As an open-source application, it is designed for both technical and non-technical users who work in the field of machine translation. Built upon the widely-adopted OpenNMT ecosystem, the application is particularly useful for new entrants to the field since the setup of the development environment and creation of train, validation and test splits is greatly simplified. Graphing, embedded within the application, illustrates the progress of model training, and SentencePiece is used for creating subword segmentation models. Hyperparameter customization is facilitated through an intuitive user interface, and a single-click model development approach has been implemented. Models developed by adaptNMT can be evaluated using a range of metrics, and deployed as a translation service within the application. To support eco-friendly research in the NLP space, a green report also flags the power consumption and kgCO$_{2}$ emissions generated during model development. The application is freely available.
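adaptNMT automates steps such as corpus splitting and subword-model training; as a hedged illustration, here is a minimal Python sketch of the SentencePiece step it wraps (the toy corpus, file names, and vocabulary size are placeholder assumptions, not adaptNMT's defaults):

```python
# Minimal sketch of the subword-segmentation step adaptNMT automates.
# The corpus, file names, and vocab size are toy placeholders.
import sentencepiece as spm

# Tiny stand-in corpus so the sketch is self-contained; adaptNMT would
# use the real training split here.
corpus = [
    "adaptNMT simplifies neural machine translation.",
    "SentencePiece builds subword segmentation models.",
    "Models can be trained and deployed with one click.",
]
with open("train.txt", "w") as f:
    for _ in range(200):
        f.write("\n".join(corpus) + "\n")

spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="bpe_toy",
    vocab_size=80,        # toy size; real models use e.g. a 16k vocabulary
    model_type="bpe",     # "unigram" is the other supported type
)

# Apply the trained model to segment a sentence into subwords.
sp = spm.SentencePieceProcessor(model_file="bpe_toy.model")
print(sp.encode("adaptNMT simplifies neural machine translation.", out_type=str))
```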
https://arxiv.org/abs/2403.02367
In this study, a human evaluation is carried out on how hyperparameter settings impact the quality of Transformer-based Neural Machine Translation (NMT) for the low-resourced English--Irish pair. SentencePiece models using both Byte Pair Encoding (BPE) and unigram approaches were appraised. Variations in model architectures included modifying the number of layers, evaluating the optimal number of heads for attention and testing various regularisation techniques. The greatest performance improvement was recorded for a Transformer-optimized model with a 16k BPE subword model. Compared with a baseline Recurrent Neural Network (RNN) model, a Transformer-optimized model demonstrated a BLEU score improvement of 7.8 points. When benchmarked against Google Translate, our translation engines demonstrated significant improvements. Furthermore, a quantitative fine-grained manual evaluation was conducted which compared the performance of machine translation systems. Using the Multidimensional Quality Metrics (MQM) error taxonomy, a human evaluation of the error types generated by an RNN-based system and a Transformer-based system was explored. Our findings show the best-performing Transformer system significantly reduces both accuracy and fluency errors when compared with an RNN-based model.
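The BLEU comparisons above can be reproduced with standard tooling; a minimal sacreBLEU sketch (the Irish hypothesis and reference strings are invented placeholders):

```python
# Minimal sketch of corpus-level BLEU scoring with sacreBLEU.
# The hypothesis/reference strings are invented placeholders.
import sacrebleu

refs = [["Tá an aimsir go maith inniu."]]        # one reference stream
rnn_hyp = ["Tá an aimsir maith inniu."]          # output of an RNN system
transformer_hyp = ["Tá an aimsir go maith inniu."]  # output of a Transformer system

rnn_bleu = sacrebleu.corpus_bleu(rnn_hyp, refs).score
trf_bleu = sacrebleu.corpus_bleu(transformer_hyp, refs).score
print(f"RNN: {rnn_bleu:.1f}  Transformer: {trf_bleu:.1f}  gain: {trf_bleu - rnn_bleu:.1f}")
```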
https://arxiv.org/abs/2403.02366
In the current machine translation (MT) landscape, the Transformer architecture stands out as the gold standard, especially for high-resource language pairs. This research delves into its efficacy for low-resource language pairs including both the English$\leftrightarrow$Irish and English$\leftrightarrow$Marathi language pairs. Notably, the study identifies the optimal hyperparameters and subword model type to significantly improve the translation quality of Transformer models for low-resource language pairs. The scarcity of parallel datasets for low-resource languages can hinder MT development. To address this, gaHealth was developed, the first bilingual corpus of health data for the Irish language. Focusing on the health domain, models developed using this in-domain dataset exhibited very significant improvements in BLEU score when compared with models from the LoResMT2021 Shared Task. A subsequent human evaluation using the multidimensional quality metrics error taxonomy showcased the superior performance of the Transformer system in reducing both accuracy and fluency errors compared to an RNN-based counterpart. Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source applications streamlined for the development, fine-tuning, and deployment of neural machine translation models. These tools considerably simplify the setup and evaluation process, making MT more accessible to both developers and translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes eco-friendly natural language processing research by highlighting the environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM demonstrated advancements in translation performance for two low-resource language pairs: English$\leftrightarrow$Irish and English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021 Shared Task.
https://arxiv.org/abs/2403.01580
Autoregressive (AR) and Non-autoregressive (NAR) models are two types of generative models for Neural Machine Translation (NMT). AR models predict tokens in a word-by-word manner and can effectively capture the distribution of real translations. NAR models predict tokens by extracting bidirectional contextual information, which improves inference speed but suffers from performance degradation. Previous works utilized AR models to enhance NAR models by reducing the complexity of the training data, or incorporated global information into AR models by virtue of NAR models. However, these methods only take advantage of the contextual information of a single type of model, neglecting the diversity of contextual information that different types of models can provide. In this paper, we propose a novel generic collaborative learning method, DCMCL, where AR and NAR models are treated as collaborators instead of teachers and students. To hierarchically leverage the bilateral contextual information, token-level mutual learning and sequence-level contrastive learning are adopted between AR and NAR models. Extensive experiments on four widely used benchmarks show that the proposed DCMCL method can simultaneously improve both AR and NAR models, by up to 1.38 and 2.98 BLEU points respectively, and can also outperform the current best unified model by up to 0.97 BLEU points for both AR and NAR decoding.
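To make the token-level mutual learning concrete, here is a hedged PyTorch sketch of a symmetric KL term between the two models' per-token distributions (random logits stand in for real AR/NAR outputs, and the paper's loss weighting and sequence-level contrastive term are omitted):

```python
# Sketch of token-level mutual learning between AR and NAR models:
# each model's per-token distribution is pulled toward the other's.
# Random logits stand in for real model outputs.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 5, 100
ar_logits = torch.randn(batch, seq_len, vocab)
nar_logits = torch.randn(batch, seq_len, vocab)

def mutual_kl(student_logits, teacher_logits):
    # KL(teacher || student), with the teacher distribution detached
    # so each model only learns from the other's (fixed) predictions.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1).detach(),
        reduction="batchmean",
    )

# Symmetric term: AR learns from NAR and vice versa.
loss = mutual_kl(ar_logits, nar_logits) + mutual_kl(nar_logits, ar_logits)
print(loss.item())
```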
https://arxiv.org/abs/2402.18428
Generally, decoder-only large language models (LLMs) are adapted to context-aware neural machine translation (NMT) in a concatenating way: LLMs take the concatenation of the source sentence (i.e., intra-sentence context) and the inter-sentence context as input, and then generate the target tokens sequentially. This adaptation strategy, i.e., the concatenation mode, treats intra-sentence and inter-sentence contexts with the same priority, despite an apparent difference between the two kinds of contexts. In this paper, we propose an alternative adaptation approach, named Decoding-enhanced Multi-phase Prompt Tuning (DeMPT), to make LLMs discriminately model and utilize the inter- and intra-sentence context and more effectively adapt LLMs to context-aware NMT. First, DeMPT divides the context-aware NMT process into three separate phases. During each phase, different continuous prompts are introduced to make LLMs discriminately model different kinds of information. Second, DeMPT employs a heuristic to further enhance the utilization of source-side inter- and intra-sentence information in the final decoding phase. Experiments show that our approach significantly outperforms the concatenation method and further improves the performance of LLMs in discourse modeling.
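The phase-specific continuous prompts can be pictured as learnable embeddings prepended to the input at each phase; a hedged PyTorch sketch (the dimensions and prompt handling are simplified assumptions, not the paper's exact design):

```python
# Sketch of phase-specific continuous prompts: each phase gets its own
# learnable prompt embeddings, prepended to the token embeddings before
# the (frozen) LLM processes them. Dimensions are toy assumptions.
import torch
import torch.nn as nn

d_model, prompt_len, n_phases = 512, 8, 3
phase_prompts = nn.ParameterList(
    [nn.Parameter(torch.randn(prompt_len, d_model) * 0.02) for _ in range(n_phases)]
)

def with_phase_prompt(token_embeds, phase):
    # token_embeds: (batch, seq_len, d_model)
    batch = token_embeds.size(0)
    prompt = phase_prompts[phase].unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([prompt, token_embeds], dim=1)

x = torch.randn(2, 10, d_model)             # stand-in token embeddings
print(with_phase_prompt(x, phase=0).shape)  # torch.Size([2, 18, 512])
```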
https://arxiv.org/abs/2402.15200
Neural Machine Translation (NMT) continues to improve in quality and adoption, yet the inadvertent perpetuation of gender bias remains a significant concern. Despite numerous studies on gender bias in translations into English from weakly gendered languages, there are no benchmarks for evaluating this phenomenon or for assessing mitigation strategies. To address this gap, we introduce GATE X-E, an extension to the GATE (Rarrick et al., 2023) corpus, consisting of human translations from Turkish, Hungarian, Finnish, and Persian into English. Each translation is accompanied by feminine, masculine, and neutral variants. The dataset, which contains between 1250 and 1850 instances for each of the four language pairs, features natural sentences with a wide range of lengths and domains, challenging translation rewriters on various linguistic phenomena. Additionally, we present a translation gender-rewriting solution built with GPT-4 and use GATE X-E to evaluate it. We open-source our contributions to encourage further research on gender debiasing.
https://arxiv.org/abs/2402.14277
With the rapid advancement of Neural Machine Translation (NMT), enhancing translation efficiency and quality has become a focal point of research. Despite the commendable performance of general models such as the Transformer, they still fall short in processing long sentences and in fully leveraging bidirectional contextual information. This paper introduces an improved Transformer-based model that implements an asynchronous, segmented bidirectional decoding strategy aimed at raising translation efficiency and accuracy. Compared to traditional unidirectional translation, whether left-to-right or right-to-left, our method demonstrates higher efficiency and improved translation quality, particularly in handling long sentences. Experimental results on the IWSLT2017 dataset confirm the effectiveness of our approach in accelerating translation and increasing accuracy, especially in surpassing traditional unidirectional strategies on long-sentence translation. Furthermore, this study analyzes the impact of sentence length on decoding outcomes and explores the model's performance in various scenarios. The findings not only provide an effective decoding strategy for the NMT field but also open new avenues and directions for future studies.
https://arxiv.org/abs/2402.14849
Conventional fair graph clustering methods face two primary challenges: i) they prioritize balanced clusters at the expense of cluster cohesion by imposing rigid constraints; ii) existing methods for both individual- and group-level fairness in graph partitioning mostly rely on eigendecompositions and thus generally lack interpretability. To address these issues, we propose iFairNMTF, an individual-Fairness Nonnegative Matrix Tri-Factorization model with contrastive fairness regularization that achieves balanced and cohesive clusters. By introducing fairness regularization, our model allows for customizable accuracy-fairness trade-offs, thereby enhancing user autonomy without compromising the interpretability provided by nonnegative matrix tri-factorization. Experimental evaluations on real and synthetic datasets demonstrate the superior flexibility of iFairNMTF in achieving both fairness and clustering performance.
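As a rough illustration of the objective, the following numpy snippet evaluates a tri-factorization reconstruction error plus a fairness penalty (the group-balance penalty is a simple stand-in for the paper's contrastive fairness regularizer):

```python
# Toy evaluation of a tri-factorization objective with a fairness term:
# reconstruction error ||A - S H S^T||_F^2 plus a simple group-balance
# penalty (a stand-in for the paper's contrastive fairness regularizer).
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3
A = rng.random((n, n)); A = (A + A.T) / 2   # symmetric affinity matrix
S = rng.random((n, k))                      # nonnegative cluster memberships
H = rng.random((k, k))                      # cluster-interaction matrix
groups = rng.integers(0, 2, size=n)         # binary sensitive attribute

recon = np.linalg.norm(A - S @ H @ S.T, "fro") ** 2

# Penalize clusters whose average membership differs across groups.
balance = ((S[groups == 0].mean(axis=0) - S[groups == 1].mean(axis=0)) ** 2).sum()

lam = 0.5                                   # accuracy-fairness trade-off knob
print(recon + lam * balance)
```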
https://arxiv.org/abs/2402.10756
Large language models (LLMs) have demonstrated promising potential in various downstream tasks, including machine translation. However, prior work on LLM-based machine translation has mainly focused on better utilizing training data, demonstrations, or pre-defined and universal knowledge to improve performance, with little consideration of the decision-making processes of human translators. In this paper, we incorporate Thinker with the Drift-Diffusion Model (Thinker-DDM) to address this issue. We redefine the drift-diffusion process to emulate human translators' dynamic decision-making under constrained resources. We conduct extensive experiments under high-resource, low-resource, and commonsense translation settings using the WMT22 and CommonMT datasets, in which Thinker-DDM outperforms baselines in the first two scenarios. We also perform additional analysis and evaluation on commonsense translation to illustrate the effectiveness of the proposed method.
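The drift-diffusion process underlying the method can be illustrated with a standard evidence-accumulation simulation (this is the generic DDM, not the paper's redefined version):

```python
# Generic drift-diffusion simulation: evidence accumulates with drift
# plus noise until it crosses a decision boundary, modeling a decision
# made under a time/resource budget. (Illustrates the DDM itself, not
# the paper's Thinker-DDM adaptation.)
import numpy as np

def drift_diffusion(drift=0.3, noise=1.0, boundary=2.0, dt=0.01,
                    max_steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    evidence, t = 0.0, 0
    while abs(evidence) < boundary and t < max_steps:
        evidence += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
        t += 1
    if evidence >= boundary:
        choice = "A"
    elif evidence <= -boundary:
        choice = "B"
    else:
        choice = "timeout"          # resource budget exhausted
    return choice, t * dt           # decision and decision time

print(drift_diffusion())            # e.g. ('A', 1.37)
```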
https://arxiv.org/abs/2402.10699
Motivated by the success of unsupervised neural machine translation (UNMT), we introduce an unsupervised sign language translation and generation network (USLNet), which learns from abundant single-modality (text and video) data without parallel sign language data. USLNet comprises two main components: single-modality reconstruction modules (text and video) that rebuild the input from its noisy version in the same modality, and cross-modality back-translation modules (text-video-text and video-text-video) that reconstruct the input from its noisy version in the other modality using a back-translation procedure. Unlike the single-modality back-translation procedure in text-based UNMT, USLNet faces a cross-modality discrepancy in feature representation, in which the lengths and feature dimensions of text and video sequences do not match. We propose a sliding-window method to address the issue of aligning variable-length text with video sequences. To our knowledge, USLNet is the first unsupervised sign language translation and generation model capable of generating both natural language text and sign language video in a unified manner. Experimental results on the BBC-Oxford Sign Language dataset (BOBSL) and the Open-Domain American Sign Language dataset (OpenASL) show that USLNet achieves competitive results compared to supervised baseline models, indicating its effectiveness in sign language translation and generation.
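The sliding-window idea can be sketched as mapping each text token to a window of video frames; a toy illustration (the proportional window placement is a simplifying assumption, not USLNet's exact scheme):

```python
# Toy sketch of sliding-window alignment between a short text sequence
# and a longer video-frame sequence: each token attends to a window of
# frames centered on its proportional position. (Window sizing here is
# a simplifying assumption, not USLNet's exact scheme.)
def sliding_windows(n_tokens, n_frames, window=8):
    spans = []
    for i in range(n_tokens):
        center = round((i + 0.5) * n_frames / n_tokens)  # proportional position
        start = max(0, center - window // 2)
        end = min(n_frames, start + window)
        spans.append((start, end))
    return spans

# 6 text tokens vs. 48 video frames -> one frame span per token.
for tok, span in enumerate(sliding_windows(6, 48)):
    print(f"token {tok}: frames {span[0]}..{span[1] - 1}")
```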
https://arxiv.org/abs/2402.07726
This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system across four language pairs in the legal domain. It combines automatic evaluation metrics (AEMs) and human evaluation (HE) by professional translators to assess translation ranking, fluency and adequacy. The results indicate that while Google Translate generally outperforms LLMs in AEMs, human evaluators rate LLMs, especially GPT-4, comparably or slightly better in terms of producing contextually adequate and fluent translations. This discrepancy suggests LLMs' potential in handling specialized legal terminology and context, highlighting the importance of human evaluation methods in assessing MT quality. The study underscores the evolving capabilities of LLMs in specialized domains and calls for reevaluation of traditional AEMs to better capture the nuances of LLM-generated translations.
https://arxiv.org/abs/2402.07681
We conducted a detailed analysis of the quality of web-mined corpora for two low-resource languages (yielding three language pairs: English-Sinhala, English-Tamil, and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out intrinsic and extrinsic evaluations on different portions of the ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained on their highest-ranked 25k portion can be on par with models trained on human-curated datasets.
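The ranking step amounts to scoring each sentence pair with a similarity measure and keeping the top portion; a hedged sketch with stand-in embeddings (in practice the scores would come from a cross-lingual encoder):

```python
# Sketch of ranking a web-mined parallel corpus by a similarity measure
# and keeping the top-scoring portion. Embeddings are random stand-ins;
# in practice they would come from a cross-lingual sentence encoder.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

rng = np.random.default_rng(0)
pairs = [(f"src {i}", f"tgt {i}") for i in range(100)]   # toy sentence pairs
src_emb = rng.standard_normal((100, 64))                 # stand-in embeddings
tgt_emb = rng.standard_normal((100, 64))

scores = [cosine(s, t) for s, t in zip(src_emb, tgt_emb)]
ranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)

top_k = 25                               # the paper keeps e.g. the top-25k portion
clean_portion = [pair for _, pair in ranked[:top_k]]
print(clean_portion[:3])
```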
https://arxiv.org/abs/2402.07446
Standard context-aware neural machine translation (NMT) typically relies on parallel document-level data, exploiting both source and target contexts. Concatenation-based approaches in particular, still a strong baseline for document-level NMT, prepend source and/or target context sentences to the sentences to be translated, with model variants that exploit equal amounts of source and target data on each side achieving state-of-the-art results. In this work, we investigate whether target data should be further promoted within standard concatenation-based approaches, as most document-level phenomena rely on information that is present on the target language side. We evaluate novel concatenation-based variants where the target context is prepended to the source language, either in isolation or in combination with the source context. Experimental results in English-Russian and Basque-Spanish show that including target context in the source leads to large improvements on target language phenomena. On source-dependent phenomena, using only target language context in the source achieves parity with state-of-the-art concatenation approaches, or slightly underperforms, whereas combining source and target context on the source side leads to significant gains across the board.
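The variants under study differ only in which context sentences are prepended, and on which side; a minimal sketch of building the source-side input (the `<SEP>` marker, context sizes, and example sentences are illustrative assumptions):

```python
# Sketch of concatenation-based context for document-level NMT: context
# sentences are prepended to the current source with a separator token.
# (The <SEP> marker and example sentences are illustrative assumptions.)
SEP = " <SEP> "

def build_input(source, src_context=(), tgt_context=()):
    # The novel variants prepend *target*-language context to the source
    # side, either alone or combined with source-language context.
    parts = list(tgt_context) + list(src_context) + [source]
    return SEP.join(parts)

print(build_input(
    "She opened the door.",                     # current source sentence
    src_context=["Maria came back home."],      # source-language context
    tgt_context=["Мария вернулась домой."],     # target-language context
))
```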
https://arxiv.org/abs/2402.06342
This study explores four methods of generating paraphrases in Malayalam, utilizing resources available for English paraphrasing and pre-trained Neural Machine Translation (NMT) models. We evaluate the resulting paraphrases using both automated metrics, such as BLEU, METEOR, and cosine similarity, as well as human annotation. Our findings suggest that automated evaluation measures may not be fully appropriate for Malayalam, as they do not consistently align with human judgment. This discrepancy underscores the need for more nuanced paraphrase evaluation approaches especially for highly agglutinative languages.
https://arxiv.org/abs/2401.17827
Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive, in contrast to its sentence-level counterpart. However, due to its longer input length and limited availability of training data, DocNMT often faces the challenge of data sparsity. To overcome this issue, we propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norm of hidden states and training gradients. We conduct comprehensive experiments on three widely-used DocNMT benchmarks. Our empirical results show that our proposed IADA outperforms strong DocNMT baselines as well as several data augmentation approaches, with statistical significance on both sentence-level and document-level BLEU.
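Token importance from hidden-state and gradient norms can be sketched as follows (toy embedding model and stand-in loss; how IADA combines the two norms is assumed here):

```python
# Sketch of estimating token importance from the norms of hidden states
# and their training gradients, the signal that guides IADA's
# augmentation. (Toy embedding "model" and stand-in loss; the exact
# combination of the two norms is an assumption.)
import torch
import torch.nn as nn

vocab, d_model = 50, 16
emb = nn.Embedding(vocab, d_model)
tokens = torch.tensor([[3, 17, 42, 8]])

hidden = emb(tokens)                   # (1, seq_len, d_model)
hidden.retain_grad()                   # keep per-token gradients
loss = hidden.sum()                    # stand-in for the training loss
loss.backward()

state_norm = hidden.norm(dim=-1)       # hidden-state norm per token
grad_norm = hidden.grad.norm(dim=-1)   # gradient norm per token
importance = state_norm * grad_norm    # one plausible combination
print(importance)                      # higher = more important token
```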
https://arxiv.org/abs/2401.15360
The evolution of Neural Machine Translation (NMT) has been significantly influenced by six core challenges (Koehn and Knowles, 2017), which have acted as benchmarks for progress in this field. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models (LLMs): domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search. Our empirical findings indicate that LLMs effectively lessen the reliance on parallel data for major languages in the pretraining phase. Additionally, the LLM-based translation system significantly enhances the translation of long sentences that contain approximately 80 words and shows the capability to translate documents of up to 512 words. However, despite these significant improvements, the challenges of domain mismatch and prediction of rare words persist. While the challenges of word alignment and beam search, specifically associated with NMT, may not apply to LLMs, we identify three new challenges for LLMs in translation tasks: inference efficiency, translation of low-resource languages in the pretraining phase, and human-aligned evaluation. The datasets and models are released at this https URL.
https://arxiv.org/abs/2401.08350
Federated learning (FL) is a promising approach for solving multilingual tasks, potentially enabling clients with their own language-specific data to collaboratively construct a high-quality neural machine translation (NMT) model. However, communication constraints in practical network systems present challenges for exchanging large-scale NMT engines between FL parties. In this paper, we propose a meta-learning-based adaptive parameter selection methodology, MetaSend, that improves the communication efficiency of model transmissions from clients during FL-based multilingual NMT training. Our approach learns a dynamic threshold for filtering parameters prior to transmission without compromising the NMT model quality, based on the tensor deviations of clients between different FL rounds. Through experiments on two NMT datasets with different language distributions, we demonstrate that MetaSend obtains substantial improvements over baselines in translation quality in the presence of a limited communication budget.
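The transmission filtering can be sketched as comparing each parameter tensor's deviation across rounds to a threshold and sending only those that changed enough (a fixed threshold is used below, whereas MetaSend learns it via meta-learning):

```python
# Sketch of deviation-based parameter filtering before transmission:
# only tensors that changed enough since the last FL round are sent.
# (A fixed threshold is used here; MetaSend learns it via meta-learning.)
import torch
import torch.nn as nn

prev = nn.Linear(8, 8).state_dict()                 # last round's weights
curr = {k: v + 0.01 * torch.randn_like(v) for k, v in prev.items()}

def select_for_transmission(prev_state, curr_state, threshold=0.05):
    to_send = {}
    for name, curr_t in curr_state.items():
        deviation = (curr_t - prev_state[name]).norm().item()
        if deviation > threshold:                   # transmit only if it moved enough
            to_send[name] = curr_t
    return to_send

payload = select_for_transmission(prev, curr)
print(f"sending {len(payload)}/{len(curr)} tensors")
```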
https://arxiv.org/abs/2401.07456
Detecting the translation direction of parallel text has applications for machine translation training and evaluation, but also has forensic applications such as resolving plagiarism or forgery allegations. In this work, we explore an unsupervised approach to translation direction detection based on the simple hypothesis that $p(\text{translation}|\text{original})>p(\text{original}|\text{translation})$, motivated by the well-known simplification effect in translationese or machine-translationese. In experiments with massively multilingual machine translation models across 20 translation directions, we confirm the effectiveness of the approach for high-resource language pairs, achieving document-level accuracies of 82-96% for NMT-produced translations, and 60-81% for human translations, depending on the model used. Code and demo are available at this https URL
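The hypothesis reduces to comparing two conditional scores from the same model; a minimal sketch with a stubbed log-probability function (a real scorer would query the MT model):

```python
# Sketch of unsupervised direction detection: score both conditional
# directions and pick the more probable one, following
# p(translation|original) > p(original|translation). The scorer below
# is a stub; a real one would return an MT model's log-probability.
def log_prob(target, source):
    # Stand-in heuristic: pretend translations are slightly more
    # predictable from their originals (the simplification effect).
    return -len(target) * (0.9 if len(target) <= len(source) else 1.0)

def detect_direction(sent_a, sent_b):
    if log_prob(sent_b, sent_a) > log_prob(sent_a, sent_b):
        return "a->b"   # sent_a is the original, sent_b the translation
    return "b->a"

print(detect_direction("Das ist ein einfaches Beispiel.",
                       "This is a simple example."))
```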
https://arxiv.org/abs/2401.06769
The conversion of content from one language to another using a computer system is known as Machine Translation (MT). Various techniques have emerged to ensure effective translations that retain the contextual and lexical interpretation of the source language. End-to-end Neural Machine Translation (NMT) is a popular technique now widely used in real-world MT systems. Massive amounts of parallel data (sentences in one language alongside their translations in another) are required for MT systems; these datasets are crucial for an MT system to learn the linguistic structures and patterns of both languages during the training phase. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Since the corpus has been gathered from various sources, it contains many incorrect translations, and MT systems built on this dataset cannot perform to their full potential. In this paper, we propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency. Two ILs, namely Hindi (HIN) and Odia (ODI), are chosen for the experiment. A baseline NMT system is built for these two ILs, and the effect of different dataset sizes is also investigated. The quality of the translations is evaluated using standard metrics such as BLEU, METEOR, and RIBES. The results show that removing incorrect translations from the dataset improves translation quality. It is also observed that, although the ILs-English and English-ILs systems are trained on the same corpus, ILs-English performs more effectively across all evaluation metrics.
https://arxiv.org/abs/2401.06398
The training paradigm for machine translation has gradually shifted, from learning neural machine translation (NMT) models with extensive parallel corpora to instruction finetuning on pretrained multilingual large language models (LLMs) with high-quality translation pairs. In this paper, we focus on boosting the many-to-many multilingual translation performance of LLMs with an emphasis on zero-shot translation directions. We demonstrate that prompt strategies adopted during instruction finetuning are crucial to zero-shot translation performance and introduce a cross-lingual consistency regularization, XConST, to bridge the representation gap among different languages and improve zero-shot translation performance. XConST is not a new method, but a version of CrossConST (Gao et al., 2023a) adapted for multilingual finetuning on LLMs with translation instructions. Experimental results on ALMA (Xu et al., 2023) and LLaMA-2 (Touvron et al., 2023) show that our approach consistently improves translation performance. Our implementations are available at this https URL.
https://arxiv.org/abs/2401.05861