The conversion of content from one language to another using a computer system is known as Machine Translation (MT). Various techniques have emerged to produce effective translations that retain the contextual and lexical meaning of the source language. End-to-end Neural Machine Translation (NMT) is a popular technique and is now widely used in real-world MT systems. MT systems require massive parallel datasets (sentences in one language alongside their translations in another); these datasets are crucial for learning the linguistic structures and patterns of both languages during training. One such dataset is Samanantar, the largest publicly accessible parallel corpus for Indian languages (ILs). Since the corpus has been gathered from various sources, it contains many incorrect translations, so MT systems built on it cannot perform to their full potential. In this paper, we propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency. Two ILs, Hindi (HIN) and Odia (ODI), are chosen for the experiment. A baseline NMT system is built for these two ILs, and the effect of different dataset sizes is also investigated. Translation quality is evaluated using standard metrics such as BLEU, METEOR, and RIBES. The results show that removing incorrect translations from the dataset improves translation quality. We also observe that, although the ILs-English and English-ILs systems are trained on the same corpus, ILs-English performs better across all evaluation metrics.
https://arxiv.org/abs/2401.06398
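The abstract does not spell out the filtering algorithm, so the sketch below only illustrates one common way to flag likely mistranslations in a parallel corpus: scoring each pair with a cross-lingual sentence encoder (LaBSE) plus a length-ratio check. The model choice and thresholds are assumptions for illustration, not the authors' method.

```python
# Hypothetical sketch: flagging noisy pairs in a parallel corpus such as Samanantar.
# LaBSE and the thresholds below are illustrative assumptions, not the paper's algorithm.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_parallel_corpus(src_sents, tgt_sents, sim_threshold=0.75, max_len_ratio=2.0):
    """Keep only pairs whose cross-lingual similarity and length ratio look plausible."""
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    sims = (src_emb * tgt_emb).sum(axis=1)  # cosine similarity of normalized vectors
    kept = []
    for src, tgt, sim in zip(src_sents, tgt_sents, sims):
        longer = max(len(src.split()), len(tgt.split()))
        shorter = max(1, min(len(src.split()), len(tgt.split())))
        if sim >= sim_threshold and longer / shorter <= max_len_ratio:
            kept.append((src, tgt))
    return kept

pairs = filter_parallel_corpus(["मौसम आज अच्छा है"], ["The weather is nice today"])
```

The same score could also be used to rank pairs and drop the lowest-scoring fraction instead of applying a hard threshold.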
The training paradigm for machine translation has gradually shifted, from learning neural machine translation (NMT) models with extensive parallel corpora to instruction finetuning on pretrained multilingual large language models (LLMs) with high-quality translation pairs. In this paper, we focus on boosting the many-to-many multilingual translation performance of LLMs with an emphasis on zero-shot translation directions. We demonstrate that prompt strategies adopted during instruction finetuning are crucial to zero-shot translation performance and introduce a cross-lingual consistency regularization, XConST, to bridge the representation gap among different languages and improve zero-shot translation performance. XConST is not a new method, but a version of CrossConST (Gao et al., 2023a) adapted for multilingual finetuning on LLMs with translation instructions. Experimental results on ALMA (Xu et al., 2023) and LLaMA-2 (Touvron et al., 2023) show that our approach consistently improves translation performance. Our implementations are available at this https URL.
https://arxiv.org/abs/2401.05861
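The abstract does not give the exact form of the regularizer, but the general shape of a cross-lingual consistency term can be sketched as the usual cross-entropy loss plus a KL term that pulls the model's predictive distributions for the same target, under two different source-side prompts, toward each other. The following is an assumed, schematic PyTorch version, not the released XConST code.

```python
# Schematic sketch of a cross-lingual consistency regularizer (assumed form,
# not necessarily the exact XConST objective).
import torch
import torch.nn.functional as F

def consistency_loss(logits_src, logits_ref, target_ids, alpha=1.0, pad_id=0):
    """logits_src: model logits when the prompt contains the source sentence.
    logits_ref: logits when the prompt instead contains the reference translation.
    Both have shape (batch, seq_len, vocab); target_ids has shape (batch, seq_len)."""
    vocab = logits_src.size(-1)
    ce = F.cross_entropy(logits_src.view(-1, vocab), target_ids.view(-1), ignore_index=pad_id)
    p = F.softmax(logits_ref, dim=-1)
    log_q = F.log_softmax(logits_src, dim=-1)
    kl = F.kl_div(log_q, p, reduction="batchmean")  # KL(p_ref || p_src)
    return ce + alpha * kl
```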
Speech has long been a barrier to effective communication and connection, persisting as a challenge in our increasingly interconnected world. This research paper introduces a transformative solution to this persistent obstacle: an end-to-end speech conversion framework tailored for Hindi-to-English translation, culminating in the synthesis of English audio. By integrating cutting-edge technologies such as XLSR Wav2Vec2 for automatic speech recognition (ASR), mBART for neural machine translation (NMT), and a Text-to-Speech (TTS) synthesis component, this framework offers a unified and seamless approach to cross-lingual communication. We delve into the intricate details of each component, elucidating their individual contributions and exploring the synergies that enable a fluid transition from spoken Hindi to synthesized English audio.
https://arxiv.org/abs/2401.06183
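The translation stage of such a pipeline is straightforward to sketch with Hugging Face Transformers; the checkpoint name and language codes below are assumptions, and the ASR (XLSR Wav2Vec2) and TTS stages are only indicated as comments rather than implemented.

```python
# Sketch of the NMT stage of an ASR -> NMT -> TTS pipeline using mBART-50
# (checkpoint choice is an assumption; ASR would produce `hindi_text`, TTS would consume the output).
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

def translate_hi_to_en(hindi_text: str) -> str:
    tokenizer.src_lang = "hi_IN"
    inputs = tokenizer(hindi_text, return_tensors="pt")
    generated = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"], max_length=128
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

english_text = translate_hi_to_en("आप कैसे हैं?")  # e.g. "How are you?"
```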
Low-resource languages (LRLs) face challenges in supervised neural machine translation due to limited parallel data, prompting research into unsupervised methods. Unsupervised neural machine translation (UNMT) methods, including back-translation, transfer learning, and pivot-based translation, offer practical solutions for LRL translation, but they are hindered by issues like synthetic data noise, language bias, and error propagation, which can potentially be mitigated by Large Language Models (LLMs). LLMs have advanced NMT with in-context learning (ICL) and supervised fine-tuning methods, but insufficient training data results in poor performance in LRLs. We argue that LLMs can mitigate the linguistic noise with auxiliary languages to improve translations in LRLs. In this paper, we propose Probability-driven Meta-graph Prompter (POMP), a novel approach employing a dynamic, sampling-based graph of multiple auxiliary languages to enhance LLMs' translation capabilities for LRLs. POMP involves constructing a directed acyclic meta-graph for each source language, from which we dynamically sample multiple paths to prompt LLMs to mitigate the linguistic noise and improve translations during training. We use the BLEURT metric to evaluate the translations and back-propagate rewards, estimated by scores, to update the probabilities of auxiliary languages in the paths. Our experiments show significant improvements in the translation quality of three LRLs, demonstrating the effectiveness of our approach.
https://arxiv.org/abs/2401.05596
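The sampling-and-update loop described in the abstract can be pictured with a toy sketch: keep a probability per auxiliary language, sample a path, translate with the prompted LLM, score the output with BLEURT, and nudge the probabilities of the sampled languages by the reward. The graph structure, reward scale, and update rule below are illustrative assumptions, not the POMP implementation.

```python
# Toy sketch of sampling auxiliary-language paths and updating their probabilities
# from a reward signal (structure and update rule are illustrative assumptions).
import random

class MetaGraph:
    def __init__(self, aux_langs):
        # one probability per auxiliary language, initially uniform
        self.probs = {lang: 1.0 / len(aux_langs) for lang in aux_langs}

    def sample_path(self, length=2):
        langs, weights = zip(*self.probs.items())
        return random.choices(langs, weights=weights, k=length)

    def update(self, path, reward, lr=0.1):
        # increase the probability of languages on high-reward paths, then renormalize
        for lang in path:
            self.probs[lang] *= (1.0 + lr * reward)
        total = sum(self.probs.values())
        self.probs = {lang: p / total for lang, p in self.probs.items()}

graph = MetaGraph(["spanish", "french", "german", "portuguese"])
path = graph.sample_path()
reward = 0.62            # stand-in for a BLEURT score of the prompted translation
graph.update(path, reward)
```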
Inspired by the increasing interest in leveraging large language models for translation, this paper evaluates the capabilities of large language models (LLMs) represented by ChatGPT in comparison to the mainstream neural machine translation (NMT) engines in translating Chinese diplomatic texts into English. Specifically, we examine the translation quality of ChatGPT and NMT engines as measured by four automated metrics and human evaluation based on an error-typology and six analytic rubrics. Our findings show that automated metrics yield similar results for ChatGPT under different prompts and NMT systems, while human annotators tend to assign noticeably higher scores to ChatGPT when it is provided an example or contextual information about the translation task. Pairwise correlation between automated metrics and dimensions of human evaluation produces weak and non-significant results, suggesting the divergence between the two methods of translation quality assessment. These findings provide valuable insights into the potential of ChatGPT as a capable machine translator, and the influence of prompt engineering on its performance.
https://arxiv.org/abs/2401.05176
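The weak metric–human correlation reported here is a pairwise statistic; as a minimal illustration of that analysis, the snippet below computes a Pearson correlation between one automatic metric and one human rubric. The scores are made up for the example.

```python
# Illustrative correlation between an automatic metric and one human rubric (made-up scores).
from scipy.stats import pearsonr

bleu_scores = [31.2, 28.4, 35.0, 30.1, 27.8]      # per-document automatic scores (hypothetical)
human_adequacy = [4.1, 3.8, 4.5, 4.6, 3.9]        # per-document human ratings (hypothetical)

r, p_value = pearsonr(bleu_scores, human_adequacy)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
```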
Unsupervised Neural Machine Translation (UNMT) focuses on improving NMT results under the assumption there is no human translated parallel data, yet little work has been done so far in highlighting its advantages compared to supervised methods and analyzing its output in aspects other than translation accuracy. We focus on three very diverse languages, French, Gujarati, and Kazakh, and train bilingual NMT models, to and from English, with various levels of supervision, in high- and low- resource setups, measure quality of the NMT output and compare the generated sequences' word order and semantic similarity to source and reference sentences. We also use Layer-wise Relevance Propagation to evaluate the source and target sentences' contribution to the result, expanding the findings of previous works to the UNMT paradigm.
https://arxiv.org/abs/2312.12588
Human translators linger on some words and phrases more than others, and predicting this variation is a step towards explaining the underlying cognitive processes. Using data from the CRITT Translation Process Research Database, we evaluate the extent to which surprisal and attentional features derived from a Neural Machine Translation (NMT) model account for reading and production times of human translators. We find that surprisal and attention are complementary predictors of translation difficulty, and that surprisal derived from a NMT model is the single most successful predictor of production duration. Our analyses draw on data from hundreds of translators operating across 13 language pairs, and represent the most comprehensive investigation of human translation difficulty to date.
https://arxiv.org/abs/2312.11852
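Surprisal here is the negative log-probability the NMT model assigns to each token; a minimal sketch of the quantity, assuming per-token probabilities are already available from the model, is given below.

```python
# Minimal sketch: surprisal of each token under a model's predictive distribution.
import math

def surprisal(token_probs):
    """token_probs: list of p(token_t | context) values taken from an NMT model."""
    return [-math.log2(p) for p in token_probs]

print(surprisal([0.5, 0.1, 0.9]))  # [1.0, 3.32..., 0.15...] bits; rarer tokens are more surprising
```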
The growing popularity of neural machine translation (NMT) and LLMs represented by ChatGPT underscores the need for a deeper understanding of their distinct characteristics and relationships. Such understanding is crucial for language professionals and researchers to make informed decisions and tactful use of these cutting-edge translation technologies, but remains underexplored. This study aims to fill this gap by investigating three key questions: (1) the distinguishability of ChatGPT-generated translations from NMT and human translation (HT), (2) the linguistic characteristics of each translation type, and (3) the degree of resemblance between ChatGPT-produced translations and HT or NMT. To achieve these objectives, we employ statistical testing, machine learning algorithms, and multidimensional analysis (MDA) to analyze Spokesperson's Remarks and their translations. After extracting a wide range of linguistic features, supervised classifiers demonstrate high accuracy in distinguishing the three translation types, whereas unsupervised clustering techniques do not yield satisfactory results. Another major finding is that ChatGPT-produced translations exhibit greater similarity with NMT than HT in most MDA dimensions, which is further corroborated by distance computing and visualization. These novel insights shed light on the interrelationships among the three translation types and have implications for the future advancements of NMT and generative AI.
https://arxiv.org/abs/2312.10750
Knowledge distillation, a technique for model compression and performance enhancement, has gained significant traction in Neural Machine Translation (NMT). However, existing research primarily focuses on empirical applications, and there is a lack of comprehensive understanding of how student model capacity, data complexity, and decoding strategies collectively influence distillation effectiveness. Addressing this gap, our study conducts an in-depth investigation into these factors, particularly focusing on their interplay in word-level and sequence-level distillation within NMT. Through extensive experimentation across datasets like IWSLT13 En$\rightarrow$Fr, IWSLT14 En$\rightarrow$De, and others, we empirically validate hypotheses related to the impact of these factors on knowledge distillation. Our research not only elucidates the significant influence of model capacity, data complexity, and decoding strategies on distillation effectiveness but also introduces a novel, optimized distillation approach. This approach, when applied to the IWSLT14 de$\rightarrow$en translation task, achieves state-of-the-art performance, demonstrating its practical efficacy in advancing the field of NMT.
https://arxiv.org/abs/2312.08585
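For readers who want the word-level versus sequence-level distinction in concrete terms, the sketch below shows the textbook word-level distillation loss (soft teacher targets at every position); sequence-level distillation would instead train the student directly on the teacher's beam-search outputs. This is the standard formulation, not code from the paper.

```python
# Standard word-level knowledge distillation loss for NMT (textbook form, not the paper's code).
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, target_ids, alpha=0.5, T=1.0, pad_id=0):
    """Mix the usual cross-entropy with a KL term towards the teacher's token distributions."""
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), target_ids.view(-1), ignore_index=pad_id)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```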
In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance. We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks. We provide a thorough empirical study and analysis of this method's benefits showing that it achieves consistent improvements relative to the performance trade-off profile of standard static weighting. We analyze under what data regimes this method is applicable and show its improvements empirically in neural machine translation (NMT) and multi-lingual language modeling.
https://arxiv.org/abs/2312.06134
Improving neural machine translation (NMT) systems with prompting has achieved significant progress in recent years. In this work, we focus on how to integrate multi-knowledge, i.e., multiple types of knowledge, into NMT models to enhance performance with prompting. We propose a unified framework that can effectively integrate multiple types of knowledge, including sentences, terminologies/phrases, and translation templates, into NMT models. We utilize multiple types of knowledge as prefix-prompts of the input for the encoder and decoder of NMT models to guide the translation process. The approach requires no changes to the model architecture and effectively adapts to domain-specific translation without retraining. Experiments on English-Chinese and English-German translation demonstrate that our approach significantly outperforms strong baselines, achieving high translation quality and terminology match accuracy.
https://arxiv.org/abs/2312.04807
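One way to picture "multiple types of knowledge as prefix-prompts" is the input construction below; the tag names and separators are assumptions for illustration, since the abstract does not fix a concrete prompt format.

```python
# Illustrative construction of a knowledge-prefixed NMT input (tag format is an assumption).
def build_prompted_source(source, similar_pairs=(), terminology=(), template=None):
    prefix_parts = []
    for src, tgt in similar_pairs:                      # sentence-level knowledge
        prefix_parts.append(f"<sent> {src} ||| {tgt}")
    for term_src, term_tgt in terminology:              # terminology/phrase constraints
        prefix_parts.append(f"<term> {term_src} ||| {term_tgt}")
    if template:                                        # translation template
        prefix_parts.append(f"<tmpl> {template}")
    return " ".join(prefix_parts) + " <src> " + source

print(build_prompted_source(
    "The patient shows acute myocardial infarction.",
    terminology=[("myocardial infarction", "Myokardinfarkt")],
))
```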
Large language models (LLMs) with billions of parameters and pretrained on massive amounts of data are now capable of near or better than state-of-the-art performance in a variety of downstream natural language processing tasks. Neural machine translation (NMT) is one such task that LLMs have been applied to with great success. However, little research has focused on applying LLMs to the more difficult subset of NMT called simultaneous translation (SimulMT), where translation begins before the entire source context is available to the model. In this paper, we address key challenges facing LLMs fine-tuned for SimulMT, validate classical SimulMT concepts and practices in the context of LLMs, explore adapting LLMs that are fine-tuned for NMT to the task of SimulMT, and introduce Simul-LLM, the first open-source fine-tuning and evaluation pipeline development framework for LLMs focused on SimulMT.
https://arxiv.org/abs/2312.04691
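A classical SimulMT concept that such work builds on is the wait-k policy: read k source tokens, then alternate one write per read. The toy schedule below illustrates that policy only; it is not the Simul-LLM pipeline.

```python
# Toy sketch of the classical wait-k read/write schedule used in simultaneous MT
# (illustrative only; not the Simul-LLM framework).
def wait_k_schedule(num_source_tokens, num_target_tokens, k=3):
    """Yield ('READ', i) and ('WRITE', j) actions for a wait-k policy."""
    read, written = 0, 0
    while written < num_target_tokens:
        # read until we are k tokens ahead of what has been written (or the source runs out)
        while read < min(written + k, num_source_tokens):
            read += 1
            yield ("READ", read)
        written += 1
        yield ("WRITE", written)

print(list(wait_k_schedule(num_source_tokens=5, num_target_tokens=5, k=2)))
```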
With the advent of the Transformer architecture, Neural Machine Translation (NMT) results have shown great improvement lately. However, results in low-resource conditions still lag behind in both bilingual and multilingual setups, due to the limited amount of available monolingual and/or parallel data; hence, the need for methods that address data scarcity in an efficient and explainable way is evident. We propose an explainability-based training approach for NMT, applied in Unsupervised and Supervised model training, for translation of three languages of varying resources, French, Gujarati, and Kazakh, to and from English. Our results show that our method can be promising, particularly when training in low-resource conditions, outperforming simple training baselines; though the improvement is marginal, it sets the ground for further exploration of the approach and its parameters, and its extension to other languages.
https://arxiv.org/abs/2312.00214
Recent advances in neural methods have led to substantial improvement in the quality of Neural Machine Translation (NMT) systems. However, these systems frequently produce translations with inaccurate gender (Stanovsky et al., 2019), which can be traced to bias in training data. Saunders and Byrne (2020) tackle this problem with a handcrafted dataset containing balanced gendered profession words. By using this data to fine-tune an existing NMT model, they show that gender bias can be significantly mitigated, albeit at the expense of translation quality due to catastrophic forgetting. They recover some of the lost quality with modified training objectives or additional models at inference. We find, however, that simply supplementing the handcrafted dataset with a random sample from the base model training corpus is enough to significantly reduce the catastrophic forgetting. We also propose a novel domain-adaptation technique that leverages in-domain data created with the counterfactual data generation techniques proposed by Zmigrod et al. (2019) to further improve accuracy on the WinoMT challenge test set without significant loss in translation quality. We show its effectiveness in NMT systems from English into three morphologically rich languages: French, Spanish, and Italian. The relevant dataset and code will be available on GitHub.
https://arxiv.org/abs/2311.16362
Machine translation (MT) for low-resource languages such as Ge'ez, an ancient language that is no longer spoken in daily life, faces challenges such as out-of-vocabulary words, domain mismatches, and lack of sufficient labeled training data. In this work, we explore various methods to improve Ge'ez MT, including transfer-learning from related languages, optimizing shared vocabulary and token segmentation approaches, finetuning large pre-trained models, and using large language models (LLMs) for few-shot translation with fuzzy matches. We develop a multilingual neural machine translation (MNMT) model based on languages relatedness, which brings an average performance improvement of about 4 BLEU compared to standard bilingual models. We also attempt to finetune the NLLB-200 model, one of the most advanced translation models available today, but find that it performs poorly with only 4k training samples for Ge'ez. Furthermore, we experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches, which leverages embedding similarity-based retrieval to find context examples from a parallel corpus. We observe that GPT-3.5 achieves a remarkable BLEU score of 9.2 with no initial knowledge of Ge'ez, but still lower than the MNMT baseline of 15.2. Our work provides insights into the potential and limitations of different approaches for low-resource and ancient language MT.
https://arxiv.org/abs/2311.14530
Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
https://arxiv.org/abs/2311.14465
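The paper ships its own framework; purely as a picture of what DP-SGD training involves (per-example gradient clipping plus Gaussian noise on the aggregated gradient), a minimal Opacus-based setup is sketched below with placeholder model, data, and hyperparameters. It is not DP-NMT itself.

```python
# Illustrative DP-SGD training setup with Opacus (not the DP-NMT framework;
# the model, data, and hyperparameters here are placeholders).
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = torch.nn.Linear(16, 4)                      # stand-in for a translation model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
loader = DataLoader(data, batch_size=8)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.0,      # scale of Gaussian noise added to clipped gradients
    max_grad_norm=1.0,         # per-example gradient clipping bound
)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()           # gradients are clipped per example and noised before the update
```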
Despite the remarkable advancements in machine translation, the current sentence-level paradigm faces challenges when dealing with highly-contextual languages like Japanese. In this paper, we explore how context-awareness can improve the performance of the current Neural Machine Translation (NMT) models for English-Japanese business dialogues translation, and what kind of context provides meaningful information to improve translation. As business dialogue involves complex discourse phenomena but offers scarce training resources, we adapted a pretrained mBART model, finetuning on multi-sentence dialogue data, which allows us to experiment with different contexts. We investigate the impact of larger context sizes and propose novel context tokens encoding extra-sentential information, such as speaker turn and scene type. We make use of Conditional Cross-Mutual Information (CXMI) to explore how much of the context the model uses and generalise CXMI to study the impact of the extra-sentential context. Overall, we find that models leverage both preceding sentences and extra-sentential context (with CXMI increasing with context size) and we provide a more focused analysis on honorifics translation. Regarding translation quality, increased source-side context paired with scene and speaker information improves the model performance compared to previous work and our context-agnostic baselines, measured in BLEU and COMET metrics.
https://arxiv.org/abs/2311.11976
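CXMI measures, roughly, how much extra log-probability the model assigns to the reference when the context is visible. A hedged sketch of the usual estimate from per-sentence log-probabilities follows; the paper's exact implementation and its extra-sentential generalization may differ.

```python
# Hedged sketch of a CXMI estimate from model log-probabilities
# (follows the common definition; details may differ from the paper).
def cxmi(logp_with_context, logp_without_context):
    """Each argument: list of log P(reference_i | ...) over the evaluation set."""
    n = len(logp_with_context)
    return sum(c - b for c, b in zip(logp_with_context, logp_without_context)) / n

print(cxmi([-10.2, -8.7, -12.1], [-11.0, -9.5, -12.0]))  # positive => the context helps
```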
Document-level neural machine translation (DNMT) has shown promising results by incorporating more context information. However, this approach also introduces a length bias problem, whereby DNMT suffers from significant translation quality degradation when decoding documents that are much shorter or longer than the maximum sequence length seen during training. To solve the length bias problem, we propose to improve the DNMT model in its training method, attention mechanism, and decoding strategy. Firstly, we propose to sample the training data dynamically to ensure a more uniform distribution across different sequence lengths. Then, we introduce a length-normalized attention mechanism to aid the model in focusing on target information, mitigating the issue of attention divergence when processing longer sequences. Lastly, we propose a sliding window strategy during decoding that integrates as much context information as possible without exceeding the maximum sequence length. The experimental results indicate that our method can bring significant improvements on several open datasets, and further analysis shows that our method can significantly alleviate the length bias problem.
https://arxiv.org/abs/2311.11601
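The sliding-window decoding strategy can be pictured as splitting a document into overlapping windows that never exceed the training-time maximum length; the sketch below shows only that chunking step, with illustrative window and overlap sizes rather than the paper's settings.

```python
# Illustrative sliding-window chunking for document-level decoding
# (window/overlap values are assumptions, not the paper's settings).
def sliding_windows(sentences, max_sents=8, overlap=2):
    """Split a document into overlapping windows of at most `max_sents` sentences."""
    windows, start = [], 0
    while start < len(sentences):
        windows.append(sentences[start:start + max_sents])
        if start + max_sents >= len(sentences):
            break
        start += max_sents - overlap   # keep `overlap` sentences as carried-over context
    return windows

doc = [f"sentence {i}" for i in range(1, 21)]
for window in sliding_windows(doc):
    print(len(window), window[0], "...", window[-1])
```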
Despite the growing variety of languages supported by existing multilingual neural machine translation (MNMT) models, most of the world's languages are still being left behind. We aim to extend large-scale MNMT models to a new language, allowing for translation between the newly added and all of the already supported languages in a challenging scenario: using only a parallel corpus between the new language and English. Previous approaches, such as continued training on parallel data including the new language, suffer from catastrophic forgetting (i.e., performance on other languages is reduced). Our novel approach Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert, a technique widely used in the computer vision area but not well explored in NLP. More specifically, we construct a pseudo multi-parallel corpus of the new and the original languages by pivoting through English, and imitate the output distribution of the original MNMT model. Extensive experiments show that our approach significantly improves the translation performance between the new and the original languages, without severe catastrophic forgetting. We also demonstrate that our approach is capable of solving the copy and off-target problems, two common issues in current large-scale MNMT models.
https://arxiv.org/abs/2311.08538
Minimum Bayes Risk (MBR) decoding can significantly improve the translation performance of Multilingual Large Language Models (MLLMs). However, MBR decoding is computationally expensive. In this paper, we show how a recently developed Reinforcement Learning (RL) technique, Direct Preference Optimization (DPO), can be used to fine-tune MLLMs so that we get the gains from MBR without the additional computation at inference. Our fine-tuned models show significantly improved performance on multiple NMT test sets compared to base MLLMs without preference optimization. Our method boosts the translation performance of MLLMs using relatively small monolingual fine-tuning sets.
https://arxiv.org/abs/2311.08380
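MBR decoding itself is easy to sketch: sample several candidate translations, score each against all the others with a utility metric, and keep the candidate with the highest average utility. The toy version below uses token overlap as a cheap stand-in for a real utility such as COMET or BLEU; in the setting the abstract describes, DPO fine-tuning would presumably then teach the model to prefer such MBR-style winners so the gain is available without extra inference-time computation.

```python
# Toy MBR decoding sketch: pick the candidate with the highest expected utility
# against the other samples (token-overlap utility is a stand-in for a real metric).
def utility(hyp: str, ref: str) -> float:
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(1, len(h | r))          # Jaccard overlap as a cheap proxy

def mbr_decode(candidates):
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / max(1, len(others))
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

samples = ["the cat sat on the mat", "a cat sat on a mat", "the cat is on the mat"]
print(mbr_decode(samples))
```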