Efficient utilisation of both intra- and extra-textual context remains one of the critical gaps between machine and human translation. Existing research has primarily focused on providing individual, well-defined types of context in translation, such as the surrounding text or discrete external variables like the speaker's gender. This work introduces MTCue, a novel neural machine translation (NMT) framework that interprets all context (including discrete variables) as text. MTCue learns an abstract representation of context, enabling transferability across different data settings and leveraging similar attributes in low-resource scenarios. With a focus on a dialogue domain with access to document and metadata context, we extensively evaluate MTCue in four language pairs in both translation directions. Our framework demonstrates significant improvements in translation quality over a parameter-matched non-contextual baseline, as measured by BLEU (+0.88) and Comet (+1.58). Moreover, MTCue significantly outperforms a "tagging" baseline at translating English text. Analysis reveals that the context encoder of MTCue learns a representation space that organises context based on specific attributes, such as formality, enabling effective zero-shot control. Pre-training on context embeddings also improves MTCue's few-shot performance compared to the "tagging" baseline. Finally, an ablation study conducted on model components and contextual variables further supports the robustness of MTCue for context-based NMT.
有效利用文本内和文本外的上下文仍然是机器翻译与人类翻译之间的关键差距之一。现有研究主要关注在翻译中提供单一、定义明确的上下文类型,例如周围的文本或说话者性别等离散外部变量。这项工作介绍了MTCue,一种将所有上下文(包括离散变量)都解释为文本的新型神经机器翻译(NMT)框架。MTCue学习抽象的上下文表示,从而可以在不同数据设置之间迁移,并在低资源场景中利用相似的属性。我们聚焦于可获取文档和元数据上下文的对话领域,在四个语言对的双向翻译上对MTCue进行了广泛评估。以BLEU(+0.88)和Comet(+1.58)衡量,我们的框架相对于参数量相当的非上下文基线在翻译质量上取得了显著提升。此外,在翻译英语文本时,MTCue显著优于“标签(tagging)”基线。分析表明,MTCue的上下文编码器学习到一个按正式程度等特定属性组织上下文的表示空间,从而实现了有效的零样本控制。与“标签”基线相比,在上下文嵌入上进行预训练也提高了MTCue的少样本性能。最后,针对模型组件和上下文变量的消融研究进一步证明了MTCue在基于上下文的NMT中的鲁棒性。
https://arxiv.org/abs/2305.15904
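Below is a minimal, illustrative sketch (not the authors' code) of MTCue's central idea: document context and discrete metadata are both rendered as plain text and mapped to vectors by a shared encoder, so no per-attribute input machinery is needed. The `embed_text` stand-in and the metadata keys are assumptions for demonstration; MTCue itself uses a trained sentence encoder feeding cross-attention.

```python
# A minimal sketch (not the authors' code) of the core MTCue idea: every piece of
# context -- document sentences and discrete metadata alike -- is rendered as plain
# text and mapped to a vector, so the translation model never needs per-attribute
# one-hot inputs. The embedding function here is a stand-in for a trained encoder.

from typing import Dict, List
import hashlib
import math

def serialize_context(metadata: Dict[str, str], prev_sentences: List[str]) -> List[str]:
    """Turn discrete variables and document context into a flat list of text cues."""
    cues = [f"{key}: {value}" for key, value in metadata.items()]   # e.g. "speaker gender: female"
    cues += prev_sentences                                          # preceding dialogue turns
    return cues

def embed_text(text: str, dim: int = 8) -> List[float]:
    """Toy deterministic 'sentence embedding' so the example runs without a model.
    In practice this would be a pre-trained sentence encoder shared across attributes."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

if __name__ == "__main__":
    metadata = {"formality": "informal", "speaker gender": "female", "genre": "comedy"}
    prev = ["Where did you leave the keys?", "I honestly have no idea."]
    cue_texts = serialize_context(metadata, prev)
    cue_vectors = [embed_text(t) for t in cue_texts]
    # These vectors would be fed to a context encoder / cross-attention block in the NMT model.
    for text, vec in zip(cue_texts, cue_vectors):
        print(f"{text!r} -> {[round(v, 2) for v in vec]}")
```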
While Neural Machine Translation (NMT) represents the leading approach to Machine Translation (MT), the outputs of NMT models still require translation post-editing to rectify errors and enhance quality, particularly under critical settings. In this work, we formalize the task of translation post-editing with Large Language Models (LLMs) and explore the use of GPT-4 to automatically post-edit NMT outputs across several language pairs. Our results demonstrate that GPT-4 is adept at translation post-editing and produces meaningful edits even when the target language is not English. Notably, we achieve state-of-the-art performance on WMT-22 English-Chinese, English-German, Chinese-English and German-English language pairs using GPT-4 based post-editing, as evaluated by state-of-the-art MT quality metrics.
虽然神经机器翻译(NMT)是机器翻译(MT)的主流方法,但NMT模型的输出仍然需要译后编辑来纠正错误、提高质量,在关键场景中尤其如此。在本研究中,我们形式化了使用大型语言模型(LLM)进行译后编辑的任务,并探索使用GPT-4对多个语言对的NMT输出进行自动译后编辑。结果表明,GPT-4擅长译后编辑,即使目标语言不是英语也能产生有意义的编辑。值得注意的是,按最先进的MT质量指标评估,基于GPT-4的译后编辑使我们在WMT-22英语-中文、英语-德语、中文-英语和德语-英语语言对上取得了最先进的性能。
https://arxiv.org/abs/2305.14878
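As a rough illustration of how such LLM-based post-editing can be framed, the sketch below builds a post-editing prompt from the source and the NMT hypothesis. The prompt wording is an assumption rather than the paper's template, and `call_llm` is a placeholder for whatever chat-completion client (e.g. a GPT-4 wrapper) is available.

```python
# Illustrative sketch only: one way to frame NMT post-editing as an LLM prompt, in the
# spirit of the paper above. `call_llm` is a placeholder for the chosen LLM client;
# the prompt wording is an assumption, not the authors' template.

def build_postedit_prompt(source: str, mt_output: str, src_lang: str, tgt_lang: str) -> str:
    return (
        f"You are an expert {src_lang}-to-{tgt_lang} translator.\n"
        f"Source ({src_lang}): {source}\n"
        f"Machine translation ({tgt_lang}): {mt_output}\n"
        "Post-edit the machine translation: fix mistranslations, omissions and "
        "grammatical errors, change as little as possible, and output only the "
        f"corrected {tgt_lang} translation."
    )

def post_edit(source: str, mt_output: str, src_lang: str, tgt_lang: str, call_llm) -> str:
    """call_llm: a function str -> str wrapping the chosen LLM API (e.g. GPT-4)."""
    return call_llm(build_postedit_prompt(source, mt_output, src_lang, tgt_lang)).strip()

if __name__ == "__main__":
    fake_llm = lambda prompt: "Der Vertrag tritt am 1. Januar in Kraft."  # stub for demonstration
    print(post_edit("The contract takes effect on January 1.",
                    "Der Vertrag wird am 1. Januar wirksam sein.",
                    "English", "German", fake_llm))
```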
Traditional neural machine translation (NMT) systems often fail to translate sentences that contain culturally specific information. Most previous NMT methods have incorporated external cultural knowledge during training, which requires fine-tuning on low-frequency items specific to the culture. Recent in-context learning utilizes lightweight prompts to guide large language models (LLMs) to perform machine translation, however, whether such an approach works in terms of injecting culture awareness into machine translation remains unclear. To this end, we introduce a new data curation pipeline to construct a culturally relevant parallel corpus, enriched with annotations of cultural-specific entities. Additionally, we design simple but effective prompting strategies to assist this LLM-based translation. Extensive experiments show that our approaches can largely help incorporate cultural knowledge into LLM-based machine translation, outperforming traditional NMT systems in translating cultural-specific sentences.
传统的神经机器翻译(NMT)系统往往无法翻译包含文化特定信息的句子。以往的大多数NMT方法在训练过程中引入外部文化知识,这需要针对特定文化的低频条目进行微调。最近的上下文学习利用轻量级提示来引导大型语言模型(LLM)进行机器翻译,但这种方法能否将文化意识注入机器翻译仍不清楚。为此,我们引入了一个新的数据构建流程,用于构建富含文化特定实体标注的文化相关平行语料库。此外,我们还设计了简单而有效的提示策略来辅助这种基于LLM的翻译。大量实验表明,我们的方法能在很大程度上帮助将文化知识融入基于LLM的机器翻译,在翻译文化特定句子方面优于传统NMT系统。
https://arxiv.org/abs/2305.14328
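The sketch below gives one possible shape for such a prompting strategy: the prompt is enriched with short glosses for culture-specific entities detected in the source sentence. The entity dictionary, gloss wording and prompt format are illustrative assumptions, not the paper's curated annotations.

```python
# A hedged sketch of the kind of prompting strategy described above: the prompt is
# augmented with short glosses for culture-specific entities found in the source.
# The entity dictionary and prompt format are illustrative assumptions.

from typing import Dict

def build_culture_prompt(source: str, entity_glosses: Dict[str, str], tgt_lang: str) -> str:
    notes = [f'- "{ent}": {gloss}' for ent, gloss in entity_glosses.items() if ent in source]
    note_block = "\n".join(notes) if notes else "- (none)"
    return (
        f"Translate the sentence into {tgt_lang}.\n"
        "Cultural notes for entities in the sentence:\n"
        f"{note_block}\n"
        f"Sentence: {source}\n"
        f"{tgt_lang} translation:"
    )

if __name__ == "__main__":
    glosses = {"hongbao": "a red envelope containing a monetary gift, given at festivals"}
    print(build_culture_prompt("She handed each child a hongbao after dinner.", glosses, "German"))
```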
Using a shared vocabulary is common practice in Multilingual Neural Machine Translation (MNMT). In addition to its simple design, shared tokens play an important role in positive knowledge transfer, which manifests naturally when the shared tokens refer to similar meanings across languages. However, natural flaws exist in such a design as well: 1) when languages use different writing systems, transfer is inhibited, and 2) even if languages use similar writing systems, shared tokens may have completely different meanings in different languages, increasing ambiguity. In this paper, we propose a re-parameterized method for building embeddings to alleviate the first problem. More specifically, we define word-level information transfer pathways via word equivalence classes and rely on graph networks to fuse word embeddings across languages. Our experiments demonstrate the advantages of our approach: 1) the semantics of embeddings are better aligned across languages, 2) our method achieves significant BLEU improvements on high- and low-resource MNMT, and 3) only less than 1.0\% additional trainable parameters are required with a limited increase in computational costs.
在多语言神经机器翻译(MNMT)中,使用共享词表是常见做法。除了设计简单之外,共享词元在正向知识迁移中也发挥着重要作用,当共享词元在不同语言中指代相似含义时,这种迁移会自然地显现。然而,这种设计也存在固有缺陷:1)当语言使用不同的书写系统时,迁移会受到抑制;2)即使语言使用相似的书写系统,共享词元也可能在不同语言中具有完全不同的含义,从而增加歧义。在本文中,我们提出了一种重新参数化的嵌入构建方法来缓解第一个问题。更具体地说,我们通过词等价类定义词级别的信息迁移路径,并依靠图网络在不同语言之间融合词嵌入。实验证明了我们方法的优势:1)嵌入的语义在不同语言之间对齐得更好;2)我们的方法在高资源和低资源MNMT上都实现了显著的BLEU提升;3)所需的额外可训练参数不到1.0\%,计算成本仅有有限增加。
https://arxiv.org/abs/2305.14189
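A much-simplified sketch of the fusion step is shown below: tokens linked to the same word equivalence class exchange information through the class centroid, which is roughly one message-passing step of a graph network over the token-class graph. The class labels, mixing weight `alpha` and random embeddings are illustrative assumptions.

```python
# A simplified sketch (not the paper's model) of fusing embeddings through word
# equivalence classes: every token connected to a class receives a mixture of its own
# embedding and the class average, i.e. roughly one message-passing step over the
# token-class graph. Class names and the mixing weight are illustrative.

import numpy as np

def fuse_embeddings(emb: dict, classes: dict, alpha: float = 0.5) -> dict:
    """emb: token -> vector; classes: class id -> list of member tokens."""
    fused = {tok: vec.copy() for tok, vec in emb.items()}
    for members in classes.values():
        members = [m for m in members if m in emb]
        if len(members) < 2:
            continue
        centroid = np.mean([emb[m] for m in members], axis=0)   # class-level message
        for m in members:
            fused[m] = (1 - alpha) * emb[m] + alpha * centroid  # mix own and shared info
    return fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = {t: rng.normal(size=4) for t in ["en:cat", "de:katze", "fr:chat", "de:haus"]}
    classes = {"CAT": ["en:cat", "de:katze", "fr:chat"]}
    fused = fuse_embeddings(emb, classes)
    print(np.round(fused["en:cat"], 2), np.round(fused["de:katze"], 2))
```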
Neural machine translation (NMT) methods developed for natural language processing have been shown to be highly successful in automating translation from one natural language to another. Recently, these NMT methods have been adapted to the generation of program code. In NMT for code generation, the task is to generate output source code that satisfies constraints expressed in the input. In the literature, a variety of different input scenarios have been explored, including generating code based on natural language description, lower-level representations such as binary or assembly (neural decompilation), partial representations of source code (code completion and repair), and source code in another language (code translation). In this paper we survey the NMT for code generation literature, cataloging the variety of methods that have been explored according to input and output representations, model architectures, optimization techniques used, data sets, and evaluation methods. We discuss the limitations of existing methods and future research directions.
为自然语言处理开发的神经机器翻译(NMT)方法,已被证明能够非常成功地实现从一种自然语言到另一种自然语言的自动翻译。最近,这些NMT方法被应用于程序代码的生成。在面向代码生成的NMT中,任务是生成满足输入所表达约束的源代码。文献中已经探索了多种不同的输入场景,包括基于自然语言描述生成代码、基于二进制或汇编等较低层次表示生成代码(神经反编译)、基于源代码的部分表示生成代码(代码补全与修复),以及基于另一种语言的源代码生成代码(代码翻译)。在本文中,我们综述了面向代码生成的NMT文献,按照输入输出表示、模型架构、所用优化技术、数据集和评估方法对已有方法进行了分类整理。我们还讨论了现有方法的局限性和未来的研究方向。
https://arxiv.org/abs/2305.13504
Nearest Neighbor Machine Translation ($k$NN-MT) has achieved great success on domain adaptation tasks by integrating pre-trained Neural Machine Translation (NMT) models with domain-specific token-level retrieval. However, the reasons underlying its success have not been thoroughly investigated. In this paper, we provide a comprehensive analysis of $k$NN-MT through theoretical and empirical studies. Initially, we offer a theoretical interpretation of the working mechanism of $k$NN-MT as an efficient technique to implicitly execute gradient descent on the output projection layer of NMT, indicating that it is a specific case of model fine-tuning. Subsequently, we conduct multi-domain experiments and word-level analysis to examine the differences in performance between $k$NN-MT and entire-model fine-tuning. Our findings suggest that: (1) Incorporating $k$NN-MT with adapters yields comparable translation performance to fine-tuning on in-domain test sets, while achieving better performance on out-of-domain test sets; (2) Fine-tuning significantly outperforms $k$NN-MT on the recall of low-frequency domain-specific words, but this gap could be bridged by optimizing the context representations with additional adapter layers.
最近邻机器翻译($k$NN-MT)通过将预训练的神经机器翻译(NMT)模型与特定领域的词元级检索相结合,在领域自适应任务上取得了巨大成功。然而,其成功背后的原因尚未得到充分研究。在本文中,我们通过理论和实证研究对$k$NN-MT进行了全面分析。首先,我们对$k$NN-MT的工作机制给出了理论解释:它是一种在NMT输出投影层上隐式执行梯度下降的高效技术,表明它是模型微调的一个特例。随后,我们进行了多领域实验和词级分析,以考察$k$NN-MT与整体模型微调在性能上的差异。我们的发现表明:(1) 将$k$NN-MT与适配器结合,在域内测试集上可取得与微调相当的翻译性能,而在域外测试集上表现更好;(2) 微调在低频领域特定词的召回率上显著优于$k$NN-MT,但通过增加适配器层来优化上下文表示可以弥合这一差距。
https://arxiv.org/abs/2305.13034
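For readers unfamiliar with the retrieval mechanism, the toy sketch below shows the standard kNN-MT interpolation: a datastore of decoder states paired with target tokens is queried at each step, and the retrieval distribution is mixed with the base model's distribution. The datastore contents, temperature and interpolation weight are made up for illustration and are not taken from this paper.

```python
# A toy sketch of the kNN-MT interpolation described above (not the paper's code):
# at each decoding step, the decoder's hidden state queries a datastore of
# (context representation -> target token) pairs, and the retrieval distribution is
# interpolated with the base NMT distribution.

import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=4, temperature=10.0):
    d = np.linalg.norm(keys - query, axis=1)            # L2 distance to every datastore key
    idx = np.argsort(d)[:k]                             # k nearest neighbours
    weights = np.exp(-d[idx] / temperature)
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for token, w in zip(values[idx], weights):          # scatter weights onto their tokens
        p[token] += w
    return p

def interpolate(p_nmt, p_knn, lam=0.4):
    return lam * p_knn + (1 - lam) * p_nmt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, dim = 6, 8
    keys = rng.normal(size=(50, dim))                   # stored decoder states
    values = rng.integers(0, vocab, size=50)            # target tokens observed with them
    query = keys[3] + 0.01 * rng.normal(size=dim)       # current decoder state
    p_nmt = np.full(vocab, 1.0 / vocab)                 # stand-in base model distribution
    p = interpolate(p_nmt, knn_distribution(query, keys, values, vocab))
    print(np.round(p, 3), "argmax:", int(p.argmax()))
```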
Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora for all language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to the problem of encoded tokens tending to appear only in a small subspace of the full space available to the MNMT model. To solve these two issues, we propose Bi-ACL, a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective both in long-tail languages and in high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios.
尽管多语言神经机器翻译(MNMT)取得了进展,但我们认为该领域仍面临两大挑战:数据不平衡和表示退化。数据不平衡问题指所有语言对的平行语料数量不均衡,对长尾语言(即资源极少的语言)尤其如此。表示退化问题指编码后的词元往往只出现在MNMT模型可用的整个空间中的一个小子空间内。为了解决这两个问题,我们提出了Bi-ACL,一个仅使用目标端单语数据和双语词典来提升MNMT模型性能的框架。我们定义了两个模块,即双向自编码器和双向对比学习,并将它们与在线受限束搜索和课程学习采样策略相结合。大量实验表明,我们提出的方法在长尾语言和高资源语言上都更为有效。我们还证明,我们的方法能够在零样本场景下实现跨领域和跨语言的知识迁移。
https://arxiv.org/abs/2305.12786
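The sketch below shows a generic bidirectional contrastive objective of the kind the bidirectional contrastive learning module builds on: dictionary-paired items act as positives, other in-batch items as negatives, and an InfoNCE-style loss is computed in both directions. The embeddings, pairing and temperature are illustrative; the paper's exact formulation may differ.

```python
# A generic sketch of a bidirectional contrastive loss (details differ from Bi-ACL):
# row i of `src` is dictionary-paired with row i of `tgt`; all other rows in the batch
# serve as negatives, and the InfoNCE loss is averaged over both directions.

import numpy as np

def info_nce(src, tgt, temperature=0.1):
    """src, tgt: (n, d) L2-normalised embeddings where row i of src pairs with row i of tgt."""
    sim = src @ tgt.T / temperature                       # similarities (rows are queries)
    def ce(logits):                                       # cross-entropy with the diagonal as labels
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return 0.5 * (ce(sim) + ce(sim.T))                    # source->target and target->source

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 16
    src = rng.normal(size=(n, d)); src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt = src + 0.05 * rng.normal(size=(n, d)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    print("bidirectional contrastive loss:", round(float(info_nce(src, tgt)), 4))
```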
Federated Multilingual Neural Machine Translation (Fed-MNMT) has emerged as a promising paradigm for institutions with limited language resources. This approach allows multiple institutions to act as clients and train a unified model through model synchronization, rather than collecting sensitive data for centralized training. This significantly reduces the cost of corpus collection and preserves data privacy. However, as pre-trained language models (PLMs) continue to increase in size, the communication cost for transmitting parameters during synchronization has become a training speed bottleneck. In this paper, we propose a communication-efficient Fed-MNMT framework that addresses this issue by keeping PLMs frozen and only transferring lightweight adapter modules between clients. Since different language pairs exhibit substantial discrepancies in data distributions, adapter parameters of clients may conflict with each other. To tackle this, we explore various clustering strategies to group parameters for integration and mitigate the negative effects of conflicting parameters. Experimental results demonstrate that our framework reduces communication cost by over 98% while achieving similar or even better performance compared to competitive baselines. Further analysis reveals that clustering strategies effectively solve the problem of linguistic discrepancy and pruning adapter modules further improves communication efficiency.
联邦多语言神经机器翻译(Fed-MNMT)已成为语言资源有限的机构的一个有前景的范式。这种方法允许多个机构作为客户端,通过模型同步而非收集敏感数据进行集中训练来共同训练一个统一模型,从而显著降低语料收集成本并保护数据隐私。然而,随着预训练语言模型(PLM)规模不断增大,同步期间传输参数的通信成本已成为训练速度的瓶颈。在本文中,我们提出了一个通信高效的Fed-MNMT框架,通过冻结PLM、仅在客户端之间传输轻量级适配器模块来解决这一问题。由于不同语言对的数据分布存在显著差异,各客户端的适配器参数可能相互冲突。为此,我们探索了多种聚类策略来对参数进行分组整合,以减轻冲突参数的负面影响。实验结果表明,我们的框架将通信成本降低了98%以上,同时取得了与有竞争力的基线相当甚至更好的性能。进一步的分析表明,聚类策略有效解决了语言差异问题,而对适配器模块进行剪枝可进一步提高通信效率。
https://arxiv.org/abs/2305.12449
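The communication pattern can be pictured with the schematic sketch below (not the authors' implementation): clients upload only small adapter parameter dictionaries, which the server groups by a clustering of clients and averages within each group; here a fixed language-family label stands in for the learned clustering strategies explored in the paper.

```python
# A schematic sketch of the communication pattern described above: the frozen PLM never
# leaves the clients; only small adapter parameter dictionaries are uploaded, grouped
# by a clustering of clients, and averaged within each group. The cluster assignment
# here is a hand-picked stand-in for a learned clustering.

from collections import defaultdict
from typing import Dict, List

Params = Dict[str, List[float]]

def average_adapters(adapters: List[Params]) -> Params:
    keys = adapters[0].keys()
    return {k: [sum(a[k][i] for a in adapters) / len(adapters)
                for i in range(len(adapters[0][k]))] for k in keys}

def federated_round(client_adapters: Dict[str, Params], clusters: Dict[str, str]) -> Dict[str, Params]:
    """Return one aggregated adapter per cluster; clients then download only their cluster's adapter."""
    grouped = defaultdict(list)
    for client, params in client_adapters.items():
        grouped[clusters[client]].append(params)
    return {cluster: average_adapters(members) for cluster, members in grouped.items()}

if __name__ == "__main__":
    uploads = {"en-de": {"adapter.w": [0.1, 0.2]}, "en-nl": {"adapter.w": [0.3, 0.0]},
               "en-zh": {"adapter.w": [0.9, 0.8]}}
    clusters = {"en-de": "germanic", "en-nl": "germanic", "en-zh": "sinitic"}
    print(federated_round(uploads, clusters))
```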
Like many other machine learning applications, neural machine translation (NMT) benefits from over-parameterized deep neural models. However, these models have been observed to be brittle: NMT model predictions are sensitive to small input changes and can show significant variation across re-training or incremental model updates. This work studies a frequently used method in NMT, pseudo-label training (PLT), which is common to the related techniques of forward-translation (or self-training) and sequence-level knowledge distillation. While the effect of PLT on quality is well-documented, we highlight a lesser-known effect: PLT can enhance a model's stability to model updates and input perturbations, a set of properties we call model inertia. We study inertia effects under different training settings and we identify distribution simplification as a mechanism behind the observed results.
与许多其他机器学习应用一样,神经机器翻译(NMT)受益于过参数化的深度神经模型。然而,这些模型被观察到是脆弱的:NMT模型的预测对微小的输入变化非常敏感,并且在重新训练或增量模型更新之间会出现显著差异。本工作研究了NMT中常用的方法——伪标签训练(PLT),它是前向翻译(或自训练)和序列级知识蒸馏等相关技术的共同基础。虽然PLT对质量的影响已有充分记录,但我们强调了一个鲜为人知的影响:PLT可以增强模型对模型更新和输入扰动的稳定性,我们将这组特性称为模型惯性。我们在不同训练设置下研究了惯性效应,并指出分布简化是所观察到结果背后的一种机制。
https://arxiv.org/abs/2305.11808
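A bare-bones sketch of pseudo-label training is given below to make the data flow concrete: monolingual source text is labelled by the current (teacher) model and the student is trained on the resulting pseudo-parallel pairs. `teacher_translate` and `train_step` are placeholders for a real NMT model and optimiser.

```python
# A bare-bones sketch of pseudo-label training (forward-translation / self-training)
# as discussed above. `teacher_translate` and `train_step` are placeholders; the point
# is only the data flow: monolingual source text is labelled by the current model and
# the student is trained on the resulting pairs.

from typing import Callable, Iterable, List, Tuple

def make_pseudo_parallel(monolingual_src: Iterable[str],
                         teacher_translate: Callable[[str], str]) -> List[Tuple[str, str]]:
    return [(s, teacher_translate(s)) for s in monolingual_src]

def pseudo_label_training(monolingual_src, teacher_translate, train_step, epochs: int = 1):
    pseudo = make_pseudo_parallel(monolingual_src, teacher_translate)
    for _ in range(epochs):
        for src, pseudo_tgt in pseudo:      # pseudo targets are typically simpler / less varied,
            train_step(src, pseudo_tgt)     # which relates to the "distribution simplification" effect

if __name__ == "__main__":
    toy_teacher = lambda s: s.upper()                      # stand-in "translation"
    logged = []
    pseudo_label_training(["good morning", "see you soon"], toy_teacher,
                          lambda s, t: logged.append((s, t)))
    print(logged)
```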
Our proposed method, ReSeTOX (REdo SEarch if TOXic), addresses the issue of Neural Machine Translation (NMT) generating translation outputs that contain toxic words not present in the input. The objective is to mitigate the introduction of toxic language without the need for re-training. In the case of identified added toxicity during the inference process, ReSeTOX dynamically adjusts the key-value self-attention weights and re-evaluates the beam search hypotheses. Experimental results demonstrate that ReSeTOX achieves a remarkable 57% reduction in added toxicity while maintaining an average translation quality of 99.5% across 164 languages.
我们提出的方法ReSeTOX(REdo SEarch if TOXic)旨在解决神经机器翻译(NMT)生成的译文中含有输入中不存在的有毒词汇的问题,目标是在无需重新训练的情况下减少有毒语言的引入。当在推理过程中检测到新增毒性时,ReSeTOX会动态调整键值自注意力权重并重新评估束搜索假设。实验结果表明,ReSeTOX将新增毒性显著降低了57%,同时在164种语言上保持了99.5%的平均翻译质量。
https://arxiv.org/abs/2305.11761
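The control flow can be sketched as below. This is not the actual ReSeTOX update: a generic `decode(attention_scale)` placeholder stands in for beam search, and shrinking `attention_scale` stands in for the paper's adjustment of the key-value self-attention weights before the search is re-run.

```python
# A control-flow sketch of the "redo search if toxic" idea above, not the actual
# ReSeTOX update: `decode(attention_scale)` is a placeholder for beam search, and
# lowering the scale stands in for adjusting key-value self-attention weights.

from typing import Callable, Set

def contains_added_toxicity(source: str, hypothesis: str, toxic_words: Set[str]) -> bool:
    src, hyp = source.lower().split(), hypothesis.lower().split()
    return any(w in hyp and w not in src for w in toxic_words)

def resetox_decode(source: str, decode: Callable[[float], str], toxic_words: Set[str],
                   max_retries: int = 3) -> str:
    scale = 1.0
    hyp = decode(scale)
    for _ in range(max_retries):
        if not contains_added_toxicity(source, hyp, toxic_words):
            break
        scale *= 0.8                      # dampen the attention weights and search again
        hyp = decode(scale)
    return hyp

if __name__ == "__main__":
    outputs = {1.0: "you stupid fool , come here", 0.8: "you silly one , come here"}
    fake_decode = lambda scale: outputs.get(round(scale, 1), "come here please")
    print(resetox_decode("come here please", fake_decode, {"stupid", "fool"}))
```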
The language-independency of encoded representations within multilingual neural machine translation (MNMT) models is crucial for their generalization ability on zero-shot translation. Neural interlingua representations have been shown as an effective method for achieving this. However, fixed-length neural interlingua representations introduced in previous work can limit its flexibility and representation ability. In this study, we introduce a novel method to enhance neural interlingua representations by making their length variable, thereby overcoming the constraint of fixed-length neural interlingua representations. Our empirical results on zero-shot translation on OPUS, IWSLT, and Europarl datasets demonstrate stable model convergence and superior zero-shot translation results compared to fixed-length neural interlingua representations. However, our analysis reveals the suboptimal efficacy of our approach in translating from certain source languages, wherein we pinpoint the defective model component in our proposed method.
多语言神经机器翻译(MNMT)模型中编码表示的语言无关性对其零样本翻译的泛化能力至关重要。神经中间语(interlingua)表示已被证明是实现这一目标的有效方法。然而,先前工作中引入的固定长度神经中间语表示会限制其灵活性和表示能力。在本研究中,我们提出了一种新方法,通过使神经中间语表示的长度可变来增强它,从而克服固定长度神经中间语表示的限制。在OPUS、IWSLT和Europarl数据集上的零样本翻译实验结果表明,与固定长度神经中间语表示相比,我们的方法具有稳定的模型收敛性和更优的零样本翻译效果。然而,我们的分析也揭示了该方法在从某些源语言翻译时效果次优,并指出了所提方法中存在缺陷的模型组件。
https://arxiv.org/abs/2305.10190
Previous studies show that intermediate supervision signals benefit various Natural Language Processing tasks. However, it is not clear whether there exist intermediate signals that benefit Neural Machine Translation (NMT). Borrowing techniques from Statistical Machine Translation, we propose intermediate signals which are intermediate sequences from the "source-like" structure to the "target-like" structure. Such intermediate sequences introduce an inductive bias that reflects a domain-agnostic principle of translation, which reduces spurious correlations that are harmful to out-of-domain generalisation. Furthermore, we introduce a full-permutation multi-task learning to alleviate the spurious causal relations from intermediate sequences to the target, which results from exposure bias. The Minimum Bayes Risk decoding algorithm is used to pick the best candidate translation from all permutations to further improve the performance. Experiments show that the introduced intermediate signals can effectively improve the domain robustness of NMT and reduce the amount of hallucinations on out-of-domain translation. Further analysis shows that our methods are especially promising in low-resource scenarios.
先前的研究表明,中间监督信号对多种自然语言处理任务有益。然而,是否存在有益于神经机器翻译(NMT)的中间信号尚不明确。借鉴统计机器翻译的技术,我们提出了中间信号,即从“类源”结构到“类目标”结构的中间序列。这些中间序列引入了一种归纳偏置,反映了与领域无关的翻译原则,从而减少了对域外泛化有害的伪相关。此外,我们引入了全排列多任务学习,以缓解由暴露偏差导致的从中间序列到目标之间的伪因果关系。我们使用最小贝叶斯风险解码算法从所有排列中选出最佳候选译文,以进一步提升性能。实验表明,引入的中间信号能有效提高NMT的领域鲁棒性,并减少域外翻译中的幻觉。进一步的分析表明,我们的方法在低资源场景下尤其有前景。
https://arxiv.org/abs/2305.09154
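Since Minimum Bayes Risk decoding does the final candidate selection, the small sketch below shows the generic procedure: each candidate is scored by its average utility against the other candidates and the one with the highest expected utility is returned. A token-level F1 stands in here for whatever utility metric the paper actually uses.

```python
# A small sketch of Minimum Bayes Risk selection: each candidate is scored by its
# average utility against all other candidates (treated as samples from the model),
# and the candidate with the highest expected utility wins. Token-level F1 is an
# illustrative stand-in for the real utility metric.

from collections import Counter
from typing import List

def token_f1(hyp: str, ref: str) -> float:
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def mbr_select(candidates: List[str]) -> str:
    def expected_utility(c: str) -> float:
        others = [o for o in candidates if o is not c]
        return sum(token_f1(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

if __name__ == "__main__":
    cands = ["the cat sat on the mat", "a cat sat on the mat", "the dog ran away"]
    print(mbr_select(cands))   # a consensus-like candidate is preferred
```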
Most of the speech translation models heavily rely on parallel data, which is hard to collect especially for low-resource languages. To tackle this issue, we propose to build a cascaded speech translation system without leveraging any kind of paired data. We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS. The results show that our work is comparable with some other early supervised methods in some language pairs. While cascaded systems always suffer from severe error propagation problems, we proposed denoising back-translation (DBT), a novel approach to building robust unsupervised neural machine translation (UNMT). DBT successfully increases the BLEU score by 0.7--0.9 in all three translation directions. Moreover, we simplified the pipeline of our cascaded system to reduce inference latency and conducted a comprehensive analysis of every part of our work. We also demonstrate our unsupervised speech translation results on the established website.
大多数语音翻译模型严重依赖平行数据,而这类数据很难收集,对低资源语言尤其如此。为了解决这一问题,我们提出在不利用任何配对数据的情况下构建级联语音翻译系统。我们使用完全不配对的数据训练无监督系统,并在CoVoST 2和CVSS上评估结果。结果显示,在某些语言对上,我们的工作可以与一些早期的有监督方法相媲美。鉴于级联系统总是存在严重的错误传播问题,我们提出了去噪反向翻译(DBT),这是一种构建稳健的无监督神经机器翻译(UNMT)的新方法。DBT在全部三个翻译方向上将BLEU分数提高了0.7至0.9。此外,我们简化了级联系统的流程以降低推理延迟,并对工作的每个部分进行了全面分析。我们还在既有网站上展示了我们的无监督语音翻译结果。
https://arxiv.org/abs/2305.07455
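A generic sketch of back-translation with input noising, in the spirit of DBT, is shown below; the paper's exact recipe differs. Clean target-language sentences are noised, back-translated into the source language by the current backward model, and the synthetic pairs train the forward model with the clean sentence as reference. The noise function and model stubs are illustrative.

```python
# A generic sketch of back-translation with noising, in the spirit of the DBT idea
# above (the paper's exact recipe differs): clean target-language sentences are noised,
# translated into the source language by the current backward model, and the resulting
# synthetic pairs are used to train the forward model against the clean reference.

import random
from typing import Callable, List, Tuple

def drop_word_noise(sentence: str, p_drop: float = 0.2, seed: int = 0) -> str:
    rng = random.Random(seed)
    kept = [w for w in sentence.split() if rng.random() > p_drop]
    return " ".join(kept) if kept else sentence

def denoising_back_translation(tgt_monolingual: List[str],
                               backward_translate: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (synthetic_source, clean_target) training pairs for the forward model."""
    pairs = []
    for tgt in tgt_monolingual:
        noised = drop_word_noise(tgt)                 # corrupt before back-translating
        synthetic_src = backward_translate(noised)    # tgt -> src with the current backward model
        pairs.append((synthetic_src, tgt))            # forward model learns to recover the clean target
    return pairs

if __name__ == "__main__":
    toy_backward = lambda s: "[src] " + s             # stand-in backward translator
    print(denoising_back_translation(["wir sehen uns morgen früh"], toy_backward))
```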
The multilingual neural machine translation (NMT) model has a promising capability of zero-shot translation, where it could directly translate between language pairs unseen during training. For good transfer performance from supervised directions to zero-shot directions, the multilingual NMT model is expected to learn universal representations across different languages. This paper introduces a cross-lingual consistency regularization, CrossConST, to bridge the representation gap among different languages and boost zero-shot translation performance. The theoretical analysis shows that CrossConST implicitly maximizes the probability distribution for zero-shot translation, and the experimental results on both low-resource and high-resource benchmarks show that CrossConST consistently improves the translation performance. The experimental analysis also proves that CrossConST could close the sentence representation gap and better align the representation space. Given the universality and simplicity of CrossConST, we believe it can serve as a strong baseline for future multilingual NMT research.
多语言神经机器翻译(NMT)模型具有可观的零样本翻译能力,能够直接在训练期间未见过的语言对之间进行翻译。为了实现从有监督方向到零样本方向的良好迁移性能,多语言NMT模型需要学习跨语言的通用表示。本文提出了一种跨语言一致性正则化方法CrossConST,用于弥合不同语言之间的表示差距并提升零样本翻译性能。理论分析表明,CrossConST隐式地最大化了零样本翻译的概率分布;在低资源和高资源基准上的实验结果表明,CrossConST能持续提升翻译性能。实验分析还证明,CrossConST可以缩小句子表示差距并更好地对齐表示空间。鉴于CrossConST的通用性和简单性,我们相信它可以作为未来多语言NMT研究的强基线。
https://arxiv.org/abs/2305.07310
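The general shape of such a consistency regulariser can be written down compactly, as in the sketch below: the usual translation cross-entropy is combined with a KL term that pulls the output distribution computed from one input view towards the distribution computed from another. The exact objective, views and weighting in CrossConST may differ; the values here are illustrative.

```python
# A generic sketch of a cross-lingual consistency regulariser of the kind described
# above (the paper's exact objective may differ): cross-entropy plus a KL term that
# pulls the output distribution for one input view (e.g. the source sentence) towards
# the distribution for another view (e.g. the target-language sentence).

import numpy as np

def kl(p, q, eps=1e-9):
    p, q = np.clip(p, eps, 1), np.clip(q, eps, 1)
    return float(np.sum(p * np.log(p / q)))

def consistency_loss(ce_loss: float, p_from_src: np.ndarray, p_from_tgt: np.ndarray,
                     alpha: float = 1.0) -> float:
    """Total loss = translation cross-entropy + alpha * KL(p(.|source view) || p(.|target view))."""
    return ce_loss + alpha * kl(p_from_src, p_from_tgt)

if __name__ == "__main__":
    p_src = np.array([0.70, 0.20, 0.10])     # next-token distribution given the source sentence
    p_tgt = np.array([0.65, 0.25, 0.10])     # distribution given the target-side view
    print(round(consistency_loss(ce_loss=2.1, p_from_src=p_src, p_from_tgt=p_tgt), 4))
```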
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT) even when trained without parallel data. Yet, despite the fact that the amount of training data is gigantic, they still struggle with translating rare words, particularly for low-resource languages. Even worse, it is usually unrealistic to retrieve relevant demonstrations for in-context learning with low-resource languages on LLMs, which restricts the practical use of LLMs for translation -- how should we mitigate this problem? To this end, we present a novel method, CoD, which augments LLMs with prior knowledge with the chains of multilingual dictionaries for a subset of input words to elicit translation abilities for LLMs. Extensive experiments indicate that augmenting ChatGPT with CoD elicits large gains by up to 13x ChrF++ points for MNMT (3.08 to 42.63 for English to Serbian written in Cyrillic script) on FLORES-200 full devtest set. We further demonstrate the importance of chaining the multilingual dictionaries, as well as the superiority of CoD to few-shot demonstration for low-resource languages.
大型语言模型(LLM)即使在没有平行数据的情况下训练,也在多语言神经机器翻译(MNMT)中表现出令人惊讶的良好性能。然而,尽管训练数据量巨大,它们在翻译罕见词时仍有困难,对低资源语言尤其如此。更糟糕的是,对于低资源语言,通常难以为LLM的上下文学习检索相关示例,这限制了LLM在翻译中的实际应用——我们应如何缓解这一问题?为此,我们提出了一种新方法CoD,它利用多语言词典链为部分输入词提供先验知识来增强LLM,从而激发其翻译能力。大量实验表明,在FLORES-200完整devtest集上,用CoD增强ChatGPT可为MNMT带来最高达13倍的ChrF++提升(英语到以西里尔字母书写的塞尔维亚语从3.08提升到42.63)。我们进一步证明了串联多语言词典的重要性,以及在低资源语言上CoD相对于少样本示例的优越性。
https://arxiv.org/abs/2305.06575
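One way such a chained-dictionary prompt might look is sketched below. The prompt wording and dictionary chains are assumptions for illustration, not the paper's exact template; the idea is simply that selected source words are accompanied by a chain of equivalents in several languages before the translation request.

```python
# An illustrative sketch of a chained-dictionary prompt in the spirit of CoD (prompt
# wording and dictionary contents are assumptions, not the paper's exact template):
# for a subset of source words, the prompt lists a chain of equivalents in several
# languages before asking for the translation.

from typing import Dict, List

def build_cod_prompt(source: str, chains: Dict[str, List[str]], tgt_lang: str) -> str:
    hints = [f'"{word}" means ' + ", ".join(chain)
             for word, chain in chains.items() if word in source]
    hint_block = "\n".join(hints)
    return (f"{hint_block}\n"
            f"Translate the following sentence into {tgt_lang}: {source}")

if __name__ == "__main__":
    chains = {"drought": ['"suša" in Serbian', '"Dürre" in German', '"sécheresse" in French']}
    print(build_cod_prompt("The drought destroyed most of the harvest.", chains, "Serbian"))
```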
In this paper, we have shown the improvement of English to Bharti Braille machine translation system. We have shown how we can improve a baseline NMT model by adding some linguistic knowledge to it. This was done for five language pairs where English sentences were translated into five Indian languages and then subsequently to corresponding Bharti Braille. This has been demonstrated by adding a sub-module for translating multi-word expressions. The approach shows promising results as across language pairs, we could see improvement in the quality of NMT outputs. The least improvement was observed in English-Nepali language pair with 22.08% and the most improvement was observed in the English-Hindi language pair with 23.30%.
在本文中,我们展示了对英语到Bharti盲文机器翻译系统的改进。我们展示了如何通过加入一些语言学知识来改进基线NMT模型。我们针对五个语言对进行了实验:英语句子先被翻译成五种印度语言,再进一步转换为相应的Bharti盲文。我们通过添加一个用于翻译多词表达式的子模块来实现这一改进。该方法取得了令人鼓舞的结果:在各个语言对上,NMT输出的质量都有所提升。改进最小的是英语-尼泊尔语对,为22.08%;改进最大的是英语-印地语对,为23.30%。
https://arxiv.org/abs/2305.06157
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models based on data from different tasks (e.g., document translation and chat translation). Although the dataset-specific models have achieved impressive performance, it is cumbersome as each dataset demands a model to be designed, trained, and stored. In this work, we aim to unify these translation tasks into a more general setting. Specifically, we propose a ``versatile'' model, i.e., the Unified Model Learning for NMT (UMLNMT) that works with data from different tasks, and can translate well in multiple settings simultaneously, and theoretically it can be as many as possible. Through unified learning, UMLNMT is able to jointly train across multiple tasks, implementing intelligent on-demand translation. On seven widely-used translation tasks, including sentence translation, document translation, and chat translation, our UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs. Furthermore, UMLNMT can achieve competitive or better performance than state-of-the-art dataset-specific methods. Human evaluation and in-depth analysis also demonstrate the superiority of our approach on generating diverse and high-quality translations. Additionally, we provide a new genre translation dataset about famous aphorisms with 186k Chinese->English sentence pairs.
现有的神经机器翻译(NMT)研究主要关注基于不同任务(例如文档翻译和聊天翻译)的数据开发数据集特定的模型。虽然数据集特定模型取得了令人印象深刻的性能,但每个数据集都需要单独设计、训练和存储一个模型,十分繁琐。在这项工作中,我们旨在将这些翻译任务统一到一个更通用的设置中。具体来说,我们提出了一个“多功能”模型,即面向NMT的统一模型学习(UMLNMT),它可以利用来自不同任务的数据,同时在多种场景下都能很好地翻译,理论上可支持的任务数量不受限制。通过统一学习,UMLNMT能够跨多个任务联合训练,实现智能的按需翻译。在包括句子翻译、文档翻译和聊天翻译在内的七个广泛使用的翻译任务上,UMLNMT相比数据集特定模型取得了显著提升,并大幅降低了模型部署成本。此外,UMLNMT可以取得与最先进的数据集特定方法相当甚至更好的性能。人工评估和深入分析也证明了我们的方法在生成多样化高质量译文方面的优势。此外,我们还提供了一个关于著名格言的新体裁翻译数据集,包含18.6万个中文到英语的句子对。
https://arxiv.org/abs/2305.02777
Latent variable modeling in non-autoregressive neural machine translation (NAT) is a promising approach to mitigate the multimodality problem. In the previous works, they added an auxiliary model to estimate the posterior distribution of the latent variable conditioned on the source and target sentences. However, it causes several disadvantages, such as redundant information extraction in the latent variable, increasing parameters, and a tendency to ignore a part of the information from the inputs. In this paper, we propose a new latent variable modeling that is based on a dual reconstruction perspective and an advanced hierarchical latent modeling approach. Our proposed method, {\em LadderNMT}, shares a latent space across both languages so that it hypothetically alleviates or solves the above disadvantages. Experimental results quantitatively and qualitatively demonstrate that our proposed latent variable modeling learns an advantageous latent space and significantly improves translation quality in WMT translation tasks.
在非自回归神经机器翻译(NAT)中,隐变量建模是缓解多模态问题的一种有前途的方法。在以往的工作中,研究者添加了一个辅助模型来估计以源句和目标句为条件的隐变量后验分布。然而,这带来了若干缺点,例如隐变量中提取的信息冗余、参数增加,以及倾向于忽略输入中的部分信息。在本文中,我们提出了一种基于双重重构视角和先进的层次化隐变量建模方法的新型隐变量建模。我们提出的方法{\em LadderNMT}在两种语言之间共享隐空间,因此有望减轻或解决上述缺点。实验结果在定量和定性上都表明,我们提出的隐变量建模能够学到有利的隐空间,并在WMT翻译任务中显著提升了翻译质量。
https://arxiv.org/abs/2305.03511
With the advent of deep learning methods, Neural Machine Translation (NMT) systems have become increasingly powerful. However, deep learning based systems are susceptible to adversarial attacks, where imperceptible changes to the input can cause undesirable changes at the output of the system. To date there has been little work investigating adversarial attacks on sequence-to-sequence systems, such as NMT models. Previous work in NMT has examined attacks with the aim of introducing target phrases in the output sequence. In this work, adversarial attacks for NMT systems are explored from an output perception perspective. Thus the aim of an attack is to change the perception of the output sequence, without altering the perception of the input sequence. For example, an adversary may distort the sentiment of translated reviews to have an exaggerated positive sentiment. In practice it is challenging to run extensive human perception experiments, so a proxy deep-learning classifier applied to the NMT output is used to measure perception changes. Experiments demonstrate that the sentiment perception of NMT systems' output sequences can be changed significantly.
随着深度学习方法的出现,神经机器翻译(NMT)系统变得越来越强大。然而,基于深度学习的系统容易受到对抗攻击:对输入进行难以察觉的改动,就可能导致系统输出发生不期望的变化。迄今为止,针对NMT模型等序列到序列系统的对抗攻击研究很少。以往的NMT研究考察的是旨在在输出序列中引入目标短语的攻击。在本工作中,我们从输出感知的角度探索针对NMT系统的对抗攻击:攻击的目标是在不改变输入序列感知的情况下改变输出序列的感知。例如,攻击者可能会扭曲翻译评论的情感,使其表现出夸大的积极情感。在实践中,开展大规模的人类感知实验很困难,因此我们使用作用于NMT输出的代理深度学习分类器来衡量感知变化。实验表明,NMT系统输出序列的情感感知可以被显著改变。
https://arxiv.org/abs/2305.01437
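The evaluation loop can be sketched as below, with all models as placeholders: a small, hard-to-notice perturbation is applied to the source, both versions are translated, and a proxy sentiment classifier on the output side measures how far the perceived sentiment of the translation shifts.

```python
# A schematic sketch of the evaluation loop described above (models are placeholders):
# an imperceptible-looking perturbation is applied to the source, both versions are
# translated, and a proxy sentiment classifier on the output side measures how much
# the perceived sentiment of the translation has shifted.

from typing import Callable

def perception_shift(source: str,
                     perturb: Callable[[str], str],
                     translate: Callable[[str], str],
                     sentiment: Callable[[str], float]) -> float:
    """Return the change in output-side sentiment score caused by the input perturbation."""
    clean_out = translate(source)
    adv_out = translate(perturb(source))
    return sentiment(adv_out) - sentiment(clean_out)

if __name__ == "__main__":
    # Stubs for demonstration only; real experiments use an NMT model and a trained classifier.
    perturb = lambda s: s.replace("fine", "fine!")            # small, hard-to-notice edit
    translate = lambda s: "totally great" if "!" in s else "okay"
    sentiment = lambda t: 1.0 if "great" in t else 0.5
    print("sentiment shift:", perception_shift("The hotel was fine.", perturb, translate, sentiment))
```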
Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets. However, document-level NMT is still not widely adopted in real-world translation systems mainly due to the lack of large-scale general-domain training data for document-level NMT. We examine the effectiveness of using Paracrawl for learning document-level translation. Paracrawl is a large-scale parallel corpus crawled from the Internet and contains data from various domains. The official Paracrawl corpus was released as parallel sentences (extracted from parallel webpages) and therefore previous works only used Paracrawl for learning sentence-level translation. In this work, we extract parallel paragraphs from Paracrawl parallel webpages using automatic sentence alignments and we use the extracted parallel paragraphs as parallel documents for training document-level translation models. We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents from TED, News and Europarl, outperforming sentence-level NMT models. We also perform a targeted pronoun evaluation and show that document-level models trained with Paracrawl data can help context-aware pronoun translation.
在多个数据集上,文档级神经机器翻译(NMT)的表现已超过句子级NMT。然而,文档级NMT在实际翻译系统中仍未被广泛采用,主要原因是缺乏大规模通用领域的文档级训练数据。我们考察了使用Paracrawl学习文档级翻译的有效性。Paracrawl是一个从互联网上爬取的大规模平行语料库,包含来自多个领域的数据。官方发布的Paracrawl语料库是平行句子(从平行网页中抽取)的形式,因此以往的工作只用Paracrawl来学习句子级翻译。在本工作中,我们利用自动句子对齐从Paracrawl的平行网页中抽取平行段落,并将这些平行段落作为平行文档来训练文档级翻译模型。我们表明,仅用Paracrawl平行段落训练的文档级NMT模型可以用来翻译来自TED、News和Europarl的真实文档,并且优于句子级NMT模型。我们还进行了针对代词的评估,表明用Paracrawl数据训练的文档级模型有助于上下文感知的代词翻译。
https://arxiv.org/abs/2304.10216