Pronoun translation is a longstanding challenge in neural machine translation (NMT), often requiring inter-sentential context to ensure linguistic accuracy. To address this, we introduce ProNMT, a novel framework designed to enhance pronoun and overall translation quality in context-aware machine translation systems. ProNMT leverages Quality Estimation (QE) models and a unique Pronoun Generation Likelihood-Based Feedback mechanism to iteratively fine-tune pre-trained NMT models without relying on extensive human annotations. The framework combines QE scores with pronoun-specific rewards to guide training, ensuring improved handling of linguistic nuances. Extensive experiments demonstrate significant gains in pronoun translation accuracy and general translation quality across multiple metrics. ProNMT offers an efficient, scalable, and context-aware approach to improving NMT systems, particularly in translating context-dependent elements like pronouns.
代词翻译一直是神经机器翻译(NMT)领域的一个长期挑战,通常需要跨句子的上下文信息来确保语言准确性。为了解决这一问题,我们引入了ProNMT——一个旨在提升上下文感知机器翻译系统中代词翻译质量和整体翻译质量的新型框架。ProNMT利用质量估计(QE)模型和一种独特的基于代词生成可能性的反馈机制,在不依赖大量人工标注的情况下迭代地微调预训练NMT模型。该框架结合了QE分数与特定于代词的奖励来指导训练过程,从而更好地处理语言中的细微差别。广泛的实验表明,ProNMT在多项指标上显著提高了代词翻译准确性以及总体翻译质量。这一方法提供了一种高效、可扩展且上下文感知的方式来改进NMT系统,特别是在翻译依赖于上下文的语言元素(如代词)时表现尤为突出。
https://arxiv.org/abs/2501.03008
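To make the feedback idea concrete, here is a minimal sketch of mixing a QE score with a pronoun-generation-likelihood reward into a single training signal. The pronoun set, weighting, and score ranges are my own illustrative assumptions, not ProNMT's exact formulation.

```python
# Hypothetical sketch of ProNMT-style feedback: mix a sentence-level QE score with
# a pronoun-generation-likelihood reward into one scalar that weights the
# fine-tuning loss. The pronoun set, weighting, and score ranges are assumptions,
# not the paper's exact formulation.
import math
from typing import List

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def pronoun_reward(token_logprobs: List[float], tokens: List[str]) -> float:
    """Average generation probability of pronoun tokens in the hypothesis."""
    probs = [math.exp(lp) for lp, tok in zip(token_logprobs, tokens) if tok.lower() in PRONOUNS]
    return sum(probs) / len(probs) if probs else 0.0

def combined_reward(qe_score: float, pron_reward: float, alpha: float = 0.7) -> float:
    """Weighted mix of a QE score (assumed in [0, 1]) and the pronoun reward."""
    return alpha * qe_score + (1.0 - alpha) * pron_reward

# Example: a hypothesis whose pronouns were generated confidently earns a higher reward.
r = pronoun_reward(token_logprobs=[-0.1, -2.3, -0.05], tokens=["She", "opened", "it"])
print(combined_reward(qe_score=0.82, pron_reward=r))
```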
Multilingual neural machine translation (MNMT) enables arbitrary translations across multiple languages by training a model with limited parameters using parallel data only. However, the performance of such MNMT models still lags behind that of large language models (LLMs), limiting their practicality. In this work, we address this limitation by introducing registering to achieve a new state of the art for decoder-only MNMT models. Specifically, we insert a set of artificial tokens specifying the target language, called registers, into the input sequence between the source and target tokens. By modifying the attention mask, the target token generation only pays attention to the activation of registers, representing the source tokens in the target language space. Experiments on EC-40, a large-scale benchmark, show that our method outperforms related methods driven by optimizing multilingual representations. We further scale up and collect 9.3 billion sentence pairs across 24 languages from public datasets to pre-train two models, namely MITRE (multilingual translation with registers). One of them, MITRE-913M, outperforms NLLB-3.3B, achieves comparable performance with commercial LLMs, and shows strong adaptability in fine-tuning. Finally, we open-source our models to facilitate further research and development in MNMT: this https URL.
多语言神经机器翻译(MNMT)通过仅使用平行数据训练具有有限参数的模型,实现了任意多种语言之间的翻译。然而,这种MNMT模型的性能仍然落后于大型语言模型(LLMs),这限制了其实际应用价值。在本项工作中,我们通过引入注册机制来解决这一局限性,使仅解码器(decoder-only)MNMT模型达到了新的最先进水平。具体来说,在输入序列中,我们将一组指定目标语言的人工标记——称为注册——插入到源语言和目标语言令牌之间。通过对注意力掩码进行修改,目标令牌生成仅关注注册标记的激活,这些激活代表了目标语言空间中的源语言令牌。我们在EC-40这一大规模基准测试上的实验表明,我们的方法优于基于优化多语言表示的相关方法。我们进一步扩大规模,在公共数据集中收集了跨越24种语言共计93亿个句子对,并使用这些数据预训练了两个模型,即MITRE(带有注册的多语言翻译)。其中的一个模型MITRE-913M超越了NLLB-3.3B,实现了与商业LLMs相当的表现,并在微调中表现出强大的适应性。最后,我们开源我们的模型,以促进MNMT领域的进一步研究和发展:此链接。
https://arxiv.org/abs/2501.02979
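A minimal sketch of the register mechanism as described above, under my own interpretation of the masking: the input is laid out as [source | registers | target], target positions read only the register activations (plus a causal mask over the target), and registers read the source. MITRE's actual mask may differ in detail.

```python
# A minimal sketch (assumptions mine) of the register idea: the input is
# [source tokens | register tokens | target tokens], and the attention mask is
# built so that target positions attend only to the registers (plus causally to
# previously generated target tokens), while registers attend to the source.
import numpy as np

def register_attention_mask(n_src: int, n_reg: int, n_tgt: int) -> np.ndarray:
    n = n_src + n_reg + n_tgt
    mask = np.zeros((n, n), dtype=bool)          # True = attention allowed
    src = slice(0, n_src)
    reg = slice(n_src, n_src + n_reg)
    tgt = slice(n_src + n_reg, n)

    mask[src, src] = True                        # source attends to source
    mask[reg, src] = True                        # registers read the source ...
    mask[reg, reg] = True                        # ... and each other
    mask[tgt, reg] = True                        # targets read only register activations
    tgt_idx = np.arange(n_src + n_reg, n)
    causal = tgt_idx[:, None] >= tgt_idx[None, :]
    mask[np.ix_(tgt_idx, tgt_idx)] = causal      # plus a causal mask over the target
    return mask

print(register_attention_mask(n_src=4, n_reg=2, n_tgt=3).astype(int))
```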
In this project, we develop a practical and efficient solution for automating Manhwa translation from Indonesian to English. Our approach combines computer vision, text recognition, and natural language processing techniques to streamline the traditionally manual process of Manhwa (Korean comics) translation. The pipeline includes a fine-tuned YOLOv5xu for speech bubble detection, Tesseract for OCR, and a fine-tuned MarianMT for machine translation. By automating these steps, we aim to make Manhwa more accessible to a global audience while saving time and effort compared to manual translation methods. While most Manhwa translation efforts focus on Japanese-to-English, we focus on Indonesian-to-English translation to address the challenges of working with low-resource languages. Our model shows good results at each step and was able to translate from Indonesian to English efficiently.
在这个项目中,我们开发了一种实用且高效的解决方案,用于自动化从印尼语到英语的漫画(又称Manhwa,即韩式漫画)翻译过程。我们的方法结合了计算机视觉、文本识别和自然语言处理技术,以简化传统手动翻译Manhwa的过程。该流程包括使用经过微调的YOLOv5xu进行气泡检测、Tesseract进行OCR文字识别以及经过微调的MarianMT用于机器翻译。通过自动化这些步骤,我们旨在使更多的Manhwa作品能够被全球观众接触和理解,并且相比手动翻译方法可以节省大量时间和精力。 大多数Manhwa翻译工作主要集中在日语到英语的翻译上,而我们的项目则专注于印尼语到英语的翻译,以解决处理低资源语言所面临的挑战。我们的模型在每个步骤中都显示出良好的效果,并能够高效地完成从印尼语到英语的翻译任务。
https://arxiv.org/abs/2501.01629
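A rough skeleton of the described pipeline. The checkpoints used here are public stand-ins (stock YOLOv5x via torch.hub, Tesseract's Indonesian 'ind' language pack, Helsinki-NLP/opus-mt-id-en); the paper's fine-tuned YOLOv5xu and MarianMT weights are not assumed to be available.

```python
# Rough pipeline sketch: detect speech bubbles, OCR the crops in Indonesian, then
# translate to English. The checkpoints below are public stand-ins, not the
# paper's fine-tuned YOLOv5xu / MarianMT weights.
import torch
from PIL import Image
import pytesseract
from transformers import MarianMTModel, MarianTokenizer

detector = torch.hub.load("ultralytics/yolov5", "yolov5x", pretrained=True)
tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-id-en")
mt = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-id-en")

def translate_page(path: str):
    page = Image.open(path)
    boxes = detector(path).xyxy[0]                   # rows of (x1, y1, x2, y2, conf, cls)
    out = []
    for x1, y1, x2, y2, conf, _ in boxes.tolist():
        crop = page.crop((int(x1), int(y1), int(x2), int(y2)))
        src = pytesseract.image_to_string(crop, lang="ind").strip()
        if not src:
            continue
        ids = tok(src, return_tensors="pt", truncation=True)
        en = tok.decode(mt.generate(**ids)[0], skip_special_tokens=True)
        out.append(((x1, y1, x2, y2), src, en))
    return out
```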
Due to reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is highly prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, our focus is on Romanized Sinhala transliteration. We propose two methods to address this problem: our baseline is a rule-based method, which is then compared against our second method, where we approach the transliteration problem as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method could capture many ad-hoc patterns within the Romanized scripts compared to the rule-based method. The code base associated with this paper is available on GitHub - this https URL
由于便利性和技术知识的缺乏,音译(即用罗马字母拼写本地文字而非使用本地化工具)在如僧伽罗语这样有自己书写系统的低资源语言中非常普遍。在这项研究中,我们的重点是罗马化的僧伽罗语转写。我们提出了两种方法来解决这个问题:一种是基于规则的方法作为基准,然后将其与另一种方法进行比较,在后一种方法中,我们将音译问题视为类似于成熟的神经机器翻译(NMT)任务的序列到序列任务。对于后者,我们提出了一种基于Transformer的编码器-解码器解决方案。我们发现,相比于基于规则的方法,基于Transformer的方法能够捕捉罗马化文本中的许多临时模式。与这篇论文相关的代码库可在GitHub上获取——此链接。
https://arxiv.org/abs/2501.00529
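For flavor, a toy version of the rule-based baseline: greedy longest-match substitution of Romanized syllables with Sinhala script. The mapping covers only a handful of consonant+vowel pairs and ignores most orthographic rules, so it only illustrates the shape of such a baseline, not the paper's rule set.

```python
# Toy illustration of a rule-based baseline: greedy longest-match replacement of
# Romanized syllables with Sinhala script. The mapping is a tiny, incomplete
# shape sketch, not the paper's rule set.
RULES = {
    "ka": "ක", "ma": "ම", "la": "ල", "ya": "ය", "ta": "ත",
    "na": "න", "ra": "ර", "va": "ව", "sa": "ස", "ha": "හ",
}

def rule_transliterate(romanized: str) -> str:
    out, i = [], 0
    keys = sorted(RULES, key=len, reverse=True)    # try longest rules first
    while i < len(romanized):
        for k in keys:
            if romanized.startswith(k, i):
                out.append(RULES[k])
                i += len(k)
                break
        else:                                      # unknown span: copy as-is
            out.append(romanized[i])
            i += 1
    return "".join(out)

print(rule_transliterate("mama yanava"))           # roughly "I am going"
```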
We address the challenging task of neural machine translation (NMT) in the entertainment domain, where the objective is to automatically translate a given dialogue from the source language to a target language. This task has various applications, particularly in automatic dubbing, subtitling, and other content localization tasks, enabling source content to reach a wider audience. Traditional NMT systems typically translate individual sentences in isolation, without facilitating knowledge transfer of crucial elements such as the context and style from previously encountered sentences. In this work, we emphasize the significance of these fundamental aspects in producing pertinent and captivating translations. We demonstrate their significance through several examples and propose a novel framework for entertainment translation, which, to our knowledge, is the first of its kind. Furthermore, we introduce an algorithm to estimate the context and style of the current session and use these estimations to generate a prompt that guides a Large Language Model (LLM) to generate high-quality translations. Our method is both language and LLM-agnostic, making it a general-purpose tool. We demonstrate the effectiveness of our algorithm through various numerical studies and observe significant improvement in the COMET scores over various state-of-the-art LLMs. Moreover, our proposed method consistently outperforms baseline LLMs in terms of win-ratio.
我们解决了娱乐领域中的神经机器翻译(NMT)难题,目标是将源语言的内容对话自动翻译成目标语言。这项任务在自动配音、字幕生成和其他内容本地化方面有着广泛的应用,有助于让更多观众接触到原版内容。传统的NMT系统通常孤立地翻译单个句子,并且无法有效传递之前句子中的关键元素(如上下文和风格)的知识。在这项工作中,我们强调了这些基本因素在产生相关性和吸引力的翻译中所起的关键作用。通过几个示例证明其重要性,我们提出了一种新型框架用于娱乐领域内的翻译工作,据我们所知这是首次尝试。此外,我们还引入了一种算法来估计当前会话中的上下文和风格,并利用这些估算值生成一个提示,指导大型语言模型(LLM)生成高质量的翻译文本。我们的方法对于不同语言和LLM均适用,因此具有通用性。通过多种数值研究证明了该算法的有效性,并且在COMET评分上相对于各种最先进的LLM显著提升。此外,在胜率方面,我们提出的方法始终优于基线LLMs。
https://arxiv.org/abs/2412.20440
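A hypothetical sketch of the session-level prompting idea: keep a rolling record of recent dialogue and an evolving style description, and fold both into the prompt given to the LLM. The estimation and prompt wording are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch: maintain a rolling session state (recent dialogue plus a
# style estimate) and build a context/style-conditioned prompt for an LLM.
from collections import deque

class SessionState:
    def __init__(self, window: int = 5):
        self.history = deque(maxlen=window)        # last few (source, translation) pairs
        self.style = "neutral, conversational"     # updated as the session progresses

    def update(self, src: str, tgt: str, style_hint: str = ""):
        self.history.append((src, tgt))
        if style_hint:
            self.style = style_hint

    def build_prompt(self, src: str, src_lang: str, tgt_lang: str) -> str:
        context = "\n".join(f"{s} => {t}" for s, t in self.history)
        return (
            f"You are translating dialogue for entertainment content from {src_lang} to {tgt_lang}.\n"
            f"Maintain this style: {self.style}.\n"
            f"Recent dialogue and translations:\n{context}\n"
            f"Translate the next line, keeping it consistent with the context above:\n{src}"
        )

state = SessionState()
state.update("Kahan ja rahe ho?", "Where are you off to?", style_hint="casual, playful banter")
print(state.build_prompt("Ruko, main bhi aata hoon!", "Hindi", "English"))
```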
Neural Machine Translation (NMT) systems built on multilingual sequence-to-sequence Language Models (msLMs) fail to deliver the expected results when both the amount of parallel data for a language and the language's representation in the model are limited. This restricts the capabilities of domain-specific NMT systems for low-resource languages (LRLs). As a solution, parallel data from auxiliary domains can be used either to fine-tune or to further pre-train the msLM. We present an evaluation of the effectiveness of these two techniques in the context of domain-specific LRL-NMT. We also explore the impact of domain divergence on NMT model performance. We recommend several strategies for utilizing auxiliary parallel data in building domain-specific NMT models for LRLs.
基于多语言序列到序列语言模型(msLM)构建的神经机器翻译(NMT)系统,在某一语言的平行数据量有限以及该语言在模型中的表示受限时,无法达到预期效果。这限制了低资源语言(LRLs)领域特定NMT系统的功能。作为解决方案,可以使用辅助领域的平行数据来微调或进一步预训练msLM。我们评估了这两种技术在领域特定的LRL-NMT中的有效性,并探讨了领域差异对NMT模型性能的影响。我们推荐了几种策略,用于利用辅助平行数据构建针对低资源语言的领域特定NMT模型。
https://arxiv.org/abs/2412.19522
Neural Machine Translation (NMT) models have shown remarkable performance but remain largely opaque in their decision-making processes. The interpretability of these models, especially their internal attention mechanisms, is critical for building trust and verifying that these systems behave as intended. In this work, we introduce a systematic framework to quantitatively evaluate the explainability of an NMT model's attention patterns by comparing them against statistical alignments and correlating them with standard machine translation quality metrics. We present a set of metrics, attention entropy and alignment agreement, and validate them on an English-German test subset from WMT14 using a pre-trained mT5 model. Our results indicate that sharper attention distributions correlate with improved interpretability but do not always guarantee better translation quality. These findings advance our understanding of NMT explainability and guide future efforts toward building more transparent and reliable machine translation systems.
神经机器翻译(NMT)模型展示了出色的性能,但其决策过程在很大程度上仍然是不透明的。这些模型(尤其是其内部注意力机制)的可解释性,对于建立信任以及验证系统是否按预期运行至关重要。在这项工作中,我们引入了一个系统化框架,通过将注意力模式与统计对齐进行比较,并将其与标准机器翻译质量指标相关联,来定量评估NMT模型注意力模式的可解释性。我们提出了一组度量标准——注意力熵和对齐一致性,并使用WMT14中的英语-德语测试子集以及预训练的mT5模型对其进行了验证。我们的研究结果表明,更尖锐的注意力分布与更好的可解释性相关,但并不总能保证更高的翻译质量。这些发现加深了我们对NMT可解释性的理解,并为未来构建更加透明和可靠的机器翻译系统指明了方向。
https://arxiv.org/abs/2412.18669
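Under reasonable but assumed definitions, the two metrics named in the abstract can be computed as follows: attention entropy as the mean Shannon entropy of each target position's attention distribution over source tokens, and alignment agreement as the attention mass falling on statistically aligned source positions. The paper's exact formulations may differ.

```python
# Sketch of the two metrics under my own definitions (the paper's may differ).
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """attn: (tgt_len, src_len) attention matrix whose rows sum to 1."""
    p = np.clip(attn, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=-1)))

def alignment_agreement(attn: np.ndarray, align: np.ndarray) -> float:
    """align: binary (tgt_len, src_len) matrix from a statistical aligner (e.g. fast_align)."""
    return float(np.sum(attn * align) / attn.shape[0])

attn = np.array([[0.8, 0.1, 0.1],
                 [0.2, 0.7, 0.1]])
align = np.array([[1, 0, 0],
                  [0, 1, 0]])
print(attention_entropy(attn), alignment_agreement(attn, align))
```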
This paper analyses how traditional baseline metrics, such as BLEU and TER, and neural-based methods, such as BERTScore and COMET, score several NMT models' performance on chat translation, and how these metrics perform when compared to human-annotated scores. The results show that for ranking NMT models in chat translation, all metrics seem consistent in deciding which model outperforms the others. This implies that traditional baseline metrics, which are faster and simpler to use, can still be helpful. On the other hand, when it comes to better correlation with human judgment, neural-based metrics outperform traditional metrics, with COMET achieving the highest correlation with the human-annotated scores on chat translation. However, we show that even the best metric struggles when scoring English translations of Japanese sentences containing anaphoric zero pronouns.
本文分析了传统基准指标(如BLEU和TER)以及基于神经网络的方法(如BERTScore和COMET)如何评估几种NMT模型在聊天翻译中的性能,以及这些指标与人工标注得分相比的表现如何。结果显示,在对聊天翻译中的NMT模型进行排名时,所有指标在决定哪个模型表现更优方面似乎是一致的。这表明传统的基准指标由于使用起来更快、更简单,因此仍然是有用的。另一方面,在与人类判断的相关性方面,基于神经网络的指标优于传统指标,其中COMET在聊天翻译上与人工标注得分的相关性最高。然而,我们还发现,即使是最优的指标,在对包含日语回指零代词的句子的英语译文进行评分时也显得力不从心。
https://arxiv.org/abs/2412.18190
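The meta-evaluation step, correlating each metric's segment-level scores with human judgments, is straightforward; here is a sketch with dummy numbers used purely to make the snippet runnable.

```python
# Minimal sketch of correlating automatic metric scores with human-annotated
# scores. The score values below are dummy placeholders, not real results.
from scipy.stats import pearsonr, spearmanr

human = [72, 85, 60, 90, 78]
metric_scores = {
    "BLEU":  [0.31, 0.42, 0.25, 0.47, 0.36],
    "COMET": [0.70, 0.84, 0.58, 0.91, 0.77],
}

for name, scores in metric_scores.items():
    r, _ = pearsonr(scores, human)
    rho, _ = spearmanr(scores, human)
    print(f"{name}: Pearson={r:.3f}, Spearman={rho:.3f}")
```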
Neural Machine Translation (NMT) models can be specialized by domain adaptation, often involving fine-tuning on a dataset of interest. This process risks catastrophic forgetting: rapid loss of generic translation quality. Forgetting has been widely observed, with many mitigation methods proposed. However, the causes of forgetting and the relationship between forgetting and adaptation data are under-explored. This paper takes a novel approach to understanding catastrophic forgetting during NMT adaptation by investigating the impact of the data. We provide a first investigation of what is forgotten, and why. We examine the relationship between forgetting and the in-domain data, and show that the amount and type of forgetting is linked to that data's target vocabulary coverage. Our findings pave the way toward better informed NMT domain adaptation.
神经机器翻译(NMT)模型可以通过领域适应进行专业化,这通常涉及在感兴趣的特定数据集上微调。这一过程存在灾难性遗忘的风险:即通用翻译质量的快速下降。遗忘现象已被广泛观察到,并提出了许多缓解方法。然而,遗忘的原因以及遗忘与适应数据之间的关系尚未得到充分探索。本文采用了一种新颖的方法来理解NMT领域适应期间的灾难性遗忘,通过研究数据的影响来进行这一探索。我们首次调查了什么被遗忘了,为什么会发生这种情况。我们探讨了遗忘和特定领域数据之间的关系,并表明遗忘的数量和类型与其目标词汇覆盖范围有关。我们的发现为更好地指导NMT领域的适应铺平了道路。
https://arxiv.org/abs/2412.17537
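A small sketch of the quantity the paper ties forgetting to: the fraction of the generic-domain target vocabulary that also appears in the in-domain adaptation data. The tokenization and thresholding here are my own simplifications.

```python
# Sketch of target-vocabulary coverage of the adaptation data over the generic
# (pre-adaptation) target vocabulary. Whitespace tokenization is a simplification.
from collections import Counter

def target_vocab_coverage(generic_targets, in_domain_targets, min_count: int = 1) -> float:
    generic_vocab = {w for w, c in Counter(
        tok for sent in generic_targets for tok in sent.split()).items() if c >= min_count}
    in_domain_vocab = {tok for sent in in_domain_targets for tok in sent.split()}
    return len(generic_vocab & in_domain_vocab) / max(len(generic_vocab), 1)

generic = ["the cat sat on the mat", "a dog ran home"]
medical = ["the patient sat up", "a dose was given"]
print(f"coverage = {target_vocab_coverage(generic, medical):.2f}")
```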
This paper demonstrates that Phrase-Based Statistical Machine Translation (PBSMT) can outperform Transformer-based Neural Machine Translation (NMT) in moderate-resource scenarios, specifically for structurally similar languages, like the Persian-Hindi pair. Despite the Transformer architecture's typical preference for large parallel corpora, our results show that PBSMT achieves a BLEU score of 66.32, significantly exceeding the Transformer-NMT score of 53.7 on the same dataset. Additionally, we explore variations of the SMT architecture, including training on Romanized text and modifying the word order of Persian sentences to match the left-to-right (LTR) structure of Hindi. Our findings highlight the importance of choosing the right architecture based on language pair characteristics and advocate for SMT as a high-performing alternative, even in contexts commonly dominated by NMT.
本文证明,在中等资源场景下,基于短语的统计机器翻译(PBSMT)可以在结构相似的语言对之间超越基于Transformer的神经机器翻译(NMT),如波斯语-印地语配对。尽管Transformer架构通常更倾向于大规模平行语料库,但我们的结果显示,PBSMT在相同数据集上达到了66.32的BLEU分数,显著超过了Transformer-NMT的53.7分。此外,我们还探索了SMT架构的各种变体,包括使用罗马化文本进行训练以及调整波斯语句子的词序以匹配印地语从左到右(LTR)的结构。我们的研究结果强调了根据语言对特征选择合适架构的重要性,并倡导将SMT作为一种高性能替代方案,即使在通常由NMT主导的情况下也是如此。
https://arxiv.org/abs/2412.16877
Neural networks have demonstrated significant advancements in Neural Machine Translation (NMT) compared to conventional phrase-based approaches. However, Multilingual Neural Machine Translation (MNMT) in extremely low-resource settings remains underexplored. This research investigates how knowledge transfer across languages can enhance MNMT in such scenarios. Using the Tatoeba translation challenge dataset from Helsinki NLP, we perform English-German, English-French, and English-Spanish translations, leveraging minimal parallel data to establish cross-lingual mappings. Unlike conventional methods relying on extensive pre-training for specific language pairs, we pre-train our model on English-English translations, setting English as the source language for all tasks. The model is fine-tuned on target language pairs using joint multi-task and sequential transfer learning strategies. Our work addresses three key questions: (1) How can knowledge transfer across languages improve MNMT in extremely low-resource scenarios? (2) How does pruning neuron knowledge affect model generalization, robustness, and catastrophic forgetting? (3) How can TX-Ray interpret and quantify knowledge transfer in trained models? Evaluation using BLEU-4 scores demonstrates that sequential transfer learning outperforms baselines on a 40k parallel sentence corpus, showcasing its efficacy. However, pruning neuron knowledge degrades performance, increases catastrophic forgetting, and fails to improve robustness or generalization. Our findings provide valuable insights into the potential and limitations of knowledge transfer and pruning in MNMT for extremely low-resource settings.
神经网络在神经机器翻译(NMT)中与传统的基于短语的方法相比展现出了显著的进步。然而,在极低资源设置下的多语言神经机器翻译(MNMT)仍然研究不足。这项研究调查了跨语言的知识迁移如何能够改善在这种情景下的MNMT。我们使用来自Helsinki NLP的Tatoeba翻译挑战数据集,进行了英语-德语、英语-法语和英语-西班牙语的翻译工作,利用最少的平行数据建立了跨语言映射。与传统的依赖于特定语言对大量预训练的方法不同,我们在英语-英语的翻译上进行模型的预训练,并将英语设定为所有任务的源语言。该模型通过联合多任务学习和顺序迁移学习策略,在目标语言对上进行了微调。我们的工作回答了三个关键问题:(1)跨语言的知识转移如何改善极低资源场景下的MNMT?(2)剪枝神经元知识如何影响模型的泛化能力、鲁棒性和灾难性遗忘?(3)TX-Ray如何解释和量化训练模型中的知识迁移?通过使用BLEU-4评分进行评估,结果显示顺序迁移学习在包含40k平行句子的数据集上优于基线,展示了其有效性。然而,剪枝神经元知识会降低性能、增加灾难性遗忘,并未能提高鲁棒性和泛化能力。我们的发现为极低资源设置下的MNMT中知识转移和剪枝的潜力与局限提供了有价值的见解。
https://arxiv.org/abs/2412.13881
The commonly used Reinforcement Learning (RL) model, MDPs (Markov Decision Processes), has a basic premise that rewards depend on the current state and action only. However, many real-world tasks are non-Markovian, with long-term memory and dependencies. The reward sparseness problem is further amplified in non-Markovian scenarios. Hence, learning a non-Markovian task (NMT) is inherently more difficult than learning a Markovian one. In this paper, we propose a novel Parallel and Modular RL framework, ParMod, specifically for learning NMTs specified by temporal logic. With the aid of formal techniques, the NMT is modularized into a series of sub-tasks based on the automaton structure (equivalent to its temporal logic counterpart). On this basis, sub-tasks are trained by a group of agents in a parallel fashion, with one agent handling one sub-task. Besides parallel training, the core of ParMod lies in: a flexible classification method for modularizing the NMT, and an effective reward shaping method for improving the sample efficiency. A comprehensive evaluation is conducted on several challenging benchmark problems with respect to various metrics. The experimental results show that ParMod achieves superior performance over other relevant studies. Our work thus provides a good synergy among RL, NMT and temporal logic.
常用的强化学习(RL)模型——马尔可夫决策过程(MDPs),其基本前提是奖励仅依赖于当前状态和动作。然而,许多现实世界任务是非马尔可夫的,具有长期记忆和依赖性。在非马尔可夫场景中,稀疏奖励问题进一步加剧。因此,学习一个非马尔可夫任务(NMT)本质上比学习一个马尔可夫任务更加困难。本文提出了一种新颖的**并行与模块化RL框架**——ParMod,专门用于学习由时态逻辑指定的非马尔可夫任务。借助形式化技术,根据自动机结构(与其时态逻辑表述等价),将NMT分解为一系列子任务。在此基础上,一组代理将以并行的方式训练这些子任务,每个代理处理一个子任务。除了并行训练外,ParMod的核心在于:一种灵活的分类方法用于模块化NMT,以及一种有效的奖励塑形方法以提高采样效率。我们在多个具有挑战性的基准问题上,针对各种度量标准进行了全面评估。实验结果表明,ParMod的表现优于其他相关研究。因此,我们的工作在RL、NMT与时态逻辑之间形成了良好的协同。
https://arxiv.org/abs/2412.12700
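A toy illustration of the modularization idea (my own simplification, not the paper's construction): the property "eventually reach A, then eventually reach B" compiles to a small automaton, each non-accepting automaton state defines a sub-task owned by one agent, and a potential over automaton progress supplies shaped rewards.

```python
# Toy sketch: a two-step temporal task as an automaton; each non-accepting state
# is a sub-task, and a potential over automaton progress gives shaped rewards.
# The automaton and shaping are my own simplifications.
AUTOMATON = {                       # state -> {observed label -> next state}
    "q0": {"A": "q1"},              # sub-task 0: reach A
    "q1": {"B": "q_acc"},           # sub-task 1: having seen A, reach B
    "q_acc": {},                    # accepting state
}
POTENTIAL = {"q0": 0.0, "q1": 0.5, "q_acc": 1.0}

def step_automaton(q: str, label: str) -> str:
    return AUTOMATON[q].get(label, q)

def shaped_reward(q: str, q_next: str, env_reward: float = 0.0, gamma: float = 0.99) -> float:
    return env_reward + gamma * POTENTIAL[q_next] - POTENTIAL[q]

# One trajectory of observed labels; each transition's reward goes to the agent
# that owns the current sub-task (automaton state).
q = "q0"
for label in ["none", "A", "none", "B"]:
    q_next = step_automaton(q, label)
    print(q, "--", label, "->", q_next, "reward:", round(shaped_reward(q, q_next), 3))
    q = q_next
```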
We introduce MT-LENS, a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks, including translation quality, gender bias detection, added toxicity, and robustness to misspellings. While several toolkits have become very popular for benchmarking the capabilities of Large Language Models (LLMs), existing evaluation tools often lack the ability to thoroughly assess the diverse aspects of MT performance. MT-LENS addresses these limitations by extending the capabilities of LM-eval-harness for MT, supporting state-of-the-art datasets and a wide range of evaluation metrics. It also offers a user-friendly platform to compare systems and analyze translations with interactive visualizations. MT-LENS aims to broaden access to evaluation strategies that go beyond traditional translation quality evaluation, enabling researchers and engineers to better understand the performance of an NMT model and also easily measure a system's biases.
我们介绍了一种名为MT-LENS的框架,该框架旨在评估机器翻译(MT)系统在各类任务中的表现,包括翻译质量、性别偏见检测、附加毒性以及对拼写错误的鲁棒性。虽然已有多种工具包广泛用于基准测试大型语言模型(LLMs)的能力,但现有的评估工具往往缺乏全面评估MT性能各个方面的能力。MT-LENS通过扩展LM-eval-harness以支持机器翻译,解决了这些局限性,并且支持最先进的数据集和广泛的评估指标。它还提供了一个用户友好的平台,用于比较系统并借助交互式可视化分析翻译结果。MT-LENS旨在让超越传统翻译质量评价的评估策略更易于使用,使研究人员和工程师能够更好地理解NMT模型的表现,并轻松测量系统的偏见。
https://arxiv.org/abs/2412.11615
Large language models (LLMs) have shown superior capabilities in translating figurative language compared to neural machine translation (NMT) systems. However, the impact of different prompting methods and LLM-NMT combinations on idiom translation has yet to be thoroughly investigated. This paper introduces two parallel datasets of sentences containing idiomatic expressions for Persian→English and English→Persian translations, with Persian idioms sampled from our PersianIdioms resource, a collection of 2,200 idioms and their meanings. Using these datasets, we evaluate various open- and closed-source LLMs, NMT models, and their combinations. Translation quality is assessed through idiom translation accuracy and fluency. We also find that automatic evaluation methods like LLM-as-a-judge, BLEU and BERTScore are effective for comparing different aspects of model performance. Our experiments reveal that Claude-3.5-Sonnet delivers outstanding results in both translation directions. For English→Persian, combining weaker LLMs with Google Translate improves results, while Persian→English translations benefit from single prompts for simpler models and complex prompts for advanced ones.
大型语言模型(LLMs)在翻译比喻性语言方面显示出优于神经机器翻译(NMT)系统的卓越能力。然而,不同的提示方法以及LLM-NMT组合对成语翻译的影响尚未得到彻底研究。本文引入了两个平行的数据集,其中包含波斯语→英语和英语→波斯语的含有成语表达的句子,波斯语成语从我们的PersianIdioms资源中采样,该资源包含了2,200个成语及其含义。利用这些数据集,我们评估了各种开源和闭源LLMs、NMT模型及它们的不同组合。翻译质量通过成语翻译准确性和流利度进行评估。我们还发现自动评估方法如以LLM作为评判者、BLEU 和 BERTScore 对比较不同方面的模型性能非常有效。我们的实验表明,Claude-3.5-Sonnet 在两个翻译方向上都取得了出色的结果。对于英语→波斯语的翻译,较弱的LLMs与Google Translate结合可以改善结果;而波斯语→英语的翻译则从简单模型的单个提示和先进模型的复杂提示中受益。
https://arxiv.org/abs/2412.09993
Neural Machine Translation (NMT) models are typically trained on datasets with limited exposure to Scientific, Technical and Educational domains. Translation models thus generally struggle with tasks that involve scientific understanding or technical jargon. Their performance is found to be even worse for low-resource Indian languages. Finding a translation dataset that caters to these domains in particular poses a difficult challenge. In this paper, we address this by creating a multilingual parallel corpus containing more than 2.8 million rows of English-to-Indic and Indic-to-Indic high-quality translation pairs across 8 Indian languages. We achieve this by bitext mining human-translated transcriptions of NPTEL video lectures. We also finetune and evaluate NMT models using this corpus and surpass all other publicly available models at in-domain tasks. We also demonstrate the potential for generalizing to out-of-domain translation tasks by improving the baseline by over 2 BLEU on average for these Indian languages on the Flores+ benchmark. We are pleased to release our model and dataset via this link: this https URL.
神经机器翻译(NMT)模型通常是在对科学、技术和教育领域暴露有限的数据集上进行训练的。因此,这些模型在涉及科学理解或技术术语的任务中往往表现不佳。对于低资源印度语言来说,其性能甚至更差。找到一个特别倾向于这些领域的翻译数据集是一个艰巨的挑战。在这篇论文中,我们通过创建一个多语种平行语料库来解决这个问题,该语料库包含超过280万行英-印和印-印高质量翻译对,覆盖了8种印度语言。我们通过挖掘NPTEL视频讲座的人工翻译字幕实现了这一目标。此外,我们还使用这个语料库微调并评估NMT模型,并在领域内任务上超越了所有其他公开可用的模型。我们还展示了该模型在外域翻译任务中的泛化能力,在Flores+基准测试中,这些印度语言的基线平均提高了超过2个BLEU分数。我们很高兴通过以下链接发布我们的模型和数据集:此 https URL。
https://arxiv.org/abs/2412.09025
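A generic sketch of the bitext-mining step with multilingual sentence embeddings, using LaBSE as a stand-in encoder and a plain cosine-similarity threshold; the paper's actual mining setup over the NPTEL transcriptions is not assumed.

```python
# Generic bitext-mining sketch with a multilingual sentence encoder (LaBSE as a
# stand-in) and a cosine-similarity threshold; not the paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(english_sents, indic_sents, threshold: float = 0.8):
    en = model.encode(english_sents, convert_to_tensor=True, normalize_embeddings=True)
    ic = model.encode(indic_sents, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(en, ic)                      # (|en|, |indic|) similarity matrix
    pairs = []
    for i in range(sims.shape[0]):
        j = int(sims[i].argmax())
        if float(sims[i, j]) >= threshold:           # keep only confident matches
            pairs.append((english_sents[i], indic_sents[j], float(sims[i, j])))
    return pairs
```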
Neural machine translation (NMT) systems amplify lexical biases present in their training data, leading to artificially impoverished language in output translations. These language-level characteristics render automatic translations different from text originally written in a language and from human translations, which hinders their usefulness in, for example, creating evaluation datasets. Attempts to increase naturalness in NMT can fall short in terms of content preservation, where increased lexical diversity comes at the cost of translation accuracy. Inspired by the reinforcement learning from human feedback framework, we introduce a novel method that rewards both naturalness and content preservation. We experiment with multiple perspectives to produce more natural translations, aiming at reducing machine and human translationese. We evaluate our method on English-to-Dutch literary translation, and find that our best model produces translations that are lexically richer and exhibit more properties of human-written language, without loss in translation accuracy.
神经机器翻译(NMT)系统会放大训练数据中存在的词汇偏差,导致输出译文的语言人为地变得贫乏。这些语言层面的特征使得自动翻译既不同于以该语言原创撰写的文本,也不同于人工翻译,从而限制了它们在(例如)构建评估数据集等方面的实用性。提高NMT自然度的尝试可能在内容保真方面有所欠缺,即词汇多样性的提升以牺牲翻译准确性为代价。受人类反馈强化学习框架的启发,我们提出了一种同时奖励自然度和内容保真度的新方法。我们从多个角度进行实验以生成更自然的翻译,旨在减少机器与人工翻译中的翻译腔。我们在英语到荷兰语的文学翻译上评估了该方法,发现我们的最佳模型生成的译文词汇更丰富,并表现出更多人类书面语言的特点,同时翻译准确性没有损失。
https://arxiv.org/abs/2412.08473
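A schematic sketch of the reward idea: mix a naturalness signal with a content-preservation signal and use the mix to weight a policy-gradient-style loss over sampled translations. Both scorers and the mixing weight below are placeholders, not the paper's exact objective.

```python
# Schematic policy-gradient sketch: combine naturalness and content-preservation
# scores (placeholders here) and weight sequence log-probabilities with them.
import torch

def combined_reward(naturalness: torch.Tensor, preservation: torch.Tensor,
                    beta: float = 0.5) -> torch.Tensor:
    return beta * naturalness + (1.0 - beta) * preservation

def policy_gradient_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # Subtract the batch mean as a simple baseline to reduce variance.
    advantages = rewards - rewards.mean()
    return -(advantages.detach() * seq_logprobs).mean()

# Dummy batch of 4 sampled translations: log-probabilities and the two scores.
seq_logprobs = torch.tensor([-12.3, -9.8, -15.1, -11.0], requires_grad=True)
naturalness  = torch.tensor([0.6, 0.9, 0.4, 0.7])     # e.g. an LM-based naturalness score
preservation = torch.tensor([0.8, 0.7, 0.9, 0.8])     # e.g. a COMET-style adequacy score
loss = policy_gradient_loss(seq_logprobs, combined_reward(naturalness, preservation))
loss.backward()
print(loss.item())
```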
Existing multilingual neural machine translation (MNMT) approaches mainly focus on improving models with the encoder-decoder architecture to translate multiple languages. However, decoder-only architecture has been explored less in MNMT due to its underperformance when trained on parallel data solely. In this work, we attribute the issue of the decoder-only architecture to its lack of language transfer capability. Specifically, the decoder-only architecture is insufficient in encoding source tokens with the target language features. We propose dividing the decoding process into two stages so that target tokens are explicitly excluded in the first stage to implicitly boost the transfer capability across languages. Additionally, we impose contrastive learning on translation instructions, resulting in improved performance in zero-shot translation. We conduct experiments on TED-19 and OPUS-100 datasets, considering both training from scratch and fine-tuning scenarios. Experimental results show that, compared to the encoder-decoder architecture, our methods not only perform competitively in supervised translations but also achieve improvements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in zero-shot translations.
https://arxiv.org/abs/2412.02101
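The contrastive term on translation instructions can be pictured as a standard InfoNCE loss that pulls together representations of matching instructions and pushes apart the rest; how the paper encodes instructions and selects positives is not assumed here.

```python
# Schematic InfoNCE sketch for contrastive learning over instruction
# representations; the pairing strategy is an assumption, not the paper's.
import torch
import torch.nn.functional as F

def instruction_info_nce(anchor: torch.Tensor, positive: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (batch, dim); row i of `positive` matches row i of `anchor`."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                 # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))                 # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

print(instruction_info_nce(torch.randn(8, 256), torch.randn(8, 256)))
```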
This paper presents a multi-way parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource languages. Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil. We also carry out a detailed investigation on the NER capabilities of different types of mLMs. Finally, we demonstrate the utility of our NER system on a low-resource Neural Machine Translation (NMT) task. Our dataset is publicly released: this https URL.
https://arxiv.org/abs/2412.02056
Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.
https://arxiv.org/abs/2412.00966
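In the spirit of dictionary-based pseudo-parallel data creation (roughly the DALI recipe, as I understand it): word-translate the high-resource in-domain monolingual corpus through the bilingual dictionary to obtain noisy in-domain training pairs. The dictionary entries below are toy placeholders.

```python
# Rough sketch of dictionary-based pseudo-parallel data creation: substitute each
# word of the in-domain monolingual corpus via a bilingual dictionary to build
# noisy parallel data for fine-tuning. Dictionary entries are toy placeholders.
def make_pseudo_parallel(mono_corpus, dictionary, keep_unknown=True):
    pairs = []
    for sentence in mono_corpus:
        translated = []
        for word in sentence.lower().split():
            if word in dictionary:
                translated.append(dictionary[word])
            elif keep_unknown:
                translated.append(word)            # copy OOV words through unchanged
        pairs.append((sentence, " ".join(translated)))
    return pairs

dictionary = {"the": "la", "doctor": "docteur", "arrived": "arrivé"}   # toy entries
print(make_pseudo_parallel(["The doctor arrived"], dictionary))
```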
Real-time conversational AI agents face challenges in performing Natural Language Understanding (NLU) in dynamic, outdoor environments like automated drive-thru systems. These settings require NLU models to handle background noise, diverse accents, and multi-intent queries while operating under strict latency and memory constraints on edge devices. Additionally, robustness to errors from upstream Automatic Speech Recognition (ASR) is crucial, as ASR outputs in these environments are often noisy. We introduce Babylon, a transformer-based architecture that tackles NLU as an intent translation task, converting natural language inputs into sequences of regular language units ('transcodes') that encode both intents and slot information. This formulation allows Babylon to manage multi-intent scenarios in a single dialogue turn. Furthermore, Babylon incorporates an LSTM-based token pooling mechanism to preprocess phoneme sequences, reducing input length and optimizing for low-latency, low-memory edge deployment. This also helps mitigate inaccuracies in ASR outputs, enhancing system robustness. While this work focuses on drive-thru ordering, Babylon's design extends to similar noise-prone scenarios, e.g., ticketing kiosks. Our experiments show that Babylon achieves significantly better accuracy-latency-memory footprint trade-offs over typically employed NMT models like Flan-T5 and BART, demonstrating its effectiveness for real-time NLU in edge deployment settings.
实时对话人工智能代理在动态的户外环境(比如自动得来速系统)中执行自然语言理解(NLU)时面临挑战。这些场景要求NLU模型处理背景噪音、多样的口音以及多意图查询,同时还要在边缘设备上满足严格的延迟和内存限制。此外,对上游自动语音识别(ASR)错误的鲁棒性也至关重要,因为这类环境中的ASR输出通常含有噪声。我们引入了Babylon,这是一种基于Transformer的架构,它将NLU视为一种意图翻译任务,把自然语言输入转换为一系列常规语言单元("转码"),这些单元同时编码了意图和槽位信息。这种表述方式使Babylon能够在单一对话轮次中处理多意图场景。此外,Babylon还采用了基于LSTM的标记池化机制来预处理音素序列,从而缩短输入长度,并针对低延迟、低内存的边缘部署进行优化。这也有助于减轻ASR输出中的不准确之处,提高系统的鲁棒性。尽管这项工作专注于得来速点餐场景,但Babylon的设计同样适用于类似的噪声环境,例如售票亭。实验表明,与通常使用的NMT模型(如Flan-T5和BART)相比,Babylon在准确性-延迟-内存占用的权衡上显著更优,证明了其在边缘部署环境下进行实时NLU的有效性。
https://arxiv.org/abs/2411.15372
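A hedged sketch of an LSTM-based token-pooling front end of the kind described: run an LSTM over phoneme embeddings and keep every k-th hidden state, shortening the sequence before the transformer. Dimensions, stride, and vocabulary size are assumptions.

```python
# Hedged sketch of LSTM-based token pooling: encode phonemes with an LSTM and keep
# every k-th hidden state to shorten the sequence. Sizes are assumptions.
import torch
import torch.nn as nn

class LSTMTokenPooler(nn.Module):
    def __init__(self, n_phonemes=64, dim=128, stride=4):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.stride = stride

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        x, _ = self.lstm(self.embed(phoneme_ids))             # (batch, T, dim)
        return x[:, self.stride - 1 :: self.stride, :]        # (batch, T // stride, dim)

pooler = LSTMTokenPooler()
phonemes = torch.randint(0, 64, (2, 40))                      # batch of 2, 40 phonemes each
print(pooler(phonemes).shape)                                 # -> torch.Size([2, 10, 128])
```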