Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). These include sparse parallel data and the handling of linguistic and cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.
https://arxiv.org/abs/2504.00339
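A minimal sketch of the efficient fine-tuning stage this pipeline relies on, written with Hugging Face transformers, bitsandbytes, and peft as a generic stand-in for the Unsloth/QLoRA setup; the model id sail/Sailor-1.8B, the adapter hyperparameters, and the prompt format are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: QLoRA-style fine-tuning of a small autoregressive model.
# Model id and hyperparameters are illustrative assumptions, not the exact
# VNJPTranslate configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "sail/Sailor-1.8B"  # assumed Hugging Face id of the 1.8B Sailor model

# 4-bit NF4 quantization (the "Q" in QLoRA) keeps the frozen base weights small.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Low-rank adapters are the only trainable parameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Translation pairs are framed as instruction-style prompts for the causal LM
# (assumed format, shown only to illustrate how Vi-Ja pairs could be fed in).
example = tokenizer(
    "Dịch sang tiếng Nhật: Xin chào thế giới.\n日本語訳: こんにちは、世界。",
    return_tensors="pt",
)
```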
Low-Resource Languages (LRLs) present significant challenges in natural language processing due to their limited linguistic resources and underrepresentation in standard datasets. While recent advancements in Large Language Models (LLMs) and Neural Machine Translation (NMT) have substantially improved translation capabilities for high-resource languages, performance disparities persist for LRLs, particularly impacting privacy-sensitive and resource-constrained scenarios. This paper systematically evaluates the limitations of current LLMs across 200 languages using benchmarks such as FLORES-200. We also explore alternative data sources, including news articles and bilingual dictionaries, and demonstrate how knowledge distillation from large pre-trained models can significantly improve smaller LRL translations. Additionally, we investigate various fine-tuning strategies, revealing that incremental enhancements markedly reduce performance gaps on smaller LLMs.
https://arxiv.org/abs/2503.24102
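As a concrete illustration of the corpus-level automatic scoring that FLORES-200-style evaluations rely on, here is a small sacrebleu sketch; the hypothesis and reference sentences are invented placeholders rather than benchmark data.

```python
# Toy example of corpus-level BLEU / chrF scoring as used for FLORES-200-style
# benchmarks; the sentences below are invented placeholders, not benchmark data.
import sacrebleu

hypotheses = [
    "The cat sits on the mat.",
    "She bought three apples at the market.",
]
references = [
    "The cat is sitting on the mat.",
    "She bought three apples at the market.",
]

# sacrebleu expects a list of reference streams (one list per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.2f}")
print(f"chrF: {chrf.score:.2f}")
```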
Fine-tuning multilingual sequence-to-sequence large language models (msLLMs) has shown promise in developing neural machine translation (NMT) systems for low-resource languages (LRLs). However, conventional single-stage fine-tuning methods struggle in extremely low-resource NMT settings, where training data is very limited. This paper contributes to artificial intelligence by proposing two approaches for adapting msLLMs in these challenging scenarios: (1) continual pre-training (CPT), where the msLLM is further trained with domain-specific monolingual data to compensate for the under-representation of LRLs, and (2) intermediate task transfer learning (ITTL), a method that fine-tunes the msLLM with both in-domain and out-of-domain parallel data to enhance its translation capabilities across various domains and tasks. As an application in engineering, these methods are implemented in NMT systems for Sinhala, Tamil, and English (six language pairs) in domain-specific, extremely low-resource settings (datasets containing fewer than 100,000 samples). Our experiments reveal that these approaches enhance translation performance by an average of +1.47 bilingual evaluation understudy (BLEU) score compared to the standard single-stage fine-tuning baseline across all translation directions. Additionally, a multi-model ensemble further improves performance by an additional BLEU score.
https://arxiv.org/abs/2503.22582
We explore the impact of multi-source input strategies on machine translation (MT) quality, comparing GPT-4o, a large language model (LLM), with a traditional multilingual neural machine translation (NMT) system. Using intermediate language translations as contextual cues, we evaluate their effectiveness in enhancing English and Chinese translations into Portuguese. Results suggest that contextual information significantly improves translation quality for domain-specific datasets and potentially for linguistically distant language pairs, with diminishing returns observed in benchmarks with high linguistic variability. Additionally, we demonstrate that shallow fusion, a multi-source approach we apply within the NMT system, shows improved results when using high-resource languages as context for other translation pairs, highlighting the importance of strategic context language selection.
https://arxiv.org/abs/2503.07195
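A toy numpy sketch of shallow fusion at a single decoding step, where the primary model's next-token distribution is interpolated in log space with that of a context (multi-source) model; the vocabulary, scores, and fusion weight are invented for illustration and do not reproduce the paper's system.

```python
# Minimal sketch of shallow fusion for one decoding step: the primary model's
# next-token log-probabilities are combined with those of an auxiliary
# (context / multi-source) model. Vocabulary, scores and the weight lambda
# are toy values.
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

vocab = ["olá", "mundo", "gato", "casa"]

primary_logits = np.array([2.0, 1.5, 0.2, -1.0])   # target-language NMT model
context_logits = np.array([1.0, 2.5, -0.5, -1.0])  # model conditioned on the intermediate/context translation

lam = 0.3  # fusion weight for the context model
fused = log_softmax(primary_logits) + lam * log_softmax(context_logits)

next_token = vocab[int(np.argmax(fused))]
print("fused next token:", next_token)
```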
The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on the WMT and our datasets show that results using our method match or surpass a range of baselines in terms of translation quality, but achieve $2.4 \sim 6.5 \times$ inference speedups and a $75\%$ reduction in the memory footprint of the KV cache. It also demonstrates strong generalization across a variety of translation-related tasks.
https://arxiv.org/abs/2503.06594
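A schematic PyTorch sketch of the general idea of reusing a frozen LLM-style stack as the NMT encoder while keeping a conventional trainable decoder on top; the module sizes are toy values and a plain TransformerEncoder stands in for the LLM, so this is an assumption-laden illustration rather than the paper's implementation.

```python
# Schematic sketch (toy sizes): reuse a frozen "LLM" block purely as the NMT
# encoder and keep a small standard Transformer decoder on top of it.
import torch
import torch.nn as nn

d_model, vocab = 64, 1000

# Stand-in for the frozen LLM used for encoding (here: a tiny Transformer stack).
llm_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
for p in llm_encoder.parameters():
    p.requires_grad = False  # the "LLM" side stays frozen / lightly adapted

# Conventional trainable NMT decoder attending to the encoder's hidden states.
nmt_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
src_embed = nn.Embedding(vocab, d_model)
tgt_embed = nn.Embedding(vocab, d_model)
out_proj = nn.Linear(d_model, vocab)

src = torch.randint(0, vocab, (2, 12))  # batch of source token ids
tgt = torch.randint(0, vocab, (2, 7))   # shifted target token ids

memory = llm_encoder(src_embed(src))                 # LLM-as-encoder states
logits = out_proj(nmt_decoder(tgt_embed(tgt), memory))
print(logits.shape)  # torch.Size([2, 7, 1000])
```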
Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training the NMT systems with the top-ranked samples, produces better NMT performance than training on the full dataset. However, previous research has shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En$\rightarrow$Si, En$\rightarrow$Ta and Si$\rightarrow$Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that by employing a series of heuristics, this noise can be removed to a certain extent. This improves the performance of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
https://arxiv.org/abs/2502.19074
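A small sketch of similarity-based ranking of web-mined pairs with one example multiPLM (LaBSE via sentence-transformers), combined with two simplified heuristic filters (copy detection and length ratio); the heuristics and example pairs are illustrative assumptions, not the paper's exact rule set.

```python
# Sketch: rank web-mined sentence pairs by embedding similarity from one
# example multiPLM (LaBSE), after two simple heuristic filters. The heuristics
# are simplified illustrations, not the paper's exact rules.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

pairs = [
    ("I like tea.", "මම තේ වලට කැමතියි."),
    ("Buy now!!! http://spam.example", "දැන් මිලදී ගන්න!!!"),
    ("The weather is nice today.", "The weather is nice today."),  # copied, not translated
]

def keep(src, tgt):
    # Heuristic 1: discard near-copies of the source on the target side.
    if src.strip().lower() == tgt.strip().lower():
        return False
    # Heuristic 2: discard pairs with an extreme character-length ratio.
    ratio = len(src) / max(len(tgt), 1)
    return 0.5 <= ratio <= 2.0

filtered = [(s, t) for s, t in pairs if keep(s, t)]
src_emb = model.encode([s for s, _ in filtered], normalize_embeddings=True)
tgt_emb = model.encode([t for _, t in filtered], normalize_embeddings=True)
scores = (src_emb * tgt_emb).sum(axis=1)  # cosine similarity (vectors are normalized)

ranked = sorted(zip(scores, filtered), key=lambda x: -x[0])
for score, (s, t) in ranked:
    print(f"{score:.3f}  {s} ||| {t}")
```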
Does multilingual Neural Machine Translation (NMT) lead to the Curse of Multilinguality, or does it provide cross-lingual knowledge transfer within a language family? In this study, we explore multiple approaches for extending the available data regime in NMT and we prove cross-lingual benefits even in the 0-shot translation regime for low-resource languages. With this paper, we provide state-of-the-art open-source NMT models for translating between selected Slavic languages. We released our models on the HuggingFace Hub (this https URL) under the CC BY 4.0 license. The Slavic language family comprises morphologically rich Central and Eastern European languages. Although it counts hundreds of millions of native speakers, Slavic Neural Machine Translation is, in our opinion, under-studied. Recently, most NMT research has focused either on high-resource languages like English, Spanish, and German (in the WMT23 General Translation Task, 7 out of 8 task directions are from or to English), on massively multilingual models covering multiple language groups, or on evaluation techniques.
https://arxiv.org/abs/2502.14509
Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advances in large language models (LLMs) have further enhanced the performance of embedding models. While these models are often benchmarked on general-purpose datasets, real-world applications demand domain-specific evaluation. In this work, we introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a specialized counterpart to MTEB designed for the financial domain. FinMTEB comprises 64 financial domain-specific embedding datasets across 7 tasks that cover diverse textual types in both Chinese and English, such as financial news articles, corporate annual reports, ESG reports, regulatory filings, and earnings call transcripts. We also develop a finance-adapted model, FinPersona-E5, using a persona-based data synthetic method to cover diverse financial embedding tasks for training. Through extensive evaluation of 15 embedding models, including FinPersona-E5, we show three key findings: (1) performance on general-purpose benchmarks shows limited correlation with financial domain tasks; (2) domain-adapted models consistently outperform their general-purpose counterparts; and (3) surprisingly, a simple Bag-of-Words (BoW) approach outperforms sophisticated dense embeddings in financial Semantic Textual Similarity (STS) tasks, underscoring current limitations in dense embedding techniques. Our work establishes a robust evaluation framework for financial NLP applications and provides crucial insights for developing domain-specific embedding models.
https://arxiv.org/abs/2502.10990
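To make the Bag-of-Words finding concrete, here is a toy scikit-learn sketch that scores sentence pairs by cosine similarity over raw count vectors; the financial sentences are invented placeholders, not FinMTEB data.

```python
# Toy illustration of the Bag-of-Words STS baseline: cosine similarity over
# simple count vectors. Sentences are invented placeholders, not FinMTEB data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [
    ("Net revenue increased 12% year over year.",
     "Year-over-year net revenue rose by twelve percent."),
    ("The board declared a quarterly dividend.",
     "The firm issued new convertible bonds."),
]

vectorizer = CountVectorizer().fit([s for pair in pairs for s in pair])
for a, b in pairs:
    va, vb = vectorizer.transform([a]), vectorizer.transform([b])
    sim = cosine_similarity(va, vb)[0, 0]
    print(f"{sim:.3f}  {a!r} vs {b!r}")
```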
An emerging research direction in NMT involves the use of Quality Estimation (QE) models, which have demonstrated high correlations with human judgment and can enhance translations through Quality-Aware Decoding. Although several approaches have been proposed based on sampling multiple candidate translations, none have integrated these models directly into the decoding process. In this paper, we address this by proposing a novel token-level QE model capable of reliably scoring partial translations. We build a uni-directional QE model for this, as decoder models are inherently trained and efficient on partial sequences. We then present a decoding strategy that integrates the QE model for Quality-Aware decoding and demonstrate that the translation quality improves when compared to the N-best list re-ranking with state-of-the-art QE models (up to $1.39$ XCOMET-XXL $\uparrow$). Finally, we show that our approach provides significant benefits in document translation tasks, where the quality of N-best lists is typically suboptimal.
https://arxiv.org/abs/2502.08561
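A toy sketch of the quality-aware decoding idea: candidate continuations are ranked by model log-probability plus a weighted score on the partial translation. The toy_qe_score function below is a crude stand-in stub for the paper's token-level QE model, and all numbers are invented.

```python
# Toy sketch of quality-aware decoding: at one step, candidate continuations
# are scored by model log-probability plus a weighted QE score on the partial
# translation. toy_qe_score is a stub standing in for a token-level QE model.
import math

def toy_qe_score(partial_tokens):
    # Stub: penalize repeated tokens as a crude "quality" proxy.
    return -0.5 * (len(partial_tokens) - len(set(partial_tokens)))

# Candidate continuations with their model log-probabilities at this step.
candidates = {
    "chat": math.log(0.50),
    "chien": math.log(0.30),
    "chat chat": math.log(0.15),
}

prefix = ["le"]
beta = 1.0  # weight of the QE signal relative to the model score

def combined(continuation, logprob):
    partial = prefix + continuation.split()
    return logprob + beta * toy_qe_score(partial)

best = max(candidates.items(), key=lambda kv: combined(*kv))
print("selected continuation:", best[0])
```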
Emergent Communication (EC) provides a unique window into the language systems that emerge autonomously when agents are trained to jointly achieve shared goals. However, it is difficult to interpret EC and evaluate its relationship with natural languages (NL). This study employs unsupervised neural machine translation (UNMT) techniques to decipher ECs formed during referential games with varying task complexities, influenced by the semantic diversity of the environment. Our findings demonstrate UNMT's potential to translate EC, illustrating that task complexity characterized by semantic diversity enhances EC translatability, while higher task complexity with constrained semantic variability exhibits pragmatic EC, which, although challenging to interpret, remains suitable for translation. This research marks the first attempt, to our knowledge, to translate EC without the aid of parallel data.
https://arxiv.org/abs/2502.07552
Domain specificity of embedding models is critical for effective performance. However, existing benchmarks, such as FinMTEB, are primarily designed for high-resource languages, leaving low-resource settings, such as Korean, under-explored. Directly translating established English benchmarks often fails to capture the linguistic and cultural nuances present in low-resource domains. In this paper, titled TWICE: What Advantages Can Low-Resource Domain-Specific Embedding Models Bring? A Case Study on Korea Financial Texts, we introduce KorFinMTEB, a novel benchmark for the Korean financial domain, specifically tailored to reflect its unique cultural characteristics in low-resource languages. Our experimental results reveal that while the models perform robustly on a translated version of FinMTEB, their performance on KorFinMTEB uncovers subtle yet critical discrepancies, especially in tasks requiring deeper semantic understanding, that underscore the limitations of direct translation. This discrepancy highlights the necessity of benchmarks that incorporate language-specific idiosyncrasies and cultural nuances. The insights from our study advocate for the development of domain-specific evaluation frameworks that can more accurately assess and drive the progress of embedding models in low-resource settings.
https://arxiv.org/abs/2502.07131
Multilingual neural machine translation (MNMT) aims at using one single model for multiple translation directions. Recent work applies non-autoregressive Transformers to improve the efficiency of MNMT, but requires expensive knowledge distillation (KD) processes. To this end, we propose an M-DAT approach to non-autoregressive multilingual machine translation. Our system leverages the recent advance of the directed acyclic Transformer (DAT), which does not require KD. We further propose a pivot back-translation (PivotBT) approach to improve the generalization to unseen translation directions. Experiments show that our M-DAT achieves state-of-the-art performance in non-autoregressive MNMT.
https://arxiv.org/abs/2502.04537
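A pure-Python sketch of the pivot back-translation idea for unseen directions: the source is translated into a well-supported pivot language and then into the target, and the result is paired with the original source sentence. The toy_translate function is a dictionary stub standing in for a real MNMT model, and the language pairs are arbitrary examples.

```python
# Toy sketch of pivot back-translation (PivotBT): to synthesize data for an
# unseen direction X->Y, translate X into a well-supported pivot language and
# then into Y, and pair the result with the original X sentence.
# toy_translate is a dictionary stub standing in for a real MNMT model.
TOY_TABLE = {
    ("de", "en"): {"die katze schläft": "the cat sleeps"},
    ("en", "ro"): {"the cat sleeps": "pisica doarme"},
}

def toy_translate(sentence, src, tgt):
    return TOY_TABLE[(src, tgt)].get(sentence, sentence)

def pivot_back_translate(sentences, src, tgt, pivot="en"):
    synthetic_pairs = []
    for s in sentences:
        pivot_sent = toy_translate(s, src, pivot)          # X -> pivot
        tgt_sent = toy_translate(pivot_sent, pivot, tgt)   # pivot -> Y
        synthetic_pairs.append((s, tgt_sent))              # synthetic (X, Y) pair
    return synthetic_pairs

print(pivot_back_translate(["die katze schläft"], src="de", tgt="ro"))
```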
In this work, we explore how instance-level memorization in the teacher Neural Machine Translation (NMT) model gets inherited by the student model in sequence-level knowledge distillation (SeqKD). We find that despite not directly seeing the original training data, students memorize more than baseline models (models of the same size, trained on the original data) -- 3.4% for exact matches and 57% for extractive memorization -- and show increased hallucination rates. Further, under this SeqKD setting, we also characterize how students behave on specific training data subgroups, such as subgroups with low quality and specific counterfactual memorization (CM) scores, and find that students exhibit amplified denoising on low-quality subgroups. Finally, we propose a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, we recommend caution when applying SeqKD: students inherit both their teachers' superior performance and their fault modes, thereby requiring active monitoring.
https://arxiv.org/abs/2502.01491
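A simplified sketch of one way to flag extractive memorization: check whether a student output reproduces a long contiguous n-gram from some training target. The n-gram threshold and the example sentences are illustrative and do not follow the paper's exact protocol.

```python
# Toy sketch of an extractive-memorization check: flag a student output if it
# reproduces a long contiguous n-gram from a training target. Threshold and
# examples are illustrative only.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_extractive(output, training_targets, n=6):
    out_ngrams = ngrams(output.split(), n)
    for target in training_targets:
        if out_ngrams & ngrams(target.split(), n):
            return True
    return False

training_targets = [
    "the quick brown fox jumps over the lazy dog near the river",
]
student_output = "we saw the quick brown fox jumps over the lazy dog yesterday"

print(is_extractive(student_output, training_targets))  # True: a 6-gram is copied verbatim
```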
One approach for multilingual data-to-text generation is to translate grammatical configurations upfront from the source language into each target language. These configurations are then used by a surface realizer and in document planning stages to generate output. In this paper, we describe a rule-based NLG implementation of this approach where the configuration is translated by Neural Machine Translation (NMT) combined with a one-time human review, and introduce a cross-language grammar dependency model to create a multilingual NLG system that generates text from the source data, scaling the generation phase without a human in the loop. Additionally, we introduce a method for human post-editing evaluation on the automatically translated text. Our evaluation on the SportSett:Basketball dataset shows that our NLG system performs well, underlining its grammatical correctness in translation tasks.
https://arxiv.org/abs/2501.16135
This study introduces an approach to Estonian text simplification using two model architectures: a neural machine translation model and a fine-tuned large language model (LLaMA). Given the limited resources for Estonian, we developed a new dataset, the Estonian Simplification Dataset, combining translated data and GPT-4.0-generated simplifications. We benchmarked OpenNMT, a neural machine translation model that frames text simplification as a translation task, and fine-tuned the LLaMA model on our dataset to tailor it specifically for Estonian simplification. Manual evaluations on the test set show that the LLaMA model consistently outperforms OpenNMT in readability, grammaticality, and meaning preservation. These findings underscore the potential of large language models for low-resource languages and provide a basis for further research in Estonian text simplification.
https://arxiv.org/abs/2501.15624
Ensembling neural machine translation (NMT) models to produce higher-quality translations than the $L$ individual models has been extensively studied. Recent methods typically employ a candidate selection block (CSB) and an encoder-decoder fusion block (FB), requiring inference across all candidate models, leading to significant computational overhead, generally $\Omega(L)$. This paper introduces SmartGen, a reinforcement learning (RL)-based strategy that improves the CSB by selecting a small, fixed number of candidates and identifying optimal groups to pass to the fusion block for each input sentence. Furthermore, previously, the CSB and FB were trained independently, leading to suboptimal NMT performance. Our DQN-based SmartGen addresses this by using feedback from the FB block as a reward during training. We also resolve a key issue in earlier methods, where candidates were passed to the FB without modification, by introducing a Competitive Correction Block (CCB). Finally, we validate our approach with extensive experiments on English-Hindi translation tasks in both directions.
https://arxiv.org/abs/2501.15219
Generating adversarial examples contributes to mainstream neural machine translation (NMT) robustness. However, popular adversarial policies are suited to fixed tokenization, hindering their efficacy for common character perturbations involving versatile tokenization. Building on existing adversarial generation via reinforcement learning (RL), we propose the 'DexChar policy', which introduces character perturbations for the existing mainstream adversarial policy based on token substitution. Furthermore, we improve the self-supervised matching that provides feedback in RL to cater to the semantic constraints required during adversary training. Experiments show that our method is compatible with scenarios where baseline adversaries fail, and can generate high-efficiency adversarial examples for analysis and optimization of the system.
https://arxiv.org/abs/2501.12183
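A minimal sketch of the kind of character-level perturbation operations (swap, delete, insert) that a character-aware adversarial policy can draw on; the reinforcement-learning policy that decides where and how to perturb is not modeled here, and the example word is arbitrary.

```python
# Toy character-perturbation operations of the kind a character-level
# adversarial policy can use (swap, delete, insert). The RL policy that
# chooses where and how to perturb is not modeled here.
import random

def swap_chars(word, i):
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def delete_char(word, i):
    return word[:i] + word[i + 1:]

def insert_char(word, i, c):
    return word[:i] + c + word[i:]

random.seed(0)
word = "translation"
i = random.randrange(len(word) - 1)
print(swap_chars(word, i), delete_char(word, i), insert_char(word, i, "x"))
```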
Despite achieving remarkable performance, machine translation (MT) research remains underexplored in terms of translating cultural elements in languages, such as idioms, proverbs, and colloquial expressions. This paper investigates the capability of state-of-the-art neural machine translation (NMT) and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.
https://arxiv.org/abs/2501.11953
This paper presents the results of the VLSP 2022-2023 Machine Translation Shared Tasks, focusing on Vietnamese-Chinese and Vietnamese-Lao machine translation. The tasks were organized as part of the 9th and 10th annual workshops on Vietnamese Language and Speech Processing (VLSP 2022, VLSP 2023). The objective of the shared task was to build machine translation systems specifically targeting Vietnamese-Chinese and Vietnamese-Lao translation (corresponding to 4 translation directions). Submissions were evaluated on 1,000 test pairs (news and general domains) using established metrics like BLEU [11] and SacreBLEU [12]. Additionally, system outputs were also evaluated with human judgments provided by experts in Chinese and Lao. These human assessments played a crucial role in ranking the performance of the machine translation models, ensuring a more comprehensive evaluation.
https://arxiv.org/abs/2501.08621
This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.
https://arxiv.org/abs/2501.06374