LLMs are proving adept at machine translation, although, owing to their generative nature, they may at times overgenerate in various ways. These overgenerations differ from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations in which the LLM acts as a human translator would, enabling greater comprehension for the target audience. Detecting these overgenerations and determining their exact nature is a challenging task. We detail the different strategies we have explored for our work in a commercial setting and present our results.
https://arxiv.org/abs/2604.15165
Neural machine translation (NMT) from Chinese to low-resource Southeast Asian languages remains severely constrained by the extreme scarcity of clean parallel corpora and the pervasive noise in existing mined data. This chronic shortage not only impedes effective model training but also sustains a large performance gap with high-resource directions, leaving millions of speakers of languages such as Lao, Burmese, and Tagalog with persistently low-quality translation systems despite recent advances in large multilingual models. We introduce Multilingual Expert-Reward Informed Tuning (MERIT), a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian low-resource languages (LRLs). Our framework combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by a semantic alignment reward (SAR). Our results confirm that, in LRL→Chinese translation, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.
https://arxiv.org/abs/2604.04839
Most modern neural machine translation (NMT) models are based on an encoder-decoder framework with an attention mechanism. While they perform well on standard datasets, they can struggle to translate long inputs that are rare or unseen during training. Incorporating target syntax is one approach to dealing with such length-related problems. We propose a novel syntactic decoder that generates a target-language dependency tree in a top-down, left-to-right order. Experiments show that the proposed top-down string-to-tree decoding generalizes better than conventional sequence-to-sequence decoding in translating long inputs that are not observed in the training data.
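The top-down, left-to-right generation order can be sketched as a pre-order traversal of the target dependency tree: a head word is emitted before its dependents, and siblings are emitted left to right. The dict-based tree format and the toy sentence below are illustrative assumptions, not the paper's actual decoder state.

```python
# Sketch: linearize a target-side dependency tree in top-down,
# left-to-right order. tree maps each head word to its ordered children.

def topdown_linearize(tree, root):
    """Emit tokens in pre-order: head first, then each child subtree
    left to right."""
    out = [root]
    for child in tree.get(root, []):
        out.extend(topdown_linearize(tree, child))
    return out

# Toy dependency tree for "she ate fresh fish": the head "ate" governs
# "she" and "fish"; "fish" governs the modifier "fresh".
toy_tree = {"ate": ["she", "fish"], "fish": ["fresh"]}
print(topdown_linearize(toy_tree, "ate"))  # ['ate', 'she', 'fish', 'fresh']
```

A real decoder would predict this sequence token by token, conditioning each prediction on the partially built tree rather than on a flat prefix.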
https://arxiv.org/abs/2603.27938
Machine translation (MT) evaluation has moved beyond aggregate metrics toward more specific linguistic phenomena. In the English-Chinese language pair, passive sentences are constructed and distributed differently due to language variation and thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to human translation, together with a test set with manually verified annotation. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are influenced more by the voice of the source text than by the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. The datasets and annotation script will be shared upon request.
https://arxiv.org/abs/2603.15227
This paper presents and evaluates an optimized cascaded Nepali speech-to-English text translation (S2TT) system, focusing on mitigating structural noise introduced by Automatic Speech Recognition (ASR). We first establish strong ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark. We empirically investigate the influence of punctuation loss, demonstrating that unpunctuated ASR output significantly degrades translation quality, causing a 20.7% relative BLEU drop on the FLORES benchmark. To overcome this, we propose and evaluate an intermediate Punctuation Restoration Module (PRM). The final S2TT pipeline was tested across three configurations on a custom dataset. The optimal configuration, which applied the PRM directly to the ASR output, achieved a 4.90 BLEU-point gain over the direct ASR-to-NMT baseline (BLEU 36.38 vs. 31.48). This improvement was validated by human assessment, which confirmed the optimized pipeline's superior adequacy (3.673) and fluency (3.804). This work validates targeted punctuation restoration as the most effective intervention for mitigating structural noise in the Nepali S2TT pipeline. It establishes an optimized baseline and demonstrates a critical architectural insight for developing cascaded speech translation systems for similar low-resource languages.
https://arxiv.org/abs/2602.21647
Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, fine-tuning the mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to one without, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
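The contrastive alignment step can be sketched with an InfoNCE-style objective: each sentence embedding is scored against its translation and against in-batch negatives, and the loss is the negative log-softmax of the true pair. The toy embeddings and the temperature value below are illustrative assumptions, not the paper's actual setup.

```python
import math

# Sketch of a contrastive (InfoNCE-style) alignment loss over a batch
# of anchor embeddings and their parallel-sentence positives.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.1):
    """Average -log softmax score of each anchor's true translation
    against all candidates in the batch."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += log_z - logits[i]
    return loss / len(anchors)

# Well-aligned batch: each anchor is close to its own positive.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
# Misaligned batch: the positives are swapped.
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
assert aligned < shuffled
```

With multi-way parallel data, the same sentence has positives in several languages at once, which is what distinguishes this setup from En-X bilingual pairs.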
https://arxiv.org/abs/2602.21543
Spoofing detection systems are typically trained using diverse recordings from multiple speakers, often under the assumption that the resulting embeddings are independent of speaker identity. However, this assumption remains unverified. In this paper, we investigate the impact of speaker information on spoofing detection systems. We propose two approaches within our Speaker-Invariant Multi-Task (SInMT) framework: one that models speaker identity within the embeddings and another that removes it. SInMT integrates multi-task learning for joint speaker recognition and spoofing detection, incorporating a gradient reversal layer. Evaluated on four datasets, our speaker-invariant model reduces the average equal error rate by 17% compared to the baseline, with up to a 48% reduction for the most challenging attacks (e.g., A11).
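The gradient reversal layer mentioned above is the identity on the forward pass but flips the sign of the gradient (scaled by a factor lambda) on the backward pass, so the shared encoder is pushed to *remove* information the speaker-recognition head could exploit. A minimal sketch of the two passes, with an illustrative lambda value:

```python
# Sketch of a gradient reversal layer (GRL): identity forward,
# negated and scaled gradient backward.

def grl_forward(x):
    """Forward pass: pass activations through unchanged."""
    return x

def grl_backward(grad, lam=1.0):
    """Backward pass: reverse the gradient flowing into the encoder,
    scaled by lam."""
    return [-lam * g for g in grad]

# A gradient that would improve the speaker head becomes, after
# reversal, an update that makes the shared embedding less
# speaker-specific.
upstream = [0.5, -0.25, 1.0]
print(grl_backward(upstream, lam=0.5))  # [-0.25, 0.125, -0.5]
```

In an autograd framework this is usually implemented as a custom function whose backward multiplies the incoming gradient by -lambda; the sketch above just makes the sign flip explicit.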
https://arxiv.org/abs/2602.20805
Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
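The contrast between the two metric families can be sketched with simplified stand-ins: a word n-gram precision (BLEU-like) and a character n-gram F-score (ChrF-like). These are toy illustrations, not the official sacreBLEU implementations, and the diacritic-style example pair is invented for the sketch.

```python
from collections import Counter

def ngrams(seq, n):
    """Multiset of n-grams over a sequence (words or characters)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def ngram_precision(hyp, ref, n):
    """BLEU-like clipped n-gram precision for a single order n."""
    h, r = ngrams(hyp, n), ngrams(ref, n)
    total = sum(h.values())
    return sum((h & r).values()) / total if total else 0.0

def char_fscore(hyp, ref, n=3):
    """ChrF-like character n-gram F1 (spaces removed, single order n)."""
    h, r = ngrams(hyp.replace(" ", ""), n), ngrams(ref.replace(" ", ""), n)
    overlap = sum((h & r).values())
    p = overlap / sum(h.values()) if h else 0.0
    rec = overlap / sum(r.values()) if r else 0.0
    return 2 * p * rec / (p + rec) if p + rec else 0.0

# A hypothesis differing from the reference by one matra-like spelling
# change in a single word ("raam" -> "ram"):
ref = "raam ghar gaya"
hyp = "ram ghar gaya"
word_p = ngram_precision(hyp.split(), ref.split(), 3)
# Word trigram precision collapses to zero, while the character-level
# score still credits the large shared substring.
print(word_p, round(char_fscore(hyp, ref), 3))
```

This is exactly the behavior that makes character-level scores forgiving of diacritic variation, and word-level precision a sharper probe of exact lexical choice.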
https://arxiv.org/abs/2602.17425
Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. The standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representations of the deeper Transformer layers, which often fail to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different layers of discrete- and continuous-output NMT Transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.
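One common form of angular-dispersion regularization penalizes high average pairwise cosine similarity among hidden states, so representations spread out on the hypersphere instead of converging to a single direction. The exact regularizer used in the paper may differ; this is a sketch of the general idea with toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def dispersion_penalty(vectors):
    """Mean pairwise cosine similarity of a batch of hidden states;
    high when representations collapse toward one direction."""
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

collapsed = [[1.0, 0.01], [1.0, 0.02], [1.0, 0.0]]  # near-identical directions
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]      # well-dispersed directions
assert dispersion_penalty(collapsed) > dispersion_penalty(spread)
```

Added to the training loss with a small weight, such a term directly discourages the trivial "all vectors equal" solution noted above for continuous-output NMT.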
https://arxiv.org/abs/2602.17287
Due to the scarcity of part-of-speech annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, the word-alignment-based POS tag projection method transfers POS tags from a high-resource source language to a low-resource target language based on parallel corpora, making it particularly suitable for low-resource settings. However, this approach relies heavily on parallel corpora, which are often unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual part-of-speech (POS) tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. The UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. Then, we train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, enabling the training of a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish, and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese, and Turkish). Experimental results show that our method achieves performance comparable to the baseline cross-lingual POS tagger trained on parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our proposed multi-source projection technique further boosts performance, yielding an average improvement of 1.3% over previous methods.
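The projection step itself is simple: carry each source tag across a word alignment to the target side, then calibrate across several source languages. The calibration rule is not specified in the abstract; the majority vote below is one plausible instantiation, and the sentences, alignments, and tags are toy assumptions.

```python
from collections import Counter

def project_tags(src_tags, alignment, tgt_len):
    """Project tags through word alignments.
    alignment: list of (src_index, tgt_index) pairs."""
    projected = [None] * tgt_len
    for s, t in alignment:
        projected[t] = src_tags[s]
    return projected

def multi_source_vote(projections):
    """Calibrate: for each target position, keep the majority tag
    across projections from multiple source languages."""
    calibrated = []
    for tags in zip(*projections):
        votes = Counter(t for t in tags if t is not None)
        calibrated.append(votes.most_common(1)[0][0] if votes else None)
    return calibrated

# Three source languages project onto a 3-token target sentence;
# the second source disagrees on one position and is outvoted.
p_en = project_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)
p_de = project_tags(["DET", "ADJ", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)
p_fr = project_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)
print(multi_source_vote([p_en, p_de, p_fr]))  # ['DET', 'NOUN', 'VERB']
```

In the full framework the pseudo-parallel pairs come from UNMT output, and the resulting projected tags serve as (noisy) supervision for training the target-language tagger.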
https://arxiv.org/abs/2602.09366
Figures of Speech (FoS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between the two types of FoS in the dataset, achieving approximately 92% accuracy. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
https://arxiv.org/abs/2602.09866
The financial domain poses substantial challenges for vision-language models (VLMs) due to specialized chart formats and knowledge-intensive reasoning requirements. However, existing financial benchmarks are largely single-turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks. We further design task-specific evaluation protocols, including a set-overlap scoring rule for multiple-choice questions, a weighted combination of turn-level and session-level scores for multi-turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive experimental evaluation of 22 VLMs reveals their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.
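The abstract does not spell out the set-overlap rule for multiple-choice questions; the Jaccard overlap below is one plausible instantiation, shown purely for illustration: full credit for the exact option set, partial credit for partial overlap, and an automatic penalty for over-selecting options.

```python
def set_overlap_score(predicted, gold):
    """Jaccard-style overlap between predicted and gold option sets."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

print(set_overlap_score({"A", "C"}, {"A", "C"}))       # exact match: 1.0
print(set_overlap_score({"A"}, {"A", "C"}))            # partial credit: 0.5
print(set_overlap_score({"A", "B", "C"}, {"A", "C"}))  # over-selection penalized
```

Compared with exact-match scoring, such a rule distinguishes a model that selects one correct option from one that selects none, which matters for multi-answer financial questions.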
https://arxiv.org/abs/2602.03130
This work presents EmoAra, an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, motivated by banking customer service where emotional context affects service quality. EmoAra integrates Speech Emotion Recognition, Automatic Speech Recognition, Machine Translation, and Text-to-Speech to process English speech and deliver an Arabic spoken output while retaining emotional nuance. The system uses a CNN-based emotion classifier, Whisper for English transcription, a fine-tuned MarianMT model for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis. Experiments report an F1-score of 94% for emotion classification, translation performance of BLEU 56 and BERTScore F1 88.7%, and an average human evaluation score of 81% on banking-domain translations. The implementation and resources are available at the accompanying GitHub repository.
https://arxiv.org/abs/2602.01170
The Bahnar people, an ethnic minority in Vietnam with a rich ancestral heritage, possess a language of immense cultural and historical significance. The government places a strong emphasis on preserving and promoting the Bahnaric language by making it accessible online and encouraging communication across generations. Recent advancements in artificial intelligence, such as Neural Machine Translation (NMT), have transformed translation by improving accuracy and fluency, which in turn contributes to the revival of the language through education, communication, and documentation. Specifically, NMT is pivotal in enhancing accessibility for Bahnaric speakers, making information and content more readily available. Nevertheless, translating Vietnamese into Bahnaric faces practical challenges due to resource constraints, especially given the limited resources available for the Bahnaric language. To address this, we employ state-of-the-art NMT techniques along with two augmentation strategies for the domain-specific Vietnamese-Bahnaric translation task. Importantly, both approaches are flexible and can be used with various neural machine translation models. Additionally, they do not require complex data preprocessing steps, the training of additional systems, or the acquisition of extra data beyond the existing training parallel corpora.
https://arxiv.org/abs/2601.19124
Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
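The retrieval step of the hybrid pipeline can be sketched as: find the k in-domain (New Testament) sentence pairs most similar to the source sentence, then pack them, together with the NMT draft, into a refinement prompt for the LLM. The token-overlap similarity, placeholder target strings, and prompt wording below are illustrative assumptions, not the paper's actual retrieval algorithm or data.

```python
def token_overlap(a, b):
    """Similarity proxy: number of shared word types."""
    return len(set(a.split()) & set(b.split()))

def retrieve(source, memory, k=2):
    """Return the k most similar (src, tgt) pairs from the in-domain
    translation memory."""
    return sorted(memory, key=lambda p: token_overlap(source, p[0]),
                  reverse=True)[:k]

def build_prompt(source, draft, examples):
    """Few-shot refinement prompt: retrieved examples, then the draft."""
    shots = "\n".join(f"SRC: {s}\nTGT: {t}" for s, t in examples)
    return f"{shots}\nSRC: {source}\nDRAFT: {draft}\nRefine the draft:"

memory = [("the beginning of days", "TGT-1"),
          ("the great flood", "TGT-2"),
          ("a new song", "TGT-3")]
examples = retrieve("the beginning of the flood", memory, k=2)
print(build_prompt("the beginning of the flood", "DRAFT-1", examples))
```

The paper's finding that the *number* of retrieved examples matters more than the retrieval algorithm suggests the exact similarity function here is secondary to choosing k well.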
https://arxiv.org/abs/2601.09982
Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.
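Noise-aware filtering of synthetic pairs typically drops candidates whose source/target length ratio is implausible or whose target merely copies the source. The thresholds and placeholder sentence pairs below are illustrative assumptions, not the paper's actual settings.

```python
def keep_pair(src, tgt, min_ratio=0.5, max_ratio=2.0):
    """Keep a synthetic pair only if it is non-empty, not a verbatim
    source copy, and within a plausible token-length ratio."""
    if not src or not tgt or src == tgt:
        return False
    ratio = len(tgt.split()) / len(src.split())
    return min_ratio <= ratio <= max_ratio

pairs = [
    ("src one two", "tgt uno dos"),  # plausible length ratio: kept
    ("same text", "same text"),      # verbatim source copy: dropped
    ("short", "tok " * 5),           # extreme length outlier: dropped
]
print([keep_pair(s, t) for s, t in pairs])  # [True, False, False]
```

For highly agglutinative languages such as Aymara, a single target word can correspond to several source words, which is one reason generic length-ratio thresholds like these can misfire, as the diagnostic experiments above suggest.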
https://arxiv.org/abs/2601.03135
Word meaning, representation, and interpretation play fundamental roles in natural language understanding (NLU), natural language processing (NLP), and natural language generation (NLG) tasks. Many of the inherent difficulties in these tasks stem from multi-word expressions (MWEs), which complicate them by introducing ambiguity, idiomatic expressions, infrequent usage, and a wide range of variations. Significant effort and substantial progress have been made in addressing the challenging nature of MWEs in Western languages, particularly English, attributable in part to well-established research communities and abundant computational resources. However, the same level of progress has not been reached for language families such as Chinese and closely related Asian languages, which continue to lag behind in this regard. While sub-word modelling has been successfully applied to many Western languages to address rare words, improve phrase comprehension, and enhance machine translation (MT) through techniques like byte-pair encoding (BPE), it cannot be applied directly to ideographic scripts like Chinese. In this work, we conduct a systematic study of Chinese character decomposition in the context of MWE-aware neural machine translation (NMT). Furthermore, we report experiments examining how Chinese character decomposition contributes to representing the original meanings of Chinese words and characters, and how it can effectively address the challenges of translating MWEs.
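The decomposition idea can be sketched as replacing a Chinese character with a sequence of sub-character components, analogous to what BPE does for alphabetic scripts. The two-entry mapping below is for illustration only; real systems draw on full ideograph decomposition databases, and the choice of decomposition level (radical, component, stroke) is a design decision the paper studies.

```python
# Tiny illustrative decomposition table (both entries are standard
# component analyses of the characters):
DECOMP = {
    "好": ["女", "子"],  # "good" = woman + child
    "明": ["日", "月"],  # "bright" = sun + moon
}

def decompose(text, table=DECOMP):
    """Replace each character with its component sequence; characters
    without an entry pass through unchanged."""
    out = []
    for ch in text:
        out.extend(table.get(ch, [ch]))
    return out

print(decompose("明天好"))  # ['日', '月', '天', '女', '子']
```

The decomposed sequence then feeds the NMT model in place of (or alongside) raw characters, giving rare characters shared sub-units much as BPE gives rare words shared subwords.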
https://arxiv.org/abs/2512.15556
Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.
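The gate-free mixture idea can be sketched as follows: each LoRA adapter contributes a low-rank weight update B @ A, and the effective update at inference time is a user-weighted linear combination of these deltas, computed without any retraining. Matrix sizes and the 70/30 mixing weights below are illustrative assumptions.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def combine_lora(deltas, weights):
    """deltas: list of (B, A) low-rank factor pairs; weights: user-chosen
    mixing coefficients. Returns the combined full-rank update."""
    d_out, d_in = len(deltas[0][0]), len(deltas[0][1][0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for (B, A), w in zip(deltas, weights):
        BA = matmul(B, A)
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += w * BA[i][j]
    return delta

# Two rank-1 adapters (e.g. a domain adapter and a style adapter) on a
# 2x2 weight matrix, mixed 70/30 at inference time.
domain = ([[1.0], [0.0]], [[1.0, 0.0]])  # B is 2x1, A is 1x2
style = ([[0.0], [1.0]], [[0.0, 1.0]])
print(combine_lora([domain, style], [0.7, 0.3]))  # [[0.7, 0.0], [0.0, 0.3]]
```

Because the combination is linear in the adapter weights, sliding the coefficients gives the real-time, user-controllable domain/style adjustment described above, with no gating network involved.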
https://arxiv.org/abs/2512.09910
Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality, while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation to seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset has 206,983 triplets, with each triplet consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.
https://arxiv.org/abs/2511.17153
Written-language translation has existed since the 3rd century BC, but the need for it has grown dramatically in the information age. Today, many translators based on encoder-decoder deep architectures exist; nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding: given a translator, several sentences that differ from a given pivot sentence by only one selected token yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, each built around a selected pivot token, yields the probabilities of replacing that token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens estimates the translator's overall translation entropy, which grows along the decoder blocks. This entropic measure allows quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and an objective benchmark for artificial translators. Results are based on the MarianMT, T5-Base, and NLLB-200 translators.
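The per-token estimate described above can be sketched directly: substitute the pivot token with each candidate replacement, group the variants by the translation they produce, and treat the group sizes as probabilities. The stub "translator" below, which simply collapses synonyms, is a stand-in assumption for a real NMT model such as MarianMT.

```python
import math
from collections import Counter

def token_entropy(translate, sentence, pivot_index, candidates):
    """Entropy of a pivot token: how often candidate substitutions
    preserve the translation, under a uniform prior over candidates."""
    outputs = []
    for cand in candidates:
        variant = list(sentence)
        variant[pivot_index] = cand
        outputs.append(translate(variant))
    counts = Counter(outputs)
    n = len(outputs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Stub translator that maps synonyms to one output, so three of the four
# substitutions yield an identical translation.
SYN = {"big": "groß", "large": "groß", "huge": "groß", "tiny": "klein"}
translate = lambda toks: " ".join(SYN.get(t, t) for t in toks)

val = token_entropy(translate, ("a", "big", "dog"), 1,
                    ["big", "large", "huge", "tiny"])
print(round(val, 3))  # 0.811  (= entropy of the 3/4 vs 1/4 split)
```

Averaging this quantity over all pivot tokens in an ensemble of sentences gives the translator-level entropy estimate; the two-token extension would substitute pairs of tokens and compare the joint degeneracy to the product of the single-token degeneracies.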
https://arxiv.org/abs/2511.13180