Helping deaf and hard-of-hearing people communicate more easily is the main goal of Automatic Sign Language Translation. Although most past research has focused on turning sign language into text, doing the reverse, turning spoken English into sign language animations, has been largely overlooked. That's because it involves multiple steps, such as understanding speech, translating it into sign-friendly grammar, and generating natural human motion. In this work, we introduce a complete pipeline that converts English speech into smooth, realistic 3D sign language animations. Our system starts with Whisper to translate spoken English into text. Then, we use a MarianMT machine translation model to translate that text into American Sign Language (ASL) gloss, a simplified version of sign language that captures meaning without grammar. This model performs well, reaching BLEU scores of 0.7714 and 0.8923. To make the gloss translation more accurate, we also use word embeddings such as Word2Vec and FastText to understand word meanings. Finally, we animate the translated gloss using a 3D keypoint-based motion system trained on Sign3D-WLASL, a dataset we created by extracting body, hand, and face key points from real ASL videos in the WLASL dataset. To support the gloss translation stage, we also built a new dataset called BookGlossCorpus-CG, which turns everyday English sentences from the BookCorpus dataset into ASL gloss using grammar rules. Our system stitches everything together by smoothly interpolating between signs to create natural, continuous animations. Unlike previous works like How2Sign and Phoenix-2014T that focus on recognition or use only one type of data, our pipeline brings together audio, text, and motion in a single framework that goes all the way from spoken English to lifelike 3D sign language animation.
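The end-to-end flow described above can be pictured as a short pipeline: speech recognition, text-to-gloss translation, then keypoint retrieval with interpolation between signs. The sketch below is illustrative only; the gloss-translation checkpoint name and the keypoint lookup are hypothetical placeholders, since the paper's fine-tuned MarianMT gloss model and the Sign3D-WLASL assets are not published APIs.

```python
# Illustrative sketch of the speech -> gloss -> animation pipeline described above.
# Assumptions: "asl-gloss-marianmt" is a hypothetical fine-tuned checkpoint, and
# load_sign_keypoints() stands in for a lookup into Sign3D-WLASL keypoint clips.
import numpy as np
import whisper
from transformers import MarianMTModel, MarianTokenizer

def speech_to_text(audio_path: str) -> str:
    model = whisper.load_model("base")           # speech recognition stage
    return model.transcribe(audio_path)["text"]

def text_to_gloss(text: str, ckpt: str = "asl-gloss-marianmt") -> str:
    tok = MarianTokenizer.from_pretrained(ckpt)  # hypothetical gloss checkpoint
    mt = MarianMTModel.from_pretrained(ckpt)
    ids = mt.generate(**tok(text, return_tensors="pt"))
    return tok.decode(ids[0], skip_special_tokens=True)

def interpolate(prev: np.ndarray, nxt: np.ndarray, steps: int = 8) -> np.ndarray:
    """Linear interpolation between the last pose of one sign and the first of the next."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None, None]
    return (1 - alphas) * prev[None] + alphas * nxt[None]

def gloss_to_animation(gloss: str, load_sign_keypoints) -> np.ndarray:
    clips = [load_sign_keypoints(g) for g in gloss.split()]  # each: (frames, joints, 3)
    frames = [clips[0]]
    for prev, nxt in zip(clips, clips[1:]):
        frames.append(interpolate(prev[-1], nxt[0]))         # smooth sign-to-sign transition
        frames.append(nxt)
    return np.concatenate(frames, axis=0)
```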
https://arxiv.org/abs/2507.06530
In real world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.
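As a rough illustration of the data side of this setup, the snippet below builds preference pairs in the (prompt, chosen, rejected) layout used by common DPO-style preference-optimization trainers, treating the human term post-edit as the preferred output. The field names and the purely sequence-level framing are assumptions; the paper additionally uses term-specific objectives that this sketch does not reproduce.

```python
# Sketch: turn terminology post-edits into preference pairs for preference optimization.
from dataclasses import dataclass

@dataclass
class PostEdit:
    source: str          # English source sentence
    mt_output: str       # MT hypothesis containing a valid-but-wrong term
    post_edited: str     # human correction with the preferred term

def build_preference_pairs(post_edits):
    pairs = []
    for pe in post_edits:
        if pe.mt_output.strip() == pe.post_edited.strip():
            continue                     # no terminology correction, nothing to prefer
        pairs.append({
            "prompt": pe.source,         # translation request
            "chosen": pe.post_edited,    # preferred: human-corrected terminology
            "rejected": pe.mt_output,    # dispreferred: valid term, wrong for this client
        })
    return pairs

example = PostEdit(
    source="Press the power switch.",
    mt_output="Drücken Sie den Netzschalter.",
    post_edited="Drücken Sie den Hauptschalter.",   # hypothetical style-guide preference
)
print(build_preference_pairs([example]))
```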
https://arxiv.org/abs/2507.03580
Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin. Our code and implementation details will be made publicly available at this https URL
https://arxiv.org/abs/2506.22591
This study explores Machine Translationese (MTese) -- the linguistic peculiarities of machine translation outputs -- focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using more brackets. Additionally, translation-specific LLMs show lower lexical diversity but higher usage of causal conjunctions compared to generic LLMs. Lastly, we find no significant differences between LLMs developed by Chinese firms and their foreign counterparts.
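The feature-selection step can be reproduced with standard tooling; below is a minimal sketch using scikit-learn's chi-square scorer. The feature names and toy counts are invented stand-ins for the paper's five-layer feature set.

```python
# Minimal sketch of chi-square feature ranking for the MTese classification task.
# X holds non-negative linguistic feature counts (rows = texts), y holds labels
# such as 0 = original Chinese, 1 = machine-translated output.
import numpy as np
from sklearn.feature_selection import chi2

feature_names = ["sent_length", "adversative_conj", "brackets", "type_token_ratio_x100"]
X = np.array([[22, 1, 0, 48],
              [15, 3, 2, 41],
              [14, 4, 3, 39],
              [23, 1, 0, 50]])
y = np.array([0, 1, 1, 0])

scores, p_values = chi2(X, y)                      # chi-square statistic per feature
ranking = sorted(zip(feature_names, scores), key=lambda t: -t[1])
for name, score in ranking:
    print(f"{name:>24}: chi2 = {score:.2f}")
```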
https://arxiv.org/abs/2506.22050
This study focuses on evaluating the performance of machine translations (MTs) compared to human translations (HTs) in English-to-Chinese children's literature translation (CLT) from a stylometric perspective. The research constructs a Peter Pan corpus, comprising 21 translations: 7 human translations (HTs), 7 large language model translations (LLMs), and 7 neural machine translation outputs (NMTs). The analysis employs a generic feature set (including lexical, syntactic, readability, and n-gram features) and a creative text translation (CTT-specific) feature set, which captures repetition, rhythm, translatability, and miscellaneous levels, yielding 447 linguistic features in total. Using classification and clustering techniques in machine learning, we conduct a stylometric analysis of these translations. Results reveal that in generic features, HTs and MTs exhibit significant differences in conjunction word distributions and the ratio of 1-word-gram-YiYang, while NMTs and LLMs show significant variation in descriptive word usage and adverb ratios. Regarding CTT-specific features, LLMs outperform NMTs in distribution, aligning more closely with HTs in stylistic characteristics, demonstrating the potential of LLMs in CLT.
https://arxiv.org/abs/2506.22038
We show how several graph problems (e.g., vertex-cover, independent-set, $k$-coloring) can be encoded into CNF using only $O(|V|^2 / \lg |V|)$ many clauses, as opposed to the $\Omega(|V|^2)$ constraints used by standard encodings. This somewhat surprising result is a simple consequence of a result of Erdős, Chung, and Spencer (1983) about biclique coverings of graphs, and opens theoretical avenues to understand the success of "Bounded Variable Addition" (Manthey, Heule, and Biere, 2012) as a preprocessing tool. Finally, we show a novel encoding for independent sets in some dense interval graphs using only $O(|V| \lg |V|)$ clauses (the direct encoding uses $\Omega(|V|^2)$), which we have successfully applied to a string-compression encoding posed by Bannai et al. (2022). As a direct byproduct, we obtain a reduction in the encoding size of a scheduling problem posed by Mayank and Modal (2020) from $O(NMT^2)$ to $O(NMT + M T^2 \lg T)$, where $N$ is the number of tasks, $T$ the total timespan, and $M$ the number of machines.
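To make the clause-saving idea concrete, the sketch below encodes the independent-set constraints of a complete bipartite subgraph (biclique) with one auxiliary variable, using |A| + |B| clauses instead of |A| x |B|. This only illustrates the biclique-covering trick; it is not the paper's full construction.

```python
# Sketch: encoding "no edge of a biclique has both endpoints selected" (independent set).
# Direct encoding: one clause (-x_u OR -x_v) per edge, i.e., |A|*|B| clauses.
# Biclique encoding: one fresh variable z with |A| + |B| clauses:
#   (-x_u OR z) for u in A   and   (-z OR -x_v) for v in B,
# so selecting any u in A forces z, which in turn forbids selecting any v in B.

def direct_clauses(A, B):
    return [(-u, -v) for u in A for v in B]

def biclique_clauses(A, B, z):
    return [(-u, z) for u in A] + [(-z, -v) for v in B]

A = list(range(1, 51))          # variables 1..50
B = list(range(51, 101))        # variables 51..100
z = 101                         # auxiliary variable

print(len(direct_clauses(A, B)))        # 2500 clauses
print(len(biclique_clauses(A, B, z)))   # 100 clauses
```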
https://arxiv.org/abs/2506.14042
Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often suffer from ambiguous sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end, we create ConECT -- a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata, consisting of 11,400 sentence pairs. We then investigate and compare different methods applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path or image descriptions. The results of our study demonstrate that incorporating contextual information improves the quality of machine translation. We make the new dataset publicly available.
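One of the text-to-text variants studied here, feeding the product's category path (and optionally an image description) as textual context, amounts to simple input augmentation; the tag format below is an assumption, not the dataset's actual scheme.

```python
# Sketch: prepend product metadata as context for a text-to-text translation model.
# The separator tokens are illustrative; any scheme the model is trained with works.
def build_contextual_source(sentence: str, category_path: list[str],
                            image_caption: str | None = None) -> str:
    parts = ["<cat> " + " > ".join(category_path)]
    if image_caption:
        parts.append("<img> " + image_caption)
    parts.append("<src> " + sentence)
    return " ".join(parts)

print(build_contextual_source(
    "Vodotěsné pouzdro na telefon",
    ["Elektronika", "Příslušenství", "Pouzdra"],
    image_caption="transparent waterproof phone case",
))
```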
https://arxiv.org/abs/2506.04929
Understanding robustness is essential for building reliable NLP systems. Unfortunately, in the context of machine translation, previous work mainly focused on documenting robustness failures or improving robustness. In contrast, we study robustness from a model representation perspective by looking at internal model representations of ungrammatical inputs and how they evolve through model layers. For this purpose, we perform Grammatical Error Detection (GED) probing and representational similarity analysis. Our findings indicate that the encoder first detects the grammatical error, then corrects it by moving its representation toward the correct form. To understand what contributes to this process, we turn to the attention mechanism where we identify what we term Robustness Heads. We find that Robustness Heads attend to interpretable linguistic units when responding to grammatical errors, and that when we fine-tune models for robustness, they tend to rely more on Robustness Heads for updating the ungrammatical word representation.
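The probing step can be illustrated with a linear classifier over per-token encoder states: for each layer, train a probe to predict whether a token carries an injected grammatical error and track accuracy across layers. The arrays below are random placeholders standing in for real extracted hidden states and GED labels.

```python
# Sketch: Grammatical Error Detection (GED) probing on encoder hidden states.
# hidden[l] is an (n_tokens, d_model) array of layer-l representations, and
# labels marks which tokens carry an injected grammatical error (1) or not (0).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 400, 64, 6
hidden = [rng.normal(size=(n_tokens, d_model)) for _ in range(n_layers)]  # placeholder states
labels = rng.integers(0, 2, size=n_tokens)                                # placeholder GED labels

for layer, states in enumerate(hidden):
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, states, labels, cv=5).mean()
    print(f"layer {layer}: GED probing accuracy = {acc:.3f}")
```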
https://arxiv.org/abs/2505.21224
The majority of inhabitants in Hong Kong are able to read and write in standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is on preparing a sufficient amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance in all test sets. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU scores. Translation examples reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.
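The mining step can be approximated with any multilingual sentence encoder by scoring cross-article sentence pairs with cosine similarity and keeping pairs above a threshold. The encoder name and threshold below are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: mine near-parallel sentence pairs from comparable Chinese / Cantonese articles.
# Assumes a multilingual sentence encoder; "sentence-transformers/LaBSE" is one common
# choice, though the paper's own similarity model may differ.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(zh_sentences, yue_sentences, threshold=0.85):
    zh_emb = encoder.encode(zh_sentences, normalize_embeddings=True)
    yue_emb = encoder.encode(yue_sentences, normalize_embeddings=True)
    sims = zh_emb @ yue_emb.T                      # cosine similarity (embeddings normalized)
    pairs = []
    for i, j in zip(*np.where(sims >= threshold)):
        pairs.append((zh_sentences[i], yue_sentences[j], float(sims[i, j])))
    return pairs
```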
https://arxiv.org/abs/2505.17816
Neural Machine Translation (NMT) systems face significant challenges when working with low-resource languages, particularly in domain adaptation tasks. These difficulties arise from limited training data and suboptimal model generalization. As a result, selecting an optimal model for translation is crucial for achieving strong performance on in-domain data, particularly in scenarios where fine-tuning is not feasible or practical. In this paper, we investigate strategies for selecting the most suitable NMT model for a given domain using bandit-based algorithms, including Upper Confidence Bound, Linear UCB, Neural Linear Bandit, and Thompson Sampling. Our method effectively addresses the resource constraints by facilitating optimal model selection with high confidence. We evaluate the approach across three African languages and domains, demonstrating its robustness and effectiveness both when target data is available and when it is absent.
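The bandit view of model selection is easy to make concrete; below is a minimal UCB1 loop over candidate NMT models, where `evaluate_on_domain_sample` is a hypothetical reward function (e.g., a quality score on a small in-domain batch). The paper's other bandit variants follow the same pattern with different arm-selection rules.

```python
# Sketch: Upper Confidence Bound (UCB1) selection over candidate NMT models.
import math
import random

def ucb_select(models, evaluate_on_domain_sample, rounds=200, c=2.0):
    counts = [0] * len(models)
    means = [0.0] * len(models)
    for t in range(1, rounds + 1):
        if t <= len(models):
            arm = t - 1                                    # pull every arm once first
        else:
            arm = max(range(len(models)),
                      key=lambda i: means[i] + c * math.sqrt(math.log(t) / counts[i]))
        reward = evaluate_on_domain_sample(models[arm])    # hypothetical in-domain score
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
    return models[max(range(len(models)), key=lambda i: means[i])]

# Toy usage: three "models" with different hidden in-domain quality.
quality = {"nmt-general": 0.62, "nmt-medical": 0.74, "nmt-news": 0.68}
best = ucb_select(list(quality), lambda m: quality[m] + random.gauss(0, 0.05))
print("selected:", best)
```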
https://arxiv.org/abs/2505.15069
The sparse Mixture-of-Experts (MoE) has achieved significant progress for neural machine translation (NMT). However, current MoE solutions have two limitations that may lead to sub-optimal performance: 1) they directly inject NMT task knowledge (\emph{e.g.}, domain/linguistics-specific knowledge) into the MoE, even though such knowledge is generally unavailable in practical applications, and they neglect the naturally grouped domain/linguistic properties; 2) expert selection depends only on the localized token representation, without considering the context that would capture the state of each token from a global view. To address the above limitations, we propose THOR-MoE, which arms the MoE with hierarchical task-guided and context-responsive routing policies. Specifically, it 1) first predicts the domain/language label and then extracts a mixed domain/language representation to allocate task-level experts in a hierarchical manner; and 2) injects context information to enhance token routing within the pre-selected set of task-level experts, which helps each token to be routed accurately to more specialized and suitable experts. Extensive experiments on multi-domain translation and multilingual translation benchmarks with different architectures consistently demonstrate the superior performance of THOR-MoE. Additionally, THOR-MoE operates as a plug-and-play module compatible with existing Top-$k$~\cite{shazeer2017} and Top-$p$~\cite{huang-etal-2024-harder} routing schemes, ensuring broad applicability across diverse MoE architectures. For instance, compared with vanilla Top-$p$~\cite{huang-etal-2024-harder} routing, the context-aware variant achieves an average improvement of 0.75 BLEU with less than 22\% activated parameters on multi-domain translation tasks.
https://arxiv.org/abs/2505.14173
Large language models (LLMs) show promising performance on a variety of downstream tasks, such as machine translation (MT). However, using LLMs for translation suffers from high computational costs and significant latency. Based on our evaluation, in most cases, translations produced by LLMs are comparable to those generated by neural machine translation (NMT) systems; only in particular scenarios do LLM and NMT models show their respective advantages. As a result, integrating NMT and LLM for translation and using the LLM only when necessary seems to be a sound solution. A scheduling policy is therefore required that optimizes translation quality while ensuring fast speed and as little LLM usage as possible. We compare several scheduling policies and propose a novel and straightforward decider that leverages source sentence features. We conduct extensive experiments on multilingual test sets, and the results show that we can achieve optimal translation performance with minimal LLM usage, demonstrating the effectiveness of our decider.
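A decider of the kind described, one that inspects source-sentence features and routes only hard inputs to the LLM, can be sketched as a simple feature-based gate. The features and thresholds below are illustrative assumptions; the paper learns its own decider.

```python
# Sketch: route a source sentence to the NMT system or the LLM based on cheap features.
def source_features(sentence: str, nmt_vocab: set[str]) -> dict:
    tokens = sentence.split()
    oov = sum(t.lower() not in nmt_vocab for t in tokens)
    return {
        "length": len(tokens),
        "oov_ratio": oov / max(len(tokens), 1),
        "has_digits": any(ch.isdigit() for ch in sentence),
    }

def decide_engine(sentence: str, nmt_vocab: set[str]) -> str:
    f = source_features(sentence, nmt_vocab)
    hard = f["length"] > 40 or f["oov_ratio"] > 0.15      # heuristic "hard input" gate
    return "llm" if hard else "nmt"

vocab = {"the", "cat", "sat", "on", "mat"}
print(decide_engine("the cat sat on the mat", vocab))              # -> nmt
print(decide_engine("the quokka perched on the ottoman", vocab))   # -> llm
```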
https://arxiv.org/abs/2505.13554
While gender bias in modern Neural Machine Translation (NMT) systems has received much attention, traditional evaluation metrics do not fully capture the extent to which these systems integrate contextual gender cues. We propose a novel evaluation metric called Minimal Pair Accuracy (MPA), which measures the reliance of models on gender cues for gender disambiguation. MPA is designed to go beyond surface-level gender accuracy metrics by focusing on whether models adapt to gender cues in minimal pairs -- sentence pairs that differ solely in the gendered pronoun, namely the explicit indicator of the target entity's gender in the source language (EN). We evaluate a number of NMT models on the English-Italian (EN--IT) language pair using this metric and show that, in most cases, they ignore available gender cues in favor of a (statistical) stereotypical gender interpretation. We further show that in anti-stereotypical cases, these models tend to take masculine gender cues into account more consistently while ignoring the feminine cues. Furthermore, we analyze the attention head weights in the encoder component and show that while all models encode gender information to some extent, masculine cues elicit a more diffused response compared to the more concentrated and specialized responses to feminine gender cues.
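A schematic of the MPA computation is shown below: for each minimal pair, the translations of the masculine-cue and feminine-cue variants are checked against the cued gender, and the metric is the fraction of pairs where the model adapts to both cues. The `translate` and `target_gender` helpers are hypothetical stand-ins for the NMT system and a target-side gender detector.

```python
# Sketch of Minimal Pair Accuracy (MPA): does the model adapt to the gender cue in
# minimal pairs that differ only in the gendered pronoun?
def minimal_pair_accuracy(pairs, translate, target_gender):
    """pairs: list of dicts with 'masc_src', 'fem_src', and 'entity' keys."""
    adapted = 0
    for p in pairs:
        masc_ok = target_gender(translate(p["masc_src"]), p["entity"]) == "masc"
        fem_ok = target_gender(translate(p["fem_src"]), p["entity"]) == "fem"
        adapted += masc_ok and fem_ok        # credit only if both cues are respected
    return adapted / len(pairs)

example_pair = {
    "masc_src": "The engineer finished his report.",
    "fem_src": "The engineer finished her report.",
    "entity": "engineer",
}
```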
https://arxiv.org/abs/2505.08546
Current machine translation models provide us with high-quality outputs in most scenarios. However, they still face some specific problems, such as detecting which entities should not be changed during translation. In this paper, we explore the abilities of popular NMT models, including models from the OPUS project, Google Translate, MADLAD, and EuroLLM, to preserve entities such as URL addresses, IBAN numbers, or emails when producing translations between four languages: English, German, Polish, and Ukrainian. We investigate the quality of popular NMT models in terms of accuracy, discuss errors made by the models, and examine the reasons for errors. Our analysis highlights specific categories, such as emojis, that pose significant challenges for many models considered. In addition to the analysis, we propose a new multilingual synthetic dataset of 36,000 sentences that can help assess the quality of entity transfer across nine categories and four aforementioned languages.
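The kind of check used for this evaluation can be sketched with simple pattern matching: extract entities from the source, then verify that each one surfaces unchanged in the hypothesis. The regexes below are deliberately simplified (a real IBAN or URL matcher needs more care).

```python
# Sketch: verify that URLs, emails, and IBAN-like strings survive translation unchanged.
import re

PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
}

def entity_preservation(source: str, hypothesis: str) -> dict:
    report = {}
    for name, pattern in PATTERNS.items():
        entities = pattern.findall(source)
        if entities:
            kept = sum(e in hypothesis for e in entities)
            report[name] = kept / len(entities)            # per-category accuracy
    return report

src = "Pay to DE89370400440532013000, visit https://example.com or write to help@example.com"
hyp = "Zahlen Sie an DE89370400440532013000, besuchen Sie https://example.com oder schreiben Sie an help@example.com"
print(entity_preservation(src, hyp))
```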
https://arxiv.org/abs/2505.06010
Most existing multimodal machine translation (MMT) datasets are predominantly composed of static images or short video clips, lacking extensive video data across diverse domains and topics. As a result, they fail to meet the demands of real-world MMT tasks, such as documentary translation. In this study, we developed TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, aiming to advance research in this field. We collected video-subtitle pairs from documentaries and categorized them into eight topics, such as economy and nature, to facilitate research on domain adaptation in video-guided MMT. Additionally, we preserved their contextual information to support research on leveraging the global context of documentaries in video-guided MMT. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module. Extensive experiments on the TopicVD dataset demonstrate that visual information consistently improves the performance of the NMT model in documentary translation. However, the MMT model's performance significantly declines in out-of-domain scenarios, highlighting the need for effective domain adaptation methods. Additionally, experiments demonstrate that global context can effectively improve translation performance. Dataset and our implementations are available at this https URL
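The cross-modal bidirectional attention idea can be sketched in a few lines of PyTorch: one attention direction lets text tokens attend to video frames and the other lets video frames attend to text, after which the two streams are fused. Dimensions, pooling, and the concatenation-based fusion are assumptions, not the paper's exact module.

```python
# Sketch of a cross-modal bidirectional attention block (text <-> video).
import torch
import torch.nn as nn

class CrossModalBiAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # text: (batch, n_tokens, d_model), video: (batch, n_frames, d_model)
        t_attended, _ = self.text_to_video(query=text, key=video, value=video)
        v_attended, _ = self.video_to_text(query=video, key=text, value=text)
        v_pooled = v_attended.mean(dim=1, keepdim=True).expand(-1, text.size(1), -1)
        return self.fuse(torch.cat([t_attended, v_pooled], dim=-1))  # fused text states

block = CrossModalBiAttention()
out = block(torch.randn(2, 20, 512), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 20, 512])
```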
https://arxiv.org/abs/2505.05714
In this paper, we explore the application of Back translation (BT) as a semi-supervised technique to enhance Neural Machine Translation (NMT) models for the English-Luganda language pair, specifically addressing the challenges faced by low-resource languages. The purpose of our study is to demonstrate how BT can mitigate the scarcity of bilingual data by generating synthetic data from monolingual corpora. Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying Iterative and Incremental Back translation techniques. We strategically select datasets for incremental back translation across multiple small datasets, which is a novel element of our approach. The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units across all translation directions. Additionally, our evaluation incorporates comprehensive assessment metrics such as SacreBLEU, ChrF2, and TER, providing a nuanced understanding of translation quality. The conclusion drawn from our research confirms the efficacy of BT when strategically curated datasets are utilized, establishing new performance benchmarks and demonstrating the potential of BT in enhancing NMT models for low-resource languages.
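The iterative, incremental loop described above can be summarized in a short sketch, where `train`, `translate_corpus`, and the dataset splits are hypothetical helpers rather than the authors' code: each round back-translates one more small monolingual Luganda batch and retrains on the growing mix of real and synthetic pairs.

```python
# Sketch: iterative back-translation with incremental monolingual batches.
def data_swap(pairs):
    return [(tgt, src) for src, tgt in pairs]

def iterative_back_translation(parallel, mono_lg_batches, train, translate_corpus, rounds=3):
    fwd = train(parallel, direction="en->lg")            # English -> Luganda
    bwd = train(data_swap(parallel), direction="lg->en") # Luganda -> English
    data = list(parallel)                                 # (en, lg) pairs
    for r in range(rounds):
        mono_lg = mono_lg_batches[r % len(mono_lg_batches)]    # next small batch
        synthetic_en = translate_corpus(bwd, mono_lg)          # back-translate to English
        data += list(zip(synthetic_en, mono_lg))               # add synthetic (en, lg) pairs
        fwd = train(data, direction="en->lg")                  # retrain forward model
        bwd = train(data_swap(data), direction="lg->en")       # and the reverse model
    return fwd, bwd
```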
https://arxiv.org/abs/2505.02463
Large language models (LLMs) and multi-agent orchestration are touted as the next leap in machine translation (MT), but their benefits relative to conventional neural MT (NMT) remain unclear. This paper offers an empirical reality check. We benchmark five paradigms, namely Google Translate (a strong NMT baseline), GPT-4o (a general-purpose LLM), o1-preview (a reasoning-enhanced LLM), and two GPT-4o-powered agentic workflows (sequential three-stage and iterative refinement), on test data drawn from a legal contract and news prose in three English-source pairs: Spanish, Catalan and Turkish. Automatic evaluation is performed with COMET, BLEU, chrF2 and TER; human evaluation is conducted with expert ratings of adequacy and fluency; efficiency is measured by total input-plus-output token counts mapped to April 2025 pricing. Automatic scores still favour the mature NMT system, which ranks first in seven of twelve metric-language combinations; o1-preview ties or places second in most remaining cases, while both multi-agent workflows trail. Human evaluation reverses part of this narrative: o1-preview produces the most adequate and fluent output in five of six comparisons, and the iterative agent edges ahead once, indicating that reasoning layers capture semantic nuance undervalued by surface metrics. Yet these qualitative gains carry steep costs. The sequential agent consumes roughly five times, and the iterative agent fifteen times, the tokens used by NMT or single-pass LLMs. We advocate multidimensional, cost-aware evaluation protocols and highlight research directions that could tip the balance: leaner coordination strategies, selective agent activation, and hybrid pipelines combining single-pass LLMs with targeted agent intervention.
https://arxiv.org/abs/2505.01560
Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, in-domain monolingual target-side corpora are often available. This work explores ways to take advantage of such resources by retrieving relevant segments directly in the target language, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence level and word-level matching objectives. In our experiments with two RANMT architectures, we first demonstrate the benefits of such cross-lingual objectives in a controlled setting, obtaining translation performances that surpass standard TM-based models. We then showcase our method on a real-world set-up, where the target monolingual resources far exceed the amount of parallel data and observe large improvements of our new techniques, which outperform both the baseline setting, and general-purpose cross-lingual retrievers.
https://arxiv.org/abs/2504.21747
Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses -- the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation -- thereby enhancing the effectiveness of translation decoding. With our method, translation on large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations -- even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: this https URL.
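The core objective, maximizing the Pearson correlation between hypothesis likelihoods and quality scores over a candidate set, can be written down directly; the snippet below is a generic differentiable version in PyTorch and omits the paper's training specifics.

```python
# Sketch: differentiable Pearson-correlation loss between hypothesis log-likelihoods
# and translation quality scores for one source sentence's candidate list.
import torch

def pearson_calibration_loss(log_likelihoods: torch.Tensor, quality: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """Both tensors have shape (n_candidates,); minimizing this maximizes correlation."""
    ll = log_likelihoods - log_likelihoods.mean()
    q = quality - quality.mean()
    corr = (ll * q).sum() / (ll.norm() * q.norm() + eps)
    return 1.0 - corr                      # loss in [0, 2]; 0 means perfect correlation

# Toy example: likelihoods that rank candidates differently from their quality scores.
ll = torch.tensor([-3.2, -2.1, -4.0], requires_grad=True)
quality = torch.tensor([0.85, 0.60, 0.90])     # e.g., quality-estimation scores
loss = pearson_calibration_loss(ll, quality)
loss.backward()
print(float(loss), ll.grad)
```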
https://arxiv.org/abs/2504.19044
In this study, we develop Neural Machine Translation (NMT) and Transformer-based transfer learning models for English-to-Igbo translation - a low-resource African language spoken by over 40 million people across Nigeria and West Africa. Our models are trained on a curated and benchmarked dataset compiled from Bible corpora, local news, Wikipedia articles, and Common Crawl, all verified by native language experts. We leverage Recurrent Neural Network (RNN) architectures, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), enhanced with attention mechanisms to improve translation accuracy. To further enhance performance, we apply transfer learning using MarianNMT pre-trained models within the SimpleTransformers framework. Our RNN-based system achieves competitive results, closely matching existing English-Igbo benchmarks. With transfer learning, we observe a performance gain of +4.83 BLEU points, reaching an estimated translation accuracy of 70%. These findings highlight the effectiveness of combining RNNs with transfer learning to address the performance gap in low-resource language translation tasks.
https://arxiv.org/abs/2504.17252