This paper introduces DuTerm, a novel two-stage architecture for terminology-constrained machine translation. Our system combines a terminology-aware NMT model, adapted via fine-tuning on large-scale synthetic data, with a prompt-based LLM for post-editing. The LLM stage refines NMT output and enforces terminology adherence. We evaluate DuTerm on English-to-German, English-to-Spanish, and English-to-Russian with the WMT 2025 Terminology Shared Task corpus. We demonstrate that flexible, context-driven terminology handling by the LLM consistently yields higher-quality translations than strict constraint enforcement. Our results highlight a critical trade-off, revealing that LLMs work best for high-quality translation as context-driven mutators rather than generators.
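A minimal sketch of what the prompt-based post-editing stage could look like. `call_llm` is a hypothetical client function and the prompt wording is illustrative, not the authors'; the point is only that the terminology is offered to the LLM as soft, context-dependent guidance rather than a hard constraint.

```python
def build_postedit_prompt(source: str, draft: str, terms: dict) -> str:
    """Assemble a post-editing prompt that exposes the terminology softly,
    leaving the LLM free to decide when a constraint fits the context."""
    term_lines = "\n".join(f"- {src} -> {tgt}" for src, tgt in terms.items())
    return (
        "You are a translation post-editor.\n"
        f"Source (English): {source}\n"
        f"Draft translation: {draft}\n"
        "Preferred terminology (apply only where it reads naturally):\n"
        f"{term_lines}\n"
        "Return the improved translation only."
    )

def post_edit(source, draft, terms, call_llm):
    # call_llm is a placeholder for whatever chat/completion client is available
    return call_llm(build_postedit_prompt(source, draft, terms))
```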
https://arxiv.org/abs/2511.07461
The linguistic diversity of India poses significant machine translation challenges, especially for underrepresented tribal languages like Bhili, which lack high-quality linguistic resources. This paper addresses this gap by introducing the Bhili-Hindi-English Parallel Corpus (BHEPC), the first and largest parallel corpus worldwide for Bhili, comprising 110,000 meticulously curated sentences across Bhili, Hindi, and English. The corpus was created with the assistance of expert human translators. BHEPC spans critical domains such as education, administration, and news, establishing a valuable benchmark for research in low-resource machine translation. To establish a comprehensive Bhili Machine Translation benchmark, we evaluated a wide range of proprietary and open-source Multilingual Large Language Models (MLLMs) on bidirectional translation tasks between English/Hindi and Bhili. Comprehensive evaluation demonstrates that the fine-tuned NLLB-200 distilled 600M variant outperforms others, highlighting the potential of multilingual models in low-resource scenarios. Furthermore, we investigated the generative translation capabilities of multilingual LLMs on BHEPC using in-context learning, assessing performance under cross-domain generalization and quantifying distributional divergence. This work bridges a critical resource gap and promotes inclusive natural language processing technologies for low-resource and marginalized languages globally.
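For reference, the public NLLB-200 distilled 600M checkpoint named above can be loaded and queried as follows (Hindi-to-English shown; Bhili support in the paper comes from fine-tuning this model on BHEPC, not from the stock checkpoint).

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"   # public baseline, not the fine-tuned variant
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="hin_Deva")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("भीली भाषा के लिए समानांतर कोष बहुत कम हैं।", return_tensors="pt")
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```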
https://arxiv.org/abs/2511.00486
This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset, to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
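A compact sketch of how the three stages could be chained. `unet_model.detect_boxes` and `translator.translate` are hypothetical interfaces standing in for the paper's custom U-Net and Transformer; only the Tesseract call is a real library API.

```python
import pytesseract
from PIL import Image

def translate_image(image_path, unet_model, translator, src_lang="eng"):
    """Detect text regions, recognize them with Tesseract, then translate."""
    image = Image.open(image_path).convert("RGB")
    boxes = unet_model.detect_boxes(image)          # [(left, top, right, bottom), ...]
    translations = []
    for box in boxes:
        text = pytesseract.image_to_string(image.crop(box), lang=src_lang).strip()
        if text:
            translations.append(translator.translate(text))
    return translations
```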
https://arxiv.org/abs/2510.23554
Fine-tuning is widely used to tailor large language models for specific tasks such as neural machine translation (NMT). However, leveraging transfer learning is computationally expensive when fine-tuning large multilingual models with billions of parameters, thus creating a barrier to entry for researchers working on low-resource domains such as Irish translation. Parameter-efficient fine-tuning (PEFT) bridges this gap by training on a fraction of the original model parameters, with the Low-Rank Adaptation (LoRA) approach introducing small, trainable adapter layers. We introduce SemiAdapt and SemiLoRA as semi-supervised inference-efficient approaches that strengthen domain adaptation and lead to improved overall performance in NMT. We demonstrate that SemiAdapt can outperform full-domain fine-tuning, while most notably, SemiLoRA can propel PEFT methods to match or even outperform full-model fine-tuning. We further evaluate domain-by-dataset fine-tuning and demonstrate that our embedding-based inference methods perform especially well on larger and noisier corpora. All Irish translation models developed in this work are released as open resources. These methods aim to make high-quality domain adaptation and fine-tuning more accessible to researchers working with low-resource languages.
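The LoRA side of this setup maps onto the peft library roughly as below; the rank, alpha, and target modules are illustrative choices, and the embedding-based domain inference that distinguishes SemiAdapt/SemiLoRA is not shown.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
lora = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # only a small fraction of weights train
```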
https://arxiv.org/abs/2510.18725
Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.
https://arxiv.org/abs/2510.18898
While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning, BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations, including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at this https URL.
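The dual-encoder translation-ranking objective with additive margin softmax is compact enough to write out; the margin and softmax scale below are illustrative hyperparameters, not the paper's.

```python
import torch
import torch.nn.functional as F

def additive_margin_ranking_loss(src_emb, tgt_emb, margin=0.3, scale=20.0):
    """In-batch translation ranking: each source should retrieve its own
    translation; an additive margin is subtracted from true-pair scores."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    sim = src @ tgt.t()                                          # cosine similarities
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)
    labels = torch.arange(sim.size(0), device=sim.device)
    # symmetric objective: source-to-target and target-to-source ranking
    return F.cross_entropy(scale * sim, labels) + F.cross_entropy(scale * sim.t(), labels)
```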
https://arxiv.org/abs/2510.17504
Idiomatic translation remains a significant challenge in machine translation, especially for low-resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu-to-English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross-script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.
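The two string-based metrics in that suite can be computed with sacrebleu as in the sketch below; BERTScore, COMET, and XCOMET live in separate packages (bert-score, unbabel-comet), and the COMET-style metrics additionally need the source sentence.

```python
import sacrebleu

hypotheses = ["He spilled the beans about the surprise party."]
references = [["He revealed the secret about the surprise party."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```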
https://arxiv.org/abs/2510.17460
To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated in a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.
https://arxiv.org/abs/2510.14369
Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
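A minimal reranking loop for this setting, assuming a Hugging Face seq2seq translation model and a reference-free scorer `qe_score(source, hypothesis)` standing in for a learned metric such as SLIDE or an LLM judge such as GEMBA-DA.

```python
import torch

def qe_rerank(src_doc, model, tokenizer, qe_score, num_candidates=8):
    """Generate several candidate translations and keep the one the QE metric prefers."""
    inputs = tokenizer(src_doc, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,                 # sampling diversifies the candidate pool
            top_p=0.9,
            num_return_sequences=num_candidates,
            max_new_tokens=512,
        )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return max(candidates, key=lambda hyp: qe_score(src_doc, hyp))
```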
https://arxiv.org/abs/2510.08870
This paper introduces a novel approach to Dynamic Artificial Neural Networks (D-ANNs) for multi-task demand forecasting called Neuroplastic Multi-Task Network (NMT-Net). Unlike conventional methods focusing on inference-time dynamics or computational efficiency, our proposed method enables structural adaptability of the computational graph during training, inspired by neuroplasticity as seen in biological systems. Each new task triggers a dynamic network adaptation, including similarity-based task identification and selective training of candidate ANN heads, which are then assessed and integrated into the model based on their performance. We evaluated our framework using three real-world multi-task demand forecasting datasets from Kaggle. We demonstrated its superior performance and consistency, achieving lower RMSE and standard deviation compared to traditional baselines and state-of-the-art multi-task learning methods. NMT-Net offers a scalable, adaptable solution for multi-task and continual learning in time series prediction. The complete code for NMT-Net is available from our GitHub repository.
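The head-growing idea can be pictured with a small sketch. The statistical task signature, cosine threshold, and reuse of fitted heads from similar tasks are illustrative assumptions rather than the paper's exact procedure; `make_head` is any regressor factory (e.g. a scikit-learn model).

```python
import numpy as np

def task_signature(y):
    # crude fingerprint of a task's demand series (an assumption)
    return np.array([np.mean(y), np.std(y), np.median(y)])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class NeuroplasticForecaster:
    """Toy skeleton: grow one forecasting head per task, considering heads of
    similar past tasks as candidates and keeping whichever validates best."""

    def __init__(self, make_head, similarity_threshold=0.95):
        self.make_head = make_head              # factory returning a fresh regressor
        self.heads, self.signatures = {}, {}
        self.similarity_threshold = similarity_threshold

    def add_task(self, task_id, X_train, y_train, X_val, y_val):
        sig = task_signature(y_train)
        candidates = [self.make_head().fit(X_train, y_train)]   # freshly trained head
        for tid, s in self.signatures.items():
            if cosine(sig, s) >= self.similarity_threshold:
                candidates.append(self.heads[tid])               # reuse a similar task's head
        rmse = lambda h: float(np.sqrt(np.mean((h.predict(X_val) - y_val) ** 2)))
        best = min(candidates, key=rmse)                         # integrate by validation RMSE
        self.heads[task_id], self.signatures[task_id] = best, sig
        return rmse(best)
```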
https://arxiv.org/abs/2509.24495
Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at this https URL and this https URL
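Since the evaluation leans on Bonferroni correction, here is the arithmetic it refers to: with m comparisons, each test is judged at alpha/m (equivalently, p-values are multiplied by m and capped at 1).

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction over m paired significance tests
    (e.g. BLEU/chrF differences across tokenizer configurations)."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p < alpha / m for p in p_values]
    return adjusted, significant

adjusted, significant = bonferroni([0.004, 0.03, 0.20])
print(adjusted, significant)   # only the first test survives the corrected threshold
```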
https://arxiv.org/abs/2509.20209
India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce CorIL, a large-scale, high-quality annotated parallel corpus covering 11 of these languages: English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati, comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus's value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.
https://arxiv.org/abs/2509.19941
Domain-specific embedding models have shown promise for applications that require specialized semantic understanding, such as coding agents and financial retrieval systems, often achieving higher performance gains than general models. However, state-of-the-art embedding models are typically based on LLMs, which contain billions of parameters, making deployment challenging in resource-constrained environments. Model compression through pruning offers a promising solution, but existing pruning methods treat all parameters uniformly, failing to distinguish between general semantic representations and domain-specific patterns, leading to suboptimal pruning decisions. Thus, we propose GAPrune, a pruning framework that addresses this challenge by weighing domain importance while preserving the general linguistic foundation. Our method uses Fisher Information to measure importance and general-domain gradient alignment to assess parameter behavior, then combines these signals using our Domain Alignment Importance (DAI) scoring. Lower DAI scores indicate that the parameter is either less important for the domain task or creates conflicts between domain and general objectives. Experiments on two domain benchmarks, FinMTEB and ChemTEB, show that GAPrune maintains performance within 2.5% of dense models in one-shot pruning at 50% sparsity, while outperforming all baselines. With retraining in 100 steps, GAPrune achieves +4.51% improvement on FinMTEB and +1.73% on ChemTEB, demonstrating that our pruning strategy not only preserves but enhances domain-specific capabilities. Our findings demonstrate that principled pruning strategies can achieve model compression and enhanced domain specialization, providing the research community with a new approach for developing compact, domain-specialized embedding models.
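A rough per-tensor sketch of how such a score could be computed. The squared-gradient Fisher approximation and the tanh-based combination below are assumptions for illustration, not the paper's exact DAI formula.

```python
import torch

def dai_scores(domain_grads, general_grads, alpha=1.0):
    """Toy Domain Alignment Importance: combine domain importance (Fisher,
    approximated by squared domain gradients) with domain-general gradient
    alignment, down-weighting parameters whose gradients conflict."""
    fisher = domain_grads.pow(2)
    alignment = domain_grads * general_grads        # >0 cooperating, <0 conflicting
    return fisher * (1.0 + alpha * torch.tanh(alignment))

def prune_mask(dai, sparsity=0.5):
    """Keep the top-(1 - sparsity) fraction of parameters by DAI score."""
    k = max(1, int(dai.numel() * (1.0 - sparsity)))
    threshold = torch.topk(dai.flatten(), k).values.min()
    return (dai >= threshold).float()
```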
https://arxiv.org/abs/2509.10844
Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task due to the lack of parallel-aligned corpora and the abundance of unknown words involved. To address these challenges, we propose \textbf{NOOV} (for No OOV), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned data for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora and a phrase look-up table extracted from a large biomedical knowledge resource to alleviate both the unknown-word problem and the word-repeat challenge in NMT, improving the phrase generation of NMT systems. Evaluation shows that NOOV generates better translations of EHR narratives, with improvements in both accuracy and fluency.
https://arxiv.org/abs/2508.18607
The advent of neural machine translation (NMT) has revolutionized cross-lingual communication, yet preserving stylistic nuances remains a significant challenge. While existing approaches often require parallel corpora for style preservation, we introduce Babel, a novel framework that enhances stylistic fidelity in NMT using only monolingual corpora. Babel employs two key components: (1) a style detector based on contextual embeddings that identifies stylistic disparities between source and target texts, and (2) a diffusion-based style applicator that rectifies stylistic inconsistencies while maintaining semantic integrity. Our framework integrates with existing NMT systems as a post-processing module, enabling style-aware translation without requiring architectural modifications or parallel stylistic data. Extensive experiments on five diverse domains (law, literature, scientific writing, medicine, and educational content) demonstrate Babel's effectiveness: it identifies stylistic inconsistencies with 88.21% precision and improves stylistic preservation by 150% while maintaining a high semantic similarity score of 0.92. Human evaluation confirms that translations refined by Babel better preserve source text style while maintaining fluency and adequacy.
https://arxiv.org/abs/2507.13395
Helping deaf and hard-of-hearing people communicate more easily is the main goal of Automatic Sign Language Translation. Although most past research has focused on turning sign language into text, doing the reverse, turning spoken English into sign language animations, has been largely overlooked. That's because it involves multiple steps, such as understanding speech, translating it into sign-friendly grammar, and generating natural human motion. In this work, we introduce a complete pipeline that converts English speech into smooth, realistic 3D sign language animations. Our system starts with Whisper to translate spoken English into text. Then, we use a MarianMT machine translation model to translate that text into American Sign Language (ASL) gloss, a simplified version of sign language that captures meaning without grammar. This model performs well, reaching BLEU scores of 0.7714 and 0.8923. To make the gloss translation more accurate, we also use word embeddings such as Word2Vec and FastText to understand word meanings. Finally, we animate the translated gloss using a 3D keypoint-based motion system trained on Sign3D-WLASL, a dataset we created by extracting body, hand, and face key points from real ASL videos in the WLASL dataset. To support the gloss translation stage, we also built a new dataset called BookGlossCorpus-CG, which turns everyday English sentences from the BookCorpus dataset into ASL gloss using grammar rules. Our system stitches everything together by smoothly interpolating between signs to create natural, continuous animations. Unlike previous works like How2Sign and Phoenix-2014T that focus on recognition or use only one type of data, our pipeline brings together audio, text, and motion in a single framework that goes all the way from spoken English to lifelike 3D sign language animation.
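A condensed sketch of the first two stages, assuming the openai-whisper and transformers packages; the gloss model name is a placeholder for the authors' MarianMT checkpoint fine-tuned on BookGlossCorpus-CG, and the keypoint-based animation stage is only indicated in a comment.

```python
import whisper
from transformers import MarianMTModel, MarianTokenizer

# Stage 1: speech -> English text.
asr = whisper.load_model("base")
english = asr.transcribe("input_speech.wav")["text"]

# Stage 2: English text -> ASL gloss (placeholder checkpoint name).
gloss_name = "your-org/marian-en-asl-gloss"
tok = MarianTokenizer.from_pretrained(gloss_name)
mt = MarianMTModel.from_pretrained(gloss_name)
gloss_ids = mt.generate(**tok(english, return_tensors="pt"), max_new_tokens=64)
gloss = tok.batch_decode(gloss_ids, skip_special_tokens=True)[0]

# Stage 3 (not shown): map gloss tokens to Sign3D-WLASL keypoint sequences and
# interpolate between consecutive signs to render the 3D animation.
print(gloss)
```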
https://arxiv.org/abs/2507.06530
In real-world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.
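One way to turn such post-edits into a training signal is to build preference pairs in which the post-edit is the preferred output and the original MT hypothesis the rejected one. The field names below follow common DPO tooling (e.g. trl) and are an assumption about the exact setup, not the paper's code.

```python
def make_preference_pairs(records):
    """records: dicts with 'source', 'mt_output', and human 'post_edit' fields."""
    pairs = []
    for rec in records:
        if rec["mt_output"] != rec["post_edit"]:     # only keep actual corrections
            pairs.append({
                "prompt": f"Translate to German: {rec['source']}",
                "chosen": rec["post_edit"],
                "rejected": rec["mt_output"],
            })
    return pairs

pairs = make_preference_pairs([{
    "source": "Open the valve housing.",
    "mt_output": "Öffnen Sie die Ventilabdeckung.",   # valid word, wrong company term
    "post_edit": "Öffnen Sie das Ventilgehäuse.",
}])
```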
https://arxiv.org/abs/2507.03580
Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin. Our code and implementation details will be made publicly available at this https URL
https://arxiv.org/abs/2506.22591
This study explores Machine Translationese (MTese) -- the linguistic peculiarities of machine translation outputs -- focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using more brackets. Additionally, translation-specific LLMs show lower lexical diversity but higher usage of causal conjunctions compared to generic LLMs. Lastly, we find no significant differences between LLMs developed by Chinese firms and their foreign counterparts.
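The chi-square ranking step maps directly onto standard tooling; the toy feature matrix below merely stands in for the paper's five-layer feature set.

```python
import numpy as np
from sklearn.feature_selection import chi2

# X: documents x non-negative linguistic feature counts; y: class labels
# (e.g. 0 = original Chinese, 1 = NMT output, 2 = LLM output).
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(300, 50)).astype(float)
y = rng.integers(0, 3, size=300)

scores, p_values = chi2(X, y)
ranking = np.argsort(scores)[::-1]      # most class-discriminative features first
print(ranking[:10], scores[ranking[:10]])
```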
https://arxiv.org/abs/2506.22050
This study focuses on evaluating the performance of machine translations (MTs) compared to human translations (HTs) in English-to-Chinese children's literature translation (CLT) from a stylometric perspective. The research constructs a Peter Pan corpus, comprising 21 translations: 7 human translations (HTs), 7 large language model translations (LLMs), and 7 neural machine translation outputs (NMTs). The analysis employs a generic feature set (including lexical, syntactic, readability, and n-gram features) and a creative text translation (CTT-specific) feature set, which captures repetition, rhythm, translatability, and miscellaneous levels, yielding 447 linguistic features in total. Using classification and clustering techniques in machine learning, we conduct a stylometric analysis of these translations. Results reveal that in generic features, HTs and MTs exhibit significant differences in conjunction word distributions and the ratio of 1-word-gram-YiYang, while NMTs and LLMs show significant variation in descriptive words usage and adverb ratios. Regarding CTT-specific features, LLMs outperform NMTs in distribution, aligning more closely with HTs in stylistic characteristics, demonstrating the potential of LLMs in CLT.
https://arxiv.org/abs/2506.22038