Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.
https://arxiv.org/abs/2512.09910
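The calibrated linear combination of LoRA modules described above can be sketched compactly. This is a minimal pure-Python illustration under assumed matrix shapes, not the authors' implementation: each LoRA module contributes a low-rank update B·A, and user-set mixing weights blend the updates with no learned gate.

```python
def lora_delta(A, B):
    """Low-rank update ΔW = B @ A, with A of shape (r, d_in) and B of shape (d_out, r)."""
    r, d_in, d_out = len(A), len(A[0]), len(B)
    return [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d_in)]
            for i in range(d_out)]

def combine_lora(W, modules, weights):
    """Gate-free mixture of LoRA experts: W' = W + sum_i w_i * (B_i @ A_i).
    The mixing weights w_i are user-controlled, so domain and style can be
    adjusted at inference time without any retraining."""
    out = [row[:] for row in W]
    for (A, B), w in zip(modules, weights):
        delta = lora_delta(A, B)
        for i in range(len(out)):
            for j in range(len(out[0])):
                out[i][j] += w * delta[i][j]
    return out
```

Because the combination is linear, sliding a weight toward zero smoothly fades that expert's influence out of the merged weight matrix.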
Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality, while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation to seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset has 206,983 triplets, with each triplet consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.
https://arxiv.org/abs/2511.17153
Translation of written language has been practiced since the 3rd century BC; in the information age, however, its necessity has become ubiquitous. Today, many translators based on encoder-decoder deep architectures exist; nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding. Given a translator, several sentences that differ from a given pivot sentence by only one selected token yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, each built around one selected pivot token, yields the probabilities of replacing that specific token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator's overall translation entropy, which grows along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect: translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and an objective benchmark for artificial translators. Results are based on the MarianMT, T5-Base, and NLLB-200 translators.
https://arxiv.org/abs/2511.13180
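The entropy-estimation procedure can be illustrated with a toy translator. This sketch assumes, for simplicity, a uniform distribution over the replacement tokens that preserve the translation, in which case the token entropy reduces to log2 of the degeneracy count; the paper estimates the probabilities from ensemble statistics instead.

```python
import math

def token_entropy(translate, tokens, pos, vocab):
    """Entropy of the token at `pos`: find all vocabulary replacements that
    leave the translation unchanged, then take the entropy of the (here
    uniform) distribution over that degeneracy class."""
    reference = translate(tokens)
    preserving = [v for v in vocab
                  if translate(tokens[:pos] + [v] + tokens[pos + 1:]) == reference]
    p = 1.0 / len(preserving)
    return -sum(p * math.log2(p) for _ in preserving)

# Toy "translator" that ignores the final token entirely, so every
# replacement at the last position preserves the output.
toy_translate = lambda toks: " ".join(toks[:-1]).lower()
```

Averaging this quantity over all pivot positions would give the sketch's analogue of the translator-level entropy estimate.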
This paper introduces DuTerm, a novel two-stage architecture for terminology-constrained machine translation. Our system combines a terminology-aware NMT model, adapted via fine-tuning on large-scale synthetic data, with a prompt-based LLM for post-editing. The LLM stage refines NMT output and enforces terminology adherence. We evaluate DuTerm on English-to-German, English-to-Spanish, and English-to-Russian with the WMT 2025 Terminology Shared Task corpus. We demonstrate that flexible, context-driven terminology handling by the LLM consistently yields higher-quality translations than strict constraint enforcement. Our results highlight a critical trade-off, revealing that LLMs work best for high-quality translation as context-driven mutators rather than generators.
https://arxiv.org/abs/2511.07461
The linguistic diversity of India poses significant machine translation challenges, especially for underrepresented tribal languages like Bhili, which lack high-quality linguistic resources. This paper addresses the gap by introducing the Bhili-Hindi-English Parallel Corpus (BHEPC), the first and largest parallel corpus worldwide comprising 110,000 meticulously curated sentences across Bhili, Hindi, and English. The corpus was created with the assistance of expert human translators. BHEPC spans critical domains such as education, administration, and news, establishing a valuable benchmark for research in low-resource machine translation. To establish a comprehensive Bhili machine translation benchmark, we evaluated a wide range of proprietary and open-source Multilingual Large Language Models (MLLMs) on bidirectional translation tasks between English/Hindi and Bhili. Comprehensive evaluation demonstrates that the fine-tuned NLLB-200 distilled 600M variant outperforms the others, highlighting the potential of multilingual models in low-resource scenarios. Furthermore, we investigated the generative translation capabilities of multilingual LLMs on BHEPC using in-context learning, assessing performance under cross-domain generalization and quantifying distributional divergence. This work bridges a critical resource gap and promotes inclusive natural language processing technologies for low-resource and marginalized languages globally.
https://arxiv.org/abs/2511.00486
This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset, to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning five languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
https://arxiv.org/abs/2510.23554
Fine-tuning is widely used to tailor large language models for specific tasks such as neural machine translation (NMT). However, leveraging transfer learning is computationally expensive when fine-tuning large multilingual models with billions of parameters, thus creating a barrier to entry for researchers working on low-resource domains such as Irish translation. Parameter-efficient fine-tuning (PEFT) bridges this gap by training on a fraction of the original model parameters, with the Low-Rank Adaptation (LoRA) approach introducing small, trainable adapter layers. We introduce SemiAdapt and SemiLoRA as semi-supervised inference-efficient approaches that strengthen domain adaptation and lead to improved overall performance in NMT. We demonstrate that SemiAdapt can outperform full-domain fine-tuning, while most notably, SemiLoRA can propel PEFT methods to match or even outperform full-model fine-tuning. We further evaluate domain-by-dataset fine-tuning and demonstrate that our embedding-based inference methods perform especially well on larger and noisier corpora. All Irish translation models developed in this work are released as open resources. These methods aim to make high-quality domain adaptation and fine-tuning more accessible to researchers working with low-resource languages.
https://arxiv.org/abs/2510.18725
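The embedding-based inference idea can be sketched as nearest-centroid routing: embed the input, compare it to per-domain centroid embeddings, and dispatch to that domain's adapter. The cosine rule and centroid representation here are illustrative assumptions, not the paper's exact method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def route_to_domain(sentence_emb, domain_centroids):
    """Return the name of the domain whose centroid embedding is most
    similar to the input sentence embedding; the chosen domain's adapter
    (e.g. a LoRA module) would then be activated for translation."""
    return max(domain_centroids, key=lambda d: cosine(sentence_emb, domain_centroids[d]))
```

Routing at this granularity keeps inference cheap: only one adapter is active per sentence, regardless of how many domains were fine-tuned.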
Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.
https://arxiv.org/abs/2510.18898
While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning, BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations, including: masked language modeling (MLM), translation language modeling (TLM), dual-encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance, by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at this https URL.
https://arxiv.org/abs/2510.17504
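Of the ingredients listed above, the additive margin softmax for dual-encoder translation ranking is easy to state compactly. The sketch below computes the in-batch ranking loss over a precomputed similarity matrix; batch construction and the encoders themselves are omitted.

```python
import math

def additive_margin_ranking_loss(sim, margin=0.3):
    """In-batch translation-ranking loss: for source i the aligned target i
    is the positive, all other targets in the batch are negatives, and
    `margin` is subtracted from the positive logit before the softmax."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [sim[i][j] - (margin if j == i else 0.0) for j in range(n)]
        z = sum(math.exp(l) for l in logits)
        total += -math.log(math.exp(logits[i]) / z)
    return total / n
```

Subtracting the margin from the positive pair forces the model to separate true translations from in-batch negatives by at least that gap, tightening the retrieval decision boundary.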
Idiomatic translation remains a significant challenge in machine translation, especially for low resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu to English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.
https://arxiv.org/abs/2510.17460
To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated in a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.
https://arxiv.org/abs/2510.14369
Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
https://arxiv.org/abs/2510.08870
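QE reranking itself is a small loop around candidate generation. The sketch below uses a toy reference-free scorer in place of a learned metric such as SLIDE or GEMBA-DA; the toy metric is an assumption for illustration only.

```python
def qe_rerank(candidates, qe_score):
    """Score each candidate translation with a reference-free QE metric
    and return the highest-scoring one."""
    return max(candidates, key=qe_score)

# Toy QE metric (illustrative assumption): penalize unknown-token markers.
# A real system would call a learned scorer over (source, hypothesis) here.
def toy_qe(hypothesis):
    return -hypothesis.count("<unk>")
```

The runtime cost is one QE forward pass per candidate, which is why the abstract notes that overhead stays minimal given suitable models and hardware.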
This paper introduces a novel approach to Dynamic Artificial Neural Networks (D-ANNs) for multi-task demand forecasting called Neuroplastic Multi-Task Network (NMT-Net). Unlike conventional methods focusing on inference-time dynamics or computational efficiency, our proposed method enables structural adaptability of the computational graph during training, inspired by neuroplasticity as seen in biological systems. Each new task triggers a dynamic network adaptation, including similarity-based task identification and selective training of candidate ANN heads, which are then assessed and integrated into the model based on their performance. We evaluated our framework using three real-world multi-task demand forecasting datasets from Kaggle. We demonstrated its superior performance and consistency, achieving lower RMSE and standard deviation compared to traditional baselines and state-of-the-art multi-task learning methods. NMT-Net offers a scalable, adaptable solution for multi-task and continual learning in time series prediction. The complete code for NMT-Net is available from our GitHub repository.
https://arxiv.org/abs/2509.24495
Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at this https URL and this https URL
https://arxiv.org/abs/2509.20209
India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce CorIL, a large-scale, high-quality annotated parallel corpus covering 11 of these languages: English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati, comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus's value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.
https://arxiv.org/abs/2509.19941
Domain-specific embedding models have shown promise for applications that require specialized semantic understanding, such as coding agents and financial retrieval systems, often achieving higher performance gains than general models. However, state-of-the-art embedding models are typically based on LLMs, which contain billions of parameters, making deployment challenging in resource-constrained environments. Model compression through pruning offers a promising solution, but existing pruning methods treat all parameters uniformly, failing to distinguish between general semantic representations and domain-specific patterns, leading to suboptimal pruning decisions. Thus, we propose GAPrune, a pruning framework that addresses this challenge by considering both domain importance and the preservation of the general linguistic foundation. Our method uses Fisher Information to measure importance and general-domain gradient alignment to assess parameter behavior, then combines these signals using our Domain Alignment Importance (DAI) scoring. Lower DAI scores indicate that a parameter is either less important for the domain task or creates conflicts between domain and general objectives. Experiments on two domain benchmarks, FinMTEB and ChemTEB, show that GAPrune maintains performance within 2.5% of dense models in one-shot pruning at 50% sparsity, while outperforming all baselines. With retraining in 100 steps, GAPrune achieves +4.51% improvement on FinMTEB and +1.73% on ChemTEB, demonstrating that our pruning strategy not only preserves but enhances domain-specific capabilities. Our findings demonstrate that principled pruning strategies can achieve model compression and enhanced domain specialization, providing the research community with a new approach for development.
https://arxiv.org/abs/2509.10844
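The DAI scoring rule can be sketched per parameter as a diagonal-Fisher importance term scaled by a domain/general gradient-agreement term. The sign-based conflict penalty and the exact combination below are simplifying assumptions; the paper's scoring function may differ.

```python
def dai_scores(domain_grads, general_grads):
    """Per-parameter Domain Alignment Importance sketch: squared domain
    gradient (diagonal Fisher proxy) with a negative sign when the domain
    and general gradients point in conflicting directions."""
    scores = []
    for gd, gg in zip(domain_grads, general_grads):
        fisher = gd * gd
        align = 1.0 if gd * gg >= 0 else -1.0
        scores.append(fisher * align)
    return scores

def prune_mask(scores, sparsity):
    """Keep the (1 - sparsity) fraction of parameters with the highest DAI."""
    k = int(round(len(scores) * (1 - sparsity)))
    keep = set(sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k])
    return [i in keep for i in range(len(scores))]
```

Parameters with conflicting gradients get negative scores and are pruned first, which is how the framework trades raw magnitude against domain/general agreement.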
Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task, owing to the lack of a parallel-aligned corpus and the abundance of unknown words these narratives contain. To address these challenges, we propose NOOV (for No OOV), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned corpus for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora with a phrase look-up table extracted from a large biomedical knowledge resource, alleviating both the unknown-word problem and the word-repeat challenge in NMT and improving the phrase generation of NMT systems. Evaluation shows that NOOV generates better translations of EHR narratives, with improvements in both accuracy and fluency.
https://arxiv.org/abs/2508.18607
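The unknown-word handling can be sketched as a lexicon fallback: tokens outside the NMT vocabulary are resolved through the automatically learned bilingual lexicon. The dictionaries below are illustrative toys; NOOV additionally consults a biomedical phrase look-up table, which is not shown here.

```python
def resolve_oov(tokens, nmt_vocab, bilingual_lexicon):
    """Replace tokens unknown to the NMT vocabulary with entries from a
    bilingual lexicon, leaving in-vocabulary tokens untouched."""
    return [bilingual_lexicon.get(t, t) if t not in nmt_vocab else t
            for t in tokens]
```

In a full system this substitution would feed the NMT decoder (or post-process its placeholders), so rare clinical terms survive translation instead of degenerating into unknown-token markers.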
The advent of neural machine translation (NMT) has revolutionized cross-lingual communication, yet preserving stylistic nuances remains a significant challenge. While existing approaches often require parallel corpora for style preservation, we introduce Babel, a novel framework that enhances stylistic fidelity in NMT using only monolingual corpora. Babel employs two key components: (1) a style detector based on contextual embeddings that identifies stylistic disparities between source and target texts, and (2) a diffusion-based style applicator that rectifies stylistic inconsistencies while maintaining semantic integrity. Our framework integrates with existing NMT systems as a post-processing module, enabling style-aware translation without requiring architectural modifications or parallel stylistic data. Extensive experiments on five diverse domains (law, literature, scientific writing, medicine, and educational content) demonstrate Babel's effectiveness: it identifies stylistic inconsistencies with 88.21% precision and improves stylistic preservation by 150% while maintaining a high semantic similarity score of 0.92. Human evaluation confirms that translations refined by Babel better preserve source text style while maintaining fluency and adequacy.
https://arxiv.org/abs/2507.13395
Helping deaf and hard-of-hearing people communicate more easily is the main goal of Automatic Sign Language Translation. Although most past research has focused on turning sign language into text, doing the reverse, turning spoken English into sign language animations, has been largely overlooked. That's because it involves multiple steps, such as understanding speech, translating it into sign-friendly grammar, and generating natural human motion. In this work, we introduce a complete pipeline that converts English speech into smooth, realistic 3D sign language animations. Our system starts with Whisper to transcribe spoken English into text. Then, we use a MarianMT machine translation model to translate that text into American Sign Language (ASL) gloss, a simplified version of sign language that captures meaning without grammar. This model performs well, reaching BLEU scores of 0.7714 and 0.8923. To make the gloss translation more accurate, we also use word embeddings such as Word2Vec and FastText to understand word meanings. Finally, we animate the translated gloss using a 3D keypoint-based motion system trained on Sign3D-WLASL, a dataset we created by extracting body, hand, and face key points from real ASL videos in the WLASL dataset. To support the gloss translation stage, we also built a new dataset called BookGlossCorpus-CG, which turns everyday English sentences from the BookCorpus dataset into ASL gloss using grammar rules. Our system stitches everything together by smoothly interpolating between signs to create natural, continuous animations. Unlike previous works like How2Sign and Phoenix-2014T that focus on recognition or use only one type of data, our pipeline brings together audio, text, and motion in a single framework that goes all the way from spoken English to lifelike 3D sign language animation.
https://arxiv.org/abs/2507.06530
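The stitching step, bridging from the last keypoint frame of one sign to the first frame of the next, can be sketched as linear interpolation over flattened keypoint vectors. This is a simplifying assumption; the actual system may use a smoother transition curve.

```python
def interpolate_frames(pose_a, pose_b, steps):
    """Generate `steps` intermediate keypoint frames between two poses,
    excluding the endpoints, for a continuous sign-to-sign transition."""
    frames = []
    for s in range(1, steps + 1):
        t = s / (steps + 1)  # interpolation factor strictly between 0 and 1
        frames.append([(1 - t) * a + t * b for a, b in zip(pose_a, pose_b)])
    return frames
```

Inserting a handful of such frames between consecutive signs is what turns a sequence of independent sign clips into one continuous animation.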
In real world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.
https://arxiv.org/abs/2507.03580
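The preference-optimization objective, with the human term post-edit as the preferred sequence and the original MT output as the rejected one, can be sketched as a DPO-style loss over sequence log-probabilities. This is a generic sketch; the paper's exact objective, which also combines term-specific and full-sequence supervised terms, may differ.

```python
import math

def term_preference_loss(lp_edit, lp_mt, lp_edit_ref, lp_mt_ref, beta=0.1):
    """-log sigmoid(beta * margin), where the margin compares how much the
    policy has raised the post-edit's log-probability relative to the
    rejected MT output, both measured against a frozen reference model."""
    margin = beta * ((lp_edit - lp_edit_ref) - (lp_mt - lp_mt_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

No dictionary constraint appears anywhere in the loss: the preference pairs alone teach the model which of several valid term translations the style guide prefers.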