Active learning (AL) techniques reduce labeling costs for training neural machine translation (NMT) models by selecting smaller representative subsets from unlabeled data for annotation. Diversity sampling techniques select heterogeneous instances, while uncertainty sampling methods select instances with the highest model uncertainty. Both approaches have limitations: diversity methods may extract varied but trivial examples, while uncertainty sampling can yield repetitive, uninformative instances. To bridge this gap, we propose HUDS, a hybrid AL strategy for domain adaptation in NMT that combines uncertainty and diversity for sentence selection. HUDS computes uncertainty scores for unlabeled sentences and subsequently stratifies them. It then clusters the sentence embeddings within each stratum using k-MEANS and computes diversity scores as the distance to the cluster centroid. A weighted hybrid score that combines uncertainty and diversity is then used to select the top instances for annotation in each AL iteration. Experiments on multi-domain German-English datasets demonstrate that HUDS outperforms other strong AL baselines. We analyze the sentence selection of HUDS and show that it prioritizes diverse instances with high model uncertainty for annotation in early AL iterations.
https://arxiv.org/abs/2403.09259
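A minimal sketch of the hybrid selection step described above, assuming sentence embeddings and per-sentence uncertainty scores have already been computed; the stratum count, cluster count k, mixing weight lam, and budget are illustrative placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def huds_select(embeddings, uncertainty, n_strata=5, k=10, lam=0.5, budget=100):
    """Hybrid uncertainty-diversity selection (illustrative sketch of HUDS).

    embeddings:  (N, d) sentence embeddings of the unlabeled pool
    uncertainty: (N,)   model uncertainty score per sentence
    """
    embeddings = np.asarray(embeddings, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)

    order = np.argsort(uncertainty)              # stratify by uncertainty
    strata = np.array_split(order, n_strata)
    diversity = np.zeros_like(uncertainty)
    for stratum in strata:                       # k-means within each stratum
        km = KMeans(n_clusters=min(k, len(stratum)), n_init=10).fit(embeddings[stratum])
        centroids = km.cluster_centers_[km.labels_]
        # diversity score: distance of each sentence to its cluster centroid
        diversity[stratum] = np.linalg.norm(embeddings[stratum] - centroids, axis=1)

    def minmax(x):                               # put both signals on [0, 1]
        return (x - x.min()) / (x.max() - x.min() + 1e-12)

    hybrid = lam * minmax(uncertainty) + (1 - lam) * minmax(diversity)
    return np.argsort(-hybrid)[:budget]          # top instances for annotation
```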
NLP in the age of monolithic large language models is approaching its limits in terms of the size and information that can be handled. The trend is toward modularization, a necessary step in the direction of designing smaller sub-networks and components with specialized functionality. In this paper, we present the MAMMOTH toolkit: a framework designed for training massively multilingual modular machine translation systems at scale, initially derived from OpenNMT-py and then adapted to ensure efficient training across computation clusters. We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future work. The toolkit is publicly available online.
https://arxiv.org/abs/2403.07544
Foundation models are usually pre-trained on large-scale datasets and then adapted to downstream tasks through tuning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impacts on downstream tasks. Specifically, through extensive experiments of fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that, while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are significantly different. These observations are agnostic to the scale of the pre-training dataset, the pre-training noise type, the model architecture, the pre-training objective, the downstream tuning method, and the downstream application. We empirically ascertain that the reason behind this is that pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization, which is applicable in both parameter-efficient and black-box tuning manners. We additionally conduct extensive experiments on popular vision and language models, including APIs, that were pre-trained with supervised and self-supervised objectives on realistic noisy data, for evaluation. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term Noisy Model Learning.
https://arxiv.org/abs/2403.06869
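The abstract states only that NMTune applies an affine transformation to the (possibly black-box) feature space before downstream tuning; the sketch below illustrates that setup with a plain cross-entropy objective, since the paper's actual regularization terms are not given here. The class count, feature dimension, and training loss are assumptions.

```python
import torch
import torch.nn as nn

class AffineTuningHead(nn.Module):
    """Illustrative sketch: learn an affine map on frozen pre-trained features.

    The backbone can be a black box (e.g., an embedding API); only the affine
    transform and the task head are trained, which keeps the recipe
    parameter-efficient. The exact NMTune objective is in the paper; this
    sketch trains with plain cross-entropy instead.
    """
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.affine = nn.Linear(feat_dim, feat_dim)  # W x + b on the feature space
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(self.affine(feats))

# usage: feats would come from a frozen or black-box encoder
feats = torch.randn(32, 768)           # stand-in for pre-trained features
labels = torch.randint(0, 10, (32,))
model = AffineTuningHead(768, 10)
loss = nn.functional.cross_entropy(model(feats), labels)
loss.backward()
```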
Large language models (LLMs) have achieved promising performance in multilingual machine translation tasks through zero/few-shot prompting or prompt-tuning. However, due to the mixture of multilingual data during the pre-training of LLMs, LLM-based translation models face the off-target issue in both prompt-based settings, manifesting as a series of phenomena, namely instruction misunderstanding, translation into the wrong language, and over-generation. To address this issue, this paper introduces an Auto-Constriction Turning mechanism for Multilingual Neural Machine Translation (ACT-MNMT), a novel supervised fine-tuning mechanism that is orthogonal to traditional prompt-based methods. In this method, ACT-MNMT automatically constructs a constrained template on the target side by adding trigger tokens ahead of the ground truth. Furthermore, trigger tokens can be arranged and combined freely to represent different task semantics, and they can be iteratively updated to maximize the label likelihood. Experiments are performed on WMT test sets with multiple metrics, and the results demonstrate that ACT-MNMT achieves substantially improved performance across multiple translation directions and reduces off-target phenomena in translation.
https://arxiv.org/abs/2403.06745
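A sketch of the target-side template construction described above; the token IDs, the number of trigger tokens, and the helper name are illustrative. The trigger tokens are extra vocabulary entries whose embeddings are updated during supervised fine-tuning, which is how the label likelihood is maximized.

```python
# Illustrative target-side template: trigger tokens go ahead of the ground
# truth, so the model learns to emit them before the translation proper.
def build_target_template(trigger_ids, reference_ids, eos_id=2):
    """trigger_ids encode task semantics (e.g., a target-language marker);
    reference_ids is the tokenized ground-truth translation."""
    return list(trigger_ids) + list(reference_ids) + [eos_id]

# e.g., two trigger tokens (hypothetical IDs) marking the intended task
tgt = build_target_template([50001, 50002], [17, 842, 9, 311])
# fine-tuning then minimizes -log p(tgt | src) token by token; the loss on
# the prepended triggers steers decoding away from off-target output
print(tgt)  # [50001, 50002, 17, 842, 9, 311, 2]
```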
Existing Neural Machine Translation (NMT) models mainly handle translation in the general domain, while overlooking domains with special writing formulas, such as e-commerce and legal documents. Taking e-commerce as an example, the texts usually include large numbers of domain-related terms and exhibit more grammatical problems, which leads to inferior performance from current NMT methods. To address these problems, we collect two domain-related resources: a set of term pairs (aligned Chinese-English bilingual terms) and a parallel corpus annotated for the e-commerce domain. Furthermore, we propose a two-step fine-tuning paradigm (named G2ST) with self-contrastive semantic enhancement to transfer a general NMT model to a specialized NMT model for e-commerce. The paradigm can also be applied to NMT models based on large language models (LLMs). Extensive evaluations on real e-commerce titles demonstrate the superior translation quality and robustness of our G2ST approach compared with state-of-the-art NMT models such as LLaMA, Qwen, GPT-3.5, and even GPT-4.
https://arxiv.org/abs/2403.03689
adaptNMT is an open-source application that offers a streamlined approach to the development and deployment of Recurrent Neural Networks and Transformer models. This application is built upon the widely-adopted OpenNMT ecosystem, and is particularly useful for new entrants to the field, as it simplifies the setup of the development environment and creation of train, validation, and test splits. The application offers a graphing feature that illustrates the progress of model training, and employs SentencePiece for creating subword segmentation models. Furthermore, the application provides an intuitive user interface that facilitates hyperparameter customization. Notably, a single-click model development approach has been implemented, and models developed by adaptNMT can be evaluated using a range of metrics. To encourage eco-friendly research, adaptNMT incorporates a green report that flags the power consumption and kgCO$_{2}$ emissions generated during model development. The application is freely available.
https://arxiv.org/abs/2403.03582
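For intuition about the green report mentioned above, here is a back-of-the-envelope version of the underlying arithmetic: energy in kWh is power times time, and emissions are energy times a grid carbon-intensity factor. The 0.4 kgCO2/kWh default is a rough illustrative grid average, not a value taken from adaptNMT.

```python
def estimate_kg_co2(gpu_power_watts: float, hours: float,
                    grid_intensity_kg_per_kwh: float = 0.4) -> float:
    """Back-of-the-envelope emissions estimate of the kind a green report flags.

    energy (kWh) = power (kW) * time (h); emissions = energy * grid intensity.
    """
    kwh = (gpu_power_watts / 1000.0) * hours
    return kwh * grid_intensity_kg_per_kwh

print(estimate_kg_co2(gpu_power_watts=300, hours=10))  # 3.0 kWh -> ~1.2 kgCO2
```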
Neural machine translation (NMT) has progressed rapidly in the past few years, promising improvements and quality translations for different languages. Evaluation of this task is crucial to determine the quality of the translation, yet traditional methods place insufficient emphasis on the actual sense of the translation. We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text. This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet. By calculating the semantic distance between the source and the back-translation of the output, our method introduces a quantifiable approach that enables sentence comparison on the same linguistic level. Factual analysis shows a strong correlation between the average evaluation scores generated by our method and human assessments across various machine translation systems for the English-German language pair. Finally, our method offers a new multilingual approach to ranking MT systems without the need for parallel corpora.
https://arxiv.org/abs/2403.03521
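A sketch of the bidirectional scheme described above: the output is translated back into the source language so that the source and the back-translation can be compared on the same linguistic level. `mt_system` and `sense_similarity` are hypothetical stand-ins; the paper computes the similarity over BabelNet senses.

```python
def bidirectional_score(source: str, mt_system, sense_similarity) -> float:
    """Score a translation by the sense distance between the source and the
    back-translation of the MT output (higher similarity = better)."""
    output = mt_system.translate(source, direction="en-de")  # forward pass
    back = mt_system.translate(output, direction="de-en")    # back-translation
    # both sentences are now in the same language, hence directly comparable
    return sense_similarity(source, back)
```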
adaptNMT streamlines all processes involved in the development and deployment of RNN and Transformer neural translation models. As an open-source application, it is designed for both technical and non-technical users who work in the field of machine translation. Built upon the widely-adopted OpenNMT ecosystem, the application is particularly useful for new entrants to the field since the setup of the development environment and creation of train, validation and test splits is greatly simplified. Graphing, embedded within the application, illustrates the progress of model training, and SentencePiece is used for creating subword segmentation models. Hyperparameter customization is facilitated through an intuitive user interface, and a single-click model development approach has been implemented. Models developed by adaptNMT can be evaluated using a range of metrics, and deployed as a translation service within the application. To support eco-friendly research in the NLP space, a green report also flags the power consumption and kgCO$_{2}$ emissions generated during model development. The application is freely available.
https://arxiv.org/abs/2403.02367
In this study, a human evaluation is carried out on how hyperparameter settings impact the quality of Transformer-based Neural Machine Translation (NMT) for the low-resource English-Irish pair. SentencePiece models using both Byte Pair Encoding (BPE) and unigram approaches were appraised. Variations in model architectures included modifying the number of layers, evaluating the optimal number of attention heads, and testing various regularisation techniques. The greatest performance improvement was recorded for a Transformer-optimized model with a 16k BPE subword model. Compared with a baseline Recurrent Neural Network (RNN) model, the Transformer-optimized model demonstrated a BLEU score improvement of 7.8 points. When benchmarked against Google Translate, our translation engines demonstrated significant improvements. Furthermore, a quantitative fine-grained manual evaluation was conducted to compare the performance of the machine translation systems. Using the Multidimensional Quality Metrics (MQM) error taxonomy, a human evaluation of the error types generated by an RNN-based system and a Transformer-based system was carried out. Our findings show that the best-performing Transformer system significantly reduces both accuracy and fluency errors when compared with an RNN-based model.
https://arxiv.org/abs/2403.02366
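The best configuration above uses a 16k BPE subword model; a minimal sketch of training one with SentencePiece follows. The corpus path and model prefix are placeholders.

```python
import sentencepiece as spm

# Train a 16k BPE subword model of the kind the study found optimal;
# corpus.en-ga.txt is a placeholder path to the training text.
spm.SentencePieceTrainer.train(
    input="corpus.en-ga.txt",
    model_prefix="bpe16k",
    vocab_size=16000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe16k.model")
print(sp.encode("Sample sentence to segment.", out_type=str))
```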
In the current machine translation (MT) landscape, the Transformer architecture stands out as the gold standard, especially for high-resource language pairs. This research delves into its efficacy for low-resource language pairs including both the English$\leftrightarrow$Irish and English$\leftrightarrow$Marathi language pairs. Notably, the study identifies the optimal hyperparameters and subword model type to significantly improve the translation quality of Transformer models for low-resource language pairs. The scarcity of parallel datasets for low-resource languages can hinder MT development. To address this, gaHealth was developed, the first bilingual corpus of health data for the Irish language. Focusing on the health domain, models developed using this in-domain dataset exhibited very significant improvements in BLEU score when compared with models from the LoResMT2021 Shared Task. A subsequent human evaluation using the multidimensional quality metrics error taxonomy showcased the superior performance of the Transformer system in reducing both accuracy and fluency errors compared to an RNN-based counterpart. Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source applications streamlined for the development, fine-tuning, and deployment of neural machine translation models. These tools considerably simplify the setup and evaluation process, making MT more accessible to both developers and translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes eco-friendly natural language processing research by highlighting the environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM demonstrated advancements in translation performance for two low-resource language pairs: English$\leftrightarrow$Irish and English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021 Shared Task.
https://arxiv.org/abs/2403.01580
Autoregressive (AR) and non-autoregressive (NAR) models are two types of generative models for Neural Machine Translation (NMT). AR models predict tokens in a word-by-word manner and can effectively capture the distribution of real translations. NAR models predict tokens by extracting bidirectional contextual information, which improves inference speed, but they suffer from performance degradation. Previous works utilized AR models to enhance NAR models by reducing the complexity of the training data, or incorporated global information into AR models by virtue of NAR models. However, these methods only take advantage of the contextual information of a single type of model, while neglecting the diversity of contextual information that different types of models can provide. In this paper, we propose a novel generic collaborative learning method, DCMCL, where AR and NAR models are treated as collaborators instead of teachers and students. To hierarchically leverage the bilateral contextual information, token-level mutual learning and sequence-level contrastive learning are adopted between the AR and NAR models. Extensive experiments on four widely used benchmarks show that the proposed DCMCL method can simultaneously improve both AR and NAR models by up to 1.38 and 2.98 BLEU points, respectively, and can also outperform the current best unified model by up to 0.97 BLEU points for both AR and NAR decoding.
https://arxiv.org/abs/2402.18428
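A sketch of the token-level mutual-learning half of DCMCL, as described above: each model is pulled toward the other's token distribution, rather than one fixed teacher distilling into a student. The symmetric-KL formulation, and the omission of sequence-level contrastive learning and loss weighting, are simplifications for illustration.

```python
import torch.nn.functional as F

def token_mutual_learning_loss(ar_logits, nar_logits):
    """Illustrative token-level mutual learning between collaborators.

    ar_logits, nar_logits: (batch, seq_len, vocab_size)
    """
    ar_logp = F.log_softmax(ar_logits, dim=-1)
    nar_logp = F.log_softmax(nar_logits, dim=-1)
    # train the NAR model toward the AR model's (detached) distribution,
    # and vice versa, so both collaborators learn from each other
    nar_loss = F.kl_div(nar_logp, ar_logp.exp().detach(), reduction="batchmean")
    ar_loss = F.kl_div(ar_logp, nar_logp.exp().detach(), reduction="batchmean")
    return ar_loss + nar_loss
```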
Generally, decoder-only large language models (LLMs) are adapted to context-aware neural machine translation (NMT) in a concatenating way: the LLM takes the concatenation of the source sentence (i.e., the intra-sentence context) and the inter-sentence context as input, and then generates the target tokens sequentially. This adaptation strategy, i.e., the concatenation mode, treats intra-sentence and inter-sentence contexts with the same priority, despite an apparent difference between the two kinds of contexts. In this paper, we propose an alternative adaptation approach, named Decoding-enhanced Multi-phase Prompt Tuning (DeMPT), to make LLMs discriminately model and utilize the inter- and intra-sentence context and more effectively adapt LLMs to context-aware NMT. First, DeMPT divides the context-aware NMT process into three separate phases. During each phase, different continuous prompts are introduced to make LLMs discriminately model the various kinds of information. Second, DeMPT employs a heuristic way to further discriminately enhance the utilization of the source-side inter- and intra-sentence information at the final decoding phase. Experiments show that our approach significantly outperforms the concatenation method, and further improves the performance of LLMs in discourse modeling.
https://arxiv.org/abs/2402.15200
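A sketch of the phase-specific continuous prompts described above: one learned prompt block per phase, prepended to that phase's input embeddings so the model handles each kind of information separately. The phase count of three follows the abstract; the prompt length and model width are assumptions.

```python
import torch
import torch.nn as nn

class PhasePrompts(nn.Module):
    """Illustrative phase-specific continuous prompts for multi-phase tuning.

    One learned prompt block per phase (e.g., inter-sentence context,
    intra-sentence context, decoding) is prepended to that phase's input
    embeddings. Sizes here are illustrative, not the paper's configuration.
    """
    def __init__(self, n_phases: int = 3, prompt_len: int = 16, d_model: int = 1024):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_phases, prompt_len, d_model) * 0.02)

    def forward(self, phase: int, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq, d_model)
        batch = input_embeds.size(0)
        prompt = self.prompts[phase].unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)
```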
Neural Machine Translation (NMT) continues to improve in quality and adoption, yet the inadvertent perpetuation of gender bias remains a significant concern. Despite numerous studies on gender bias in translations into English from weakly gendered languages, there are no benchmarks for evaluating this phenomenon or for assessing mitigation strategies. To address this gap, we introduce GATE X-E, an extension to the GATE (Rarrick et al., 2023) corpus, which consists of human translations from Turkish, Hungarian, Finnish, and Persian into English. Each translation is accompanied by feminine, masculine, and neutral variants. The dataset, which contains between 1250 and 1850 instances for each of the four language pairs, features natural sentences with a wide range of sentence lengths and domains, challenging translation rewriters on various linguistic phenomena. Additionally, we present a translation gender rewriting solution built with GPT-4 and use GATE X-E to evaluate it. We open-source our contributions to encourage further research on gender debiasing.
https://arxiv.org/abs/2402.14277
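For concreteness, here is a hypothetical shape for one GATE X-E record, inferred from the description above; the field names are assumptions, not the released schema.

```python
# One Turkish-English instance with its three gender variants (illustrative):
example = {
    "source": "O bir doktor.",          # Turkish, with a weakly gendered pronoun
    "translations": {
        "feminine":  "She is a doctor.",
        "masculine": "He is a doctor.",
        "neutral":   "They are a doctor.",
    },
}
```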
With the rapid advancement of Neural Machine Translation (NMT), enhancing translation efficiency and quality has become a focal point of research. Despite the commendable performance of general models such as the Transformer in various aspects, they still fall short in processing long sentences and in fully leveraging bidirectional contextual information. This paper introduces an improved model based on the Transformer, implementing an asynchronous and segmented bidirectional decoding strategy aimed at elevating translation efficiency and accuracy. Compared to traditional unidirectional translation, from left to right or right to left, our method demonstrates heightened efficiency and improved translation quality, particularly in handling long sentences. Experimental results on the IWSLT2017 dataset confirm the effectiveness of our approach in accelerating translation and increasing accuracy, especially surpassing traditional unidirectional strategies in long-sentence translation. Furthermore, this study analyzes the impact of sentence length on decoding outcomes and explores the model's performance in various scenarios. The findings of this research not only provide an effective decoding strategy for the NMT field but also pave new avenues for future studies.
https://arxiv.org/abs/2402.14849
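The abstract does not spell out the decoding mechanics, so the following is a heavily hedged sketch of one way "asynchronous and segmented bidirectional decoding" could be realized: decode a left segment left-to-right and a right segment right-to-left concurrently, then join them. `decode_l2r` and `decode_r2l` are hypothetical model calls; the paper's actual segmentation and synchronization scheme may differ.

```python
from concurrent.futures import ThreadPoolExecutor

def bidirectional_decode(src, decode_l2r, decode_r2l, max_len):
    """Decode two target segments in parallel and concatenate them:
    the prefix is generated left-to-right, the suffix right-to-left."""
    half = max_len // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(decode_l2r, src, half)             # prefix, L2R
        right = pool.submit(decode_r2l, src, max_len - half)  # suffix, R2L
        return left.result() + right.result()
```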
Conventional fair graph clustering methods face two primary challenges: i) they prioritize balanced clusters at the expense of cluster cohesion by imposing rigid constraints; ii) existing methods for both individual- and group-level fairness in graph partitioning mostly rely on eigen decompositions and thus generally lack interpretability. To address these issues, we propose iFairNMTF, an individual Fairness Nonnegative Matrix Tri-Factorization model with contrastive fairness regularization that achieves balanced and cohesive clusters. By introducing fairness regularization, our model allows for customizable accuracy-fairness trade-offs, thereby enhancing user autonomy without compromising the interpretability provided by nonnegative matrix tri-factorization. Experimental evaluations on real and synthetic datasets demonstrate the superior flexibility of iFairNMTF in achieving fairness and clustering performance.
https://arxiv.org/abs/2402.10756
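A sketch of the objective implied above: a nonnegative tri-factorization of the adjacency matrix into cluster memberships and cluster interactions, plus a weighted fairness regularizer. The contrastive fairness term is paper-specific, so it is passed in as a callable here; `lam` realizes the customizable accuracy-fairness trade-off.

```python
import numpy as np

def ifairnmtf_objective(A, H, W, fairness_penalty, lam=1.0):
    """Illustrative objective: A ~ H W H^T with a fairness regularizer.

    A: (n, n) adjacency matrix; H: (n, c) nonnegative cluster memberships;
    W: (c, c) nonnegative cluster-interaction matrix;
    fairness_penalty: callable on H implementing the contrastive fairness term.
    """
    recon = np.linalg.norm(A - H @ W @ H.T, "fro") ** 2
    return recon + lam * fairness_penalty(H)
```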
Large language models (LLMs) have demonstrated promising potential in various downstream tasks, including machine translation. However, prior work on LLM-based machine translation has mainly focused on better utilizing training data, demonstrations, or pre-defined and universal knowledge to improve performance, without considering decision-making the way human translators do. In this paper, we incorporate Thinker with the Drift-Diffusion Model (Thinker-DDM) to address this issue. We then redefine the Drift-Diffusion process to emulate human translators' dynamic decision-making under constrained resources. We conduct extensive experiments under high-resource, low-resource, and commonsense translation settings using the WMT22 and CommonMT datasets, in which Thinker-DDM outperforms baselines in the first two scenarios. We also perform additional analysis and evaluation on commonsense translation to illustrate the effectiveness and efficacy of the proposed method.
https://arxiv.org/abs/2402.10699
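For intuition, here is the textbook drift-diffusion process the method builds on: evidence accumulates with a drift plus Gaussian noise until it crosses a decision boundary. How Thinker-DDM couples this to translation decisions is described in the paper; all parameter values here are illustrative.

```python
import numpy as np

def drift_diffusion(drift, threshold=1.0, noise=0.1, dt=0.01,
                    max_steps=10_000, rng=None):
    """Generic drift-diffusion simulation: accumulate noisy evidence until a
    boundary is crossed. Returns (decision, steps): +1/-1 for the boundary
    reached, 0 on timeout."""
    rng = rng or np.random.default_rng(0)
    evidence = 0.0
    for step in range(1, max_steps + 1):
        evidence += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
        if evidence >= threshold:
            return +1, step
        if evidence <= -threshold:
            return -1, step
    return 0, max_steps

print(drift_diffusion(drift=0.5))  # e.g., (1, n_steps) for a positive drift
```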
Motivated by the success of unsupervised neural machine translation (UNMT), we introduce an unsupervised sign language translation and generation network (USLNet), which learns from abundant single-modality (text and video) data without parallel sign language data. USLNet comprises two main components: single-modality reconstruction modules (text and video) that rebuild the input from its noisy version in the same modality, and cross-modality back-translation modules (text-video-text and video-text-video) that reconstruct the input from its noisy version in the other modality using a back-translation procedure. Unlike the single-modality back-translation procedure in text-based UNMT, USLNet faces a cross-modality discrepancy in feature representation, in which the length and feature dimension of text and video sequences are mismatched. We propose a sliding window method to address the issue of aligning variable-length text with video sequences. To our knowledge, USLNet is the first unsupervised sign language translation and generation model capable of generating both natural language text and sign language video in a unified manner. Experimental results on the BBC-Oxford Sign Language dataset (BOBSL) and the Open-Domain American Sign Language dataset (OpenASL) reveal that USLNet achieves competitive results compared to supervised baseline models, indicating its effectiveness in sign language translation and generation.
https://arxiv.org/abs/2402.07726
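A sketch of sliding-window alignment between a short text sequence and a much longer video sequence, assuming both modalities have already been projected to a shared dimension d; the window size, stride, and cosine similarity are assumptions of this sketch, not the paper's configuration.

```python
import numpy as np

def sliding_window_align(text_feats, video_feats, window=8, stride=4):
    """Pool each video window, then align every text feature to its most
    similar window, handling the length mismatch between modalities.

    text_feats: (T, d), video_feats: (V, d), with V typically >> T.
    Returns the best-matching window index per text step.
    """
    starts = range(0, max(1, len(video_feats) - window + 1), stride)
    pooled = np.stack([video_feats[s:s + window].mean(axis=0) for s in starts])

    def unit(x):  # L2-normalize so the dot product is cosine similarity
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

    sims = unit(text_feats) @ unit(pooled).T   # (T, num_windows)
    return sims.argmax(axis=1)
```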
This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system across four language pairs in the legal domain. It combines automatic evaluation metrics (AEMs) and human evaluation (HE) by professional translators to assess translation ranking, fluency and adequacy. The results indicate that while Google Translate generally outperforms LLMs in AEMs, human evaluators rate LLMs, especially GPT-4, comparably or slightly better in terms of producing contextually adequate and fluent translations. This discrepancy suggests LLMs' potential in handling specialized legal terminology and context, highlighting the importance of human evaluation methods in assessing MT quality. The study underscores the evolving capabilities of LLMs in specialized domains and calls for reevaluation of traditional AEMs to better capture the nuances of LLM-generated translations.
https://arxiv.org/abs/2402.07681
We conducted a detailed analysis of the quality of web-mined corpora for two low-resource languages, Sinhala and Tamil (making three language pairs: English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with models trained on human-curated datasets.
https://arxiv.org/abs/2402.07446
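A sketch of the ranking-and-slicing step described above; `similarity` is a stand-in for the paper's similarity measure (cosine similarity over multilingual sentence embeddings is one common choice), and the 25k cut matches the highest-ranked portion mentioned in the abstract.

```python
def top_ranked_portion(pairs, similarity, portion=25_000):
    """Rank web-mined sentence pairs by a cross-lingual similarity measure
    and keep the top-ranked portion for NMT training.

    pairs: list of (src, tgt) strings from the web-mined corpus.
    similarity: callable scoring how well src and tgt correspond.
    """
    ranked = sorted(pairs, key=lambda p: similarity(p[0], p[1]), reverse=True)
    return ranked[:portion]
```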
Standard context-aware neural machine translation (NMT) typically relies on parallel document-level data, exploiting both source and target contexts. Concatenation-based approaches in particular, which are still a strong baseline for document-level NMT, prepend source and/or target context sentences to the sentences to be translated, with model variants that exploit equal amounts of source and target data on each side achieving state-of-the-art results. In this work, we investigate whether target data should be further promoted within standard concatenation-based approaches, as most document-level phenomena rely on information that is present on the target-language side. We evaluate novel concatenation-based variants where the target context is prepended on the source side, either in isolation or in combination with the source context. Experimental results in English-Russian and Basque-Spanish show that including target context on the source side leads to large improvements on target-language phenomena. On source-dependent phenomena, using only target-language context on the source side achieves parity with state-of-the-art concatenation approaches, or slightly underperforms, whereas combining source and target context on the source side leads to significant gains across the board.
https://arxiv.org/abs/2402.06342
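A sketch of how the concatenation-based variants above assemble the model input: target-side context (and optionally source-side context) is prepended, on the source side, to the sentence to be translated. The separator token and the Basque-Spanish toy sentences are illustrative.

```python
SEP = "<sep>"  # context separator; the actual marker is model-specific

def build_source_input(src, src_context=None, tgt_context=None):
    """Assemble the source-side input for a concatenation-based variant:
    target-language context first, then source-language context, then the
    sentence to be translated."""
    parts = []
    if tgt_context:
        parts.extend(tgt_context)   # target-language context sentences
    if src_context:
        parts.extend(src_context)   # source-language context sentences
    parts.append(src)
    return f" {SEP} ".join(parts)

# Basque source with one Spanish target-context sentence:
print(build_source_input("Kaixo mundua .", tgt_context=["Hola a todos ."]))
# -> "Hola a todos . <sep> Kaixo mundua ."
```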