This paper addresses the challenge of accurately translating technical terms, which are crucial for clear communication in specialized fields. We introduce the Parenthetical Terminology Translation (PTT) task, designed to mitigate potential inaccuracies by displaying the original term in parentheses alongside its translation. To implement this approach, we generated a representative PTT dataset using a collaborative approach with large language models and applied knowledge distillation to fine-tune traditional Neural Machine Translation (NMT) models and small-sized Large Language Models (sLMs). Additionally, we developed a novel evaluation metric to assess both overall translation accuracy and the correct parenthetical presentation of terms. Our findings indicate that sLMs did not consistently outperform NMT models, with fine-tuning proving more effective than few-shot prompting, particularly in models with continued pre-training in the target language. These insights contribute to the advancement of more reliable terminology translation methodologies.
https://arxiv.org/abs/2410.00683
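The parenthetical check at the heart of the PTT evaluation can be illustrated with a few lines of string matching. The sketch below is a minimal, hypothetical version of such a metric, assuming the source technical terms for each sentence are known in advance; it is not the paper's actual scoring formula.

```python
import re

def parenthetical_term_score(translation: str, source_terms: list) -> float:
    """Fraction of source terms that appear inside parentheses in the translation."""
    if not source_terms:
        return 1.0
    spans = re.findall(r"\(([^)]*)\)", translation)
    hits = sum(
        any(term.lower() in span.lower() for span in spans)
        for term in source_terms
    )
    return hits / len(source_terms)

# Hypothetical usage: the source term should reappear in parentheses next to its translation.
print(parenthetical_term_score("어텐션 메커니즘(attention mechanism)을 사용한다.",
                               ["attention mechanism"]))  # -> 1.0
```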
For cross-lingual conversation and trade, Neural Machine Translation (NMT) is pivotal yet faces persistent challenges with monotony and repetition in generated content. Traditional solutions that rely on penalizing text redundancy or token reoccurrence have shown limited efficacy, particularly for lengthy articles and e-commerce descriptions with inherent redundancy, even with the advent of Large Language Models (LLMs). This paper investigates the underlying causes of textual repetition through the lens of information entropy, attributing the phenomenon to the elevated uncertainty within the input text. To address this, a novel algorithm named Contrastive Token Learning with Similarity Decay (CTSD) is introduced, which modulates the suppression of tokens dynamically, informed by varying attention weights and inter-token distances. Furthermore, an e-commerce dataset composed of the title texts of real online items, which is susceptible to hallucinated translations, is compiled and released to benchmark the algorithm. Extensive evaluations demonstrate that CTSD significantly outperforms existing approaches in precision and generalizability. Additional online A/B testing underscores its practical value, showing marked improvements in user engagement and conversion. Notably, this method has been implemented with full traffic on eight multilingual sites of this http URL, the largest B2B e-commerce platform in the world.
https://arxiv.org/abs/2409.19877
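The abstract gives only the intuition behind CTSD: repeated tokens are suppressed with a strength that depends on attention weights and decays with inter-token distance. The snippet below is one plausible, heavily simplified reading of that idea as a decoding-time logit penalty; the decay schedule and function signature are my own illustration, not the released algorithm.

```python
import numpy as np

def penalized_logits(logits, prev_token_ids, attn_to_prev, alpha=1.0, tau=5.0):
    """Down-weight the logits of previously generated tokens.

    logits         : (vocab_size,) scores for the next position
    prev_token_ids : ids of already generated tokens, oldest first
    attn_to_prev   : attention weights from the current position to each previous token
    The penalty grows with the attention paid to a repeated token and decays
    exponentially with its distance, standing in for the "similarity decay" idea.
    """
    out = logits.copy()
    n = len(prev_token_ids)
    for i, (tok, attn) in enumerate(zip(prev_token_ids, attn_to_prev)):
        distance = n - i                    # how far back the token occurred
        out[tok] -= alpha * attn * np.exp(-distance / tau)
    return out

# Toy usage with a 5-token vocabulary.
print(penalized_logits(np.zeros(5), prev_token_ids=[2, 2, 4],
                       attn_to_prev=[0.1, 0.3, 0.6]))
```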
Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advancements in Large Language Models (LLMs) have further enhanced the performance of embedding models, which are trained on massive amounts of text covering almost every domain. These models are often benchmarked on general-purpose datasets like Massive Text Embedding Benchmark (MTEB), where they demonstrate superior performance. However, a critical question arises: Is the development of domain-specific embedding models necessary when general-purpose models are trained on vast corpora that already include specialized domain texts? In this paper, we empirically investigate this question, choosing the finance domain as an example. We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a counterpart to MTEB that consists of financial domain-specific text datasets. We evaluate the performance of seven state-of-the-art embedding models on FinMTEB and observe a significant performance drop compared to their performance on MTEB. To account for the possibility that this drop is driven by FinMTEB's higher complexity, we propose four measures to quantify dataset complexity and control for this factor in our analysis. Our analysis provides compelling evidence that state-of-the-art embedding models struggle to capture domain-specific linguistic and semantic patterns, even when trained on large general-purpose corpora. This study sheds light on the necessity of developing domain-specific embedding models in the LLM era, offering valuable insights for researchers and practitioners.
https://arxiv.org/abs/2409.18511
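Benchmarking an embedding model on a domain-specific dataset reduces to encoding the texts and scoring a standard task. The sketch below runs an off-the-shelf sentence-embedding model on a toy STS-style financial set and reports Spearman correlation; the model name and the three example pairs are placeholders, and this is not the FinMTEB harness itself.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder financial sentence pairs with made-up gold similarity scores (0-5 scale).
pairs = [
    ("The firm reported a rise in quarterly net income.",
     "Quarterly profits at the company increased.", 4.5),
    ("The central bank left its policy rate unchanged.",
     "Interest rates were held steady by the central bank.", 4.0),
    ("The central bank left its policy rate unchanged.",
     "Shares of the retailer fell after weak guidance.", 0.5),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model under test
emb_a = model.encode([a for a, _, _ in pairs])
emb_b = model.encode([b for _, b, _ in pairs])

pred = [cosine_similarity([x], [y])[0, 0] for x, y in zip(emb_a, emb_b)]
gold = [g for _, _, g in pairs]
print("Spearman correlation:", spearmanr(pred, gold).correlation)
```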
One of the growing trends in machine learning is the use of data generation techniques, since the performance of machine learning models depends on the size of the training dataset. However, in many medical applications, collecting large datasets is challenging due to resource constraints, which leads to overfitting and poor generalization. This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL), designed to enhance classification performance on small medical datasets through synthetic data generation. The AGCL framework involves feature extraction, K-means clustering, cluster evaluation based on a class separation metric, and the generation of synthetic data points from clusters with distinct class representations. This method was applied to Parkinson's disease screening, utilizing facial expression data, and evaluated across multiple machine learning classifiers. Experimental results demonstrate that AGCL significantly improves classification accuracy compared to the baseline, GN, and kNNMTD. AGCL achieved the highest overall test accuracy of 83.33% and cross-validation accuracy of 90.90% in majority voting over different emotions, confirming its effectiveness in augmenting small datasets.
https://arxiv.org/abs/2409.17685
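The AGCL pipeline summarized above (feature extraction, K-means clustering, class-separation scoring, synthetic sampling from class-pure clusters) can be approximated with scikit-learn. Everything in the sketch below, including the purity threshold and the Gaussian sampling around cluster statistics, is an illustrative reconstruction rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def agcl_like_augment(X, y, n_clusters=8, purity_threshold=0.9,
                      n_synthetic_per_cluster=20, noise_scale=0.05, seed=0):
    """Generate synthetic points from clusters dominated by a single class.

    X: (n_samples, n_features) numpy array of extracted features; y: class labels.
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)

    X_new, y_new = [], []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        classes, counts = np.unique(y[members], return_counts=True)
        if counts.max() / counts.sum() < purity_threshold:  # keep well-separated clusters only
            continue
        majority_class = classes[counts.argmax()]
        centre = X[members].mean(axis=0)
        spread = X[members].std(axis=0) + 1e-8
        samples = rng.normal(centre, noise_scale + spread,
                             size=(n_synthetic_per_cluster, X.shape[1]))
        X_new.append(samples)
        y_new.append(np.full(n_synthetic_per_cluster, majority_class))

    if not X_new:
        return X, y
    return np.vstack([X] + X_new), np.concatenate([y] + y_new)
```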
Reinforcement Learning from Human Feedback (RLHF) and derivative techniques like Direct Preference Optimization (DPO) are task-alignment algorithms used to repurpose general, foundational models for specific tasks. We show that applying task-alignment to neural machine translation (NMT) addresses an existing task--data mismatch in NMT, leading to improvements across all languages of a multilingual model, even when task-alignment is only applied to a subset of those languages. We do so by introducing Direct Quality Optimization (DQO), a variant of DPO leveraging a pre-trained translation quality estimation model as a proxy for human preferences, and verify the improvements with both automatic metrics and human evaluation.
https://arxiv.org/abs/2409.17673
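DQO's key move is to let a quality-estimation model stand in for human preference labels. A minimal sketch of how preference pairs for a DPO-style objective could be built this way is shown below; `qe_score` is a placeholder for any pretrained QE model, and the toy length-based scorer is purely illustrative.

```python
def build_preference_pairs(source, hypotheses, qe_score):
    """Return (chosen, rejected) translations for one source sentence.

    qe_score(source, hypothesis) -> float, higher is better. Any pretrained
    quality-estimation model can play this role.
    """
    scored = sorted(hypotheses, key=lambda h: qe_score(source, h), reverse=True)
    return scored[0], scored[-1]

# Toy usage with a hypothetical length-based scorer standing in for a QE model.
pairs = build_preference_pairs(
    "Der Hund schläft.",
    ["The dog sleeps.", "Dog sleep.", "The dog is sleeping."],
    qe_score=lambda src, hyp: -abs(len(hyp.split()) - len(src.split())),
)
print(pairs)  # (chosen, rejected) pair fed to a DPO-style loss
```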
This article describes the submission of Huawei Translation Service Center (HW-TSC) to the Translation into Low-Resource Languages of Spain task at WMT 2024. We participated in three translation tasks: Spanish to Aragonese (es-arg), Spanish to Aranese (es-arn), and Spanish to Asturian (es-ast). For these three translation tasks, we applied training strategies such as multilingual transfer, regularized dropout, forward translation and back translation, LaBSE denoising, and transductive ensemble learning to neural machine translation (NMT) models based on the deep Transformer-big architecture. By using these enhancement strategies, our submission achieved a competitive result in the final evaluation.
https://arxiv.org/abs/2409.15924
This paper describes the submissions of Huawei Translation Services Center (HW-TSC) to the WMT24 chat translation shared task in both directions of English$\leftrightarrow$German (en-de). The experiments involved fine-tuning models using chat data and exploring various strategies, including Minimum Bayesian Risk (MBR) decoding and self-training. The results show significant performance improvements in certain directions, with the MBR self-training method achieving the best results. The paper also discusses the challenges and potential avenues for further research in the field of chat translation.
https://arxiv.org/abs/2409.16331
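MBR decoding, one of the strategies mentioned above, selects from a pool of sampled hypotheses the one with the highest expected utility against the others. The sketch below implements that selection with a toy token-level F1 as the utility; a shared-task system would plug in a learned metric instead.

```python
def token_f1(hyp: str, ref: str) -> float:
    """Toy overlap utility; a real system would use BLEU, chrF, or a learned metric."""
    h, r = hyp.split(), ref.split()
    common = sum(min(h.count(t), r.count(t)) for t in set(h))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_decode(hypotheses, utility=token_f1):
    """Return the hypothesis with the highest average utility against the others."""
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(hypotheses):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

samples = ["I will come tomorrow.", "I come tomorrow.", "I'll come tomorrow."]
print(mbr_decode(samples))
```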
This paper presents the submission of Huawei Translation Services Center (HW-TSC) to the machine translation tasks of the 20th China Conference on Machine Translation (CCMT 2024). We participate in the bilingual machine translation task and the multi-domain machine translation task. For these two translation tasks, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train neural machine translation (NMT) models based on the deep Transformer-big architecture. Furthermore, to explore whether a large language model (LLM) can help improve the translation quality of NMT systems, we use supervised fine-tuning to train llama2-13b as an automatic post-editing (APE) model to improve the translation results of the NMT model on the multi-domain machine translation task. By using these enhancement strategies, our submission achieves a competitive result in the final evaluation.
https://arxiv.org/abs/2409.14842
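Using an LLM such as llama2-13b for automatic post-editing amounts to supervised fine-tuning on (source, MT hypothesis, reference) triples. The prompt format below is only an illustrative way to build such SFT records; it is not the exact template used in this submission.

```python
def build_ape_example(source: str, mt_hypothesis: str, reference: str) -> dict:
    """One supervised fine-tuning record for an automatic post-editing (APE) model."""
    prompt = (
        "Improve the machine translation so it is fluent and faithful to the source.\n"
        f"Source: {source}\n"
        f"Machine translation: {mt_hypothesis}\n"
        "Post-edited translation:"
    )
    return {"prompt": prompt, "completion": " " + reference}

example = build_ape_example(
    "他今天没来上班。",
    "He today not come to work.",
    "He did not come to work today.",
)
print(example["prompt"])
```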
This paper presents the submission of Huawei Translation Services Center (HW-TSC) to the WMT24 general machine translation (MT) shared task, where we participate in the English to Chinese (en2zh) language pair. Similar to previous years' work, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train the neural machine translation (NMT) model based on the deep Transformer-big architecture. The difference is that we also use continued pre-training, supervised fine-tuning, and contrastive preference optimization to train the large language model (LLM) based MT model. By using Minimum Bayesian risk (MBR) decoding to select the final translation from multiple hypotheses for both the NMT and LLM-based MT models, our submission receives competitive results in the final evaluation.
https://arxiv.org/abs/2409.14800
A rising interest in the modality extension of foundation language models warrants discussion of the most effective and efficient multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only GPT and encoder-decoder T5, extended with Canary-1B's speech encoder. To handle joint multimodal training, we propose a novel training framework called EMMeTT. EMMeTT improves training efficiency with the following: balanced sampling across languages, datasets, and modalities; efficient sequential data iteration; and a novel 2D bucketing scheme for multimodal data, complemented by a batch size optimizer (OOMptimizer). We show that multimodal training consistently helps with both architectures. Moreover, SALM-T5 trained with EMMeTT retains the original NMT capability while outperforming AST baselines on four-language subsets of FLORES and FLEURS. The resulting Multimodal Translation Model produces strong text and speech translation results at the same time.
https://arxiv.org/abs/2409.13523
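The 2D bucketing scheme groups multimodal examples by both audio and text length so that batches contain similarly sized items in each modality. The helper below illustrates that grouping; the bucket boundaries are invented for the example, and EMMeTT's OOMptimizer, which tunes batch sizes per bucket, is not reproduced here.

```python
from collections import defaultdict

def bucket_2d(examples, audio_edges=(5.0, 10.0, 20.0), text_edges=(16, 32, 64)):
    """Assign each (audio_seconds, n_text_tokens) example to a 2D bucket.

    examples: list of dicts with 'audio_sec' and 'n_tokens' keys.
    Returns {(audio_bucket, text_bucket): [examples]} so a sampler can draw
    batches of similarly sized items in both modalities, wasting little padding.
    """
    def bucket_index(value, edges):
        for i, edge in enumerate(edges):
            if value <= edge:
                return i
        return len(edges)

    buckets = defaultdict(list)
    for ex in examples:
        key = (bucket_index(ex["audio_sec"], audio_edges),
               bucket_index(ex["n_tokens"], text_edges))
        buckets[key].append(ex)
    return buckets

data = [{"audio_sec": 3.2, "n_tokens": 12}, {"audio_sec": 18.0, "n_tokens": 70}]
print({k: len(v) for k, v in bucket_2d(data).items()})
```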
This thesis argues that the currently widely used Natural Language Processing algorithms possibly have various limitations related to the properties of the texts they handle and produce. As these tools are being rapidly and widely adopted, we must ask what these limitations are and what the possible implications are of integrating such tools even more deeply into our daily lives. As a testbed, we have chosen the task of Neural Machine Translation (NMT). Nevertheless, we aim for general insights and outcomes, applicable even to current Large Language Models (LLMs). We ask whether the algorithms used in NMT have inherent inductive biases that are beneficial for most types of inputs but might harm the processing of untypical texts. To explore this hypothesis, we define a set of measures to quantify text diversity based on its statistical properties, such as the uniformity or rhythmicity of word-level surprisal, on multiple scales (sentence, discourse, language). We then conduct a series of experiments to investigate whether NMT systems struggle to maintain the diversity of such texts, potentially reducing the richness of the language generated by these systems compared to human translators. We search for potential causes of these limitations rooted in training objectives and decoding algorithms. Our ultimate goal is to develop alternatives that do not enforce uniformity in the distribution of statistical properties in the output and that allow for better global planning of the translation, taking into account the intrinsic ambiguity of the translation task.
https://arxiv.org/abs/2409.09568
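One of the proposed diversity measures, the uniformity of word-level surprisal, can be approximated with any causal language model by computing per-token negative log-probabilities and summarizing their spread. The GPT-2-based sketch below shows that measurement in principle; it is not the thesis's exact metric definition.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal_profile(text: str):
    """Per-token surprisal (negative log2 probability) under GPT-2."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (-token_lp / torch.log(torch.tensor(2.0))).squeeze(0)

s = surprisal_profile("The translation reads smoothly but says very little.")
print("mean surprisal:", s.mean().item(), "std (uniformity proxy):", s.std().item())
```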
While neural machine translation (NMT) models achieve success in our daily lives, they show vulnerability to adversarial attacks. Despite being harmful, these attacks also offer benefits for interpreting and enhancing NMT models, thus drawing increased research attention. However, existing studies on adversarial attacks are insufficient in both attacking ability and human imperceptibility due to their sole focus on the scope of language. This paper proposes a novel vision-fused attack (VFA) framework to acquire powerful adversarial text, i.e., more aggressive and stealthy. Regarding the attacking ability, we design the vision-merged solution space enhancement strategy to enlarge the limited semantic solution space, which enables us to search for adversarial candidates with higher attacking ability. For human imperceptibility, we propose the perception-retained adversarial text selection strategy to align the human text-reading mechanism. Thus, the finally selected adversarial text could be more deceptive. Extensive experiments on various models, including large language models (LLMs) like LLaMA and GPT-3.5, strongly support that VFA outperforms the comparisons by large margins (up to 81%/14% improvements on ASR/SSIM).
https://arxiv.org/abs/2409.05021
Causal language modeling (CLM) serves as the foundational framework underpinning the remarkable successes of recent large language models (LLMs). Despite this success, the next-word-prediction training approach poses a potential risk of causing the model to overly focus on local dependencies within a sentence. While prior studies have proposed predicting the future N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, we introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training on the basis of the N-gram prediction framework. To further enhance the quality of next-word prediction, we propose an ensemble method that incorporates the future N words' prediction results. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over the conventional CLM.
https://arxiv.org/abs/2409.03295
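The N-gram prediction framework adds objectives for the next N words rather than only the next one. The PyTorch sketch below shows one simple way to attach such heads to a causal LM's hidden states; the `NGramHeads` module and its loss weighting are hypothetical, and the paper's word difference representation (WDR) target is not reproduced.

```python
import torch
import torch.nn as nn

class NGramHeads(nn.Module):
    """Predict the next N tokens from each position with separate linear heads."""

    def __init__(self, hidden_size: int, vocab_size: int, n: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden_size, vocab_size) for _ in range(n))

    def loss(self, hidden, targets):
        """hidden: (batch, seq, hidden) LM states; targets: (batch, seq) token ids."""
        ce = nn.CrossEntropyLoss()
        total = 0.0
        for offset, head in enumerate(self.heads, start=1):
            if hidden.size(1) <= offset:
                break
            logits = head(hidden[:, :-offset])          # predict the token at +offset
            gold = targets[:, offset:]
            total = total + ce(logits.reshape(-1, logits.size(-1)), gold.reshape(-1))
        return total / len(self.heads)

# Toy usage with random hidden states standing in for a causal LM's outputs.
heads = NGramHeads(hidden_size=16, vocab_size=100, n=3)
h = torch.randn(2, 10, 16)
t = torch.randint(0, 100, (2, 10))
print(heads.loss(h, t))
```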
The main task of a KGQA system (Knowledge Graph Question Answering) is to convert user input questions into query syntax (such as SPARQL). With the rise of modern popular encoders and decoders like Transformer and ConvS2S, many scholars have shifted the research direction of SPARQL generation to the Neural Machine Translation (NMT) architecture or the generative AI field of Text-to-SPARQL. In NMT-based QA systems, the system treats knowledge base query syntax as a language and uses NMT-based translation models to translate natural language questions into query syntax. Scholars use popular architectures equipped with cross-attention, such as Transformer, ConvS2S, and BiLSTM, to train translation models for query syntax. To achieve better query results, this paper improves the ConvS2S encoder and adds multi-head attention from the Transformer, proposing a Multi-Head Conv encoder (MHC encoder) based on the n-gram language model. The principle is to use convolutional layers with different receptive fields to capture local hidden features in the input sequence and multi-head attention to calculate the dependencies between them. Ultimately, we found that the translation model based on the Multi-Head Conv encoder achieved better performance than other encoders, obtaining 76.52\% and 83.37\% BLEU-1 (BiLingual Evaluation Understudy) on the QALD-9 and LC-QuAD-1.0 datasets, respectively. Additionally, in the end-to-end system experiments on the QALD-9 and LC-QuAD-1.0 datasets, we achieved leading results over other KGQA systems, with Macro F1-measures reaching 52\% and 66\%, respectively. Moreover, the experimental results show that, with limited computational resources, an excellent encoder-decoder architecture with cross-attention allows researchers to achieve performance comparable to large pre-trained models using only general embeddings.
https://arxiv.org/abs/2408.13432
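The Multi-Head Conv encoder combines convolutions with different receptive fields (the n-gram views) and Transformer-style multi-head attention over the resulting features. The module below is a compact PyTorch illustration of that combination, with arbitrary kernel sizes and dimensions; it is not the authors' released architecture.

```python
import torch
import torch.nn as nn

class MultiHeadConvEncoderLayer(nn.Module):
    """Convolutions with different kernel sizes (n-gram views) fused by self-attention."""

    def __init__(self, d_model: int = 256, kernel_sizes=(1, 3, 5), n_heads: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(d_model, d_model, k, padding=k // 2) for k in kernel_sizes
        )
        self.proj = nn.Linear(d_model * len(kernel_sizes), d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        c = x.transpose(1, 2)                  # Conv1d expects (batch, d_model, seq)
        local = torch.cat([conv(c) for conv in self.convs], dim=1).transpose(1, 2)
        local = self.proj(local)               # fuse the different receptive fields
        attended, _ = self.attn(local, local, local)
        return self.norm(x + attended)

layer = MultiHeadConvEncoderLayer()
print(layer(torch.randn(2, 12, 256)).shape)    # torch.Size([2, 12, 256])
```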
Neural Machine Translation (NMT) systems struggle when translating to and from low-resource languages, which lack large-scale data corpora for models to use for training. As manual data curation is expensive and time-consuming, we propose utilizing a generative-adversarial network (GAN) to augment low-resource language data. When training on a very small amount of language data (under 20,000 sentences) in a simulated low-resource setting, our model shows potential at data augmentation, generating monolingual language data with sentences such as "ask me that healthy lunch im cooking up," and "my grandfather work harder than your grandfather before." Our novel data augmentation approach takes the first step in investigating the capability of GANs in low-resource NMT, and our results suggest that there is promise for future extension of GANs to low-resource NMT.
https://arxiv.org/abs/2409.00071
Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource translation still lags significantly behind Neural Machine Translation (NMT) models. In this paper, we explore what it would take to adapt LLMs for low-resource settings. In particular, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has been shown to be less important for MT using LLMs than in previous MT research. Similarly, diversity during SFT has been shown to promote significant transfer in LLMs across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both of these considerations: a) parallel data is critical during both pretraining and SFT, and b) diversity tends to cause interference, not transfer. Our experiments, conducted with 3 LLMs across 2 low-resourced language groups - indigenous American and North-East Indian - reveal consistent patterns in both cases, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve lower-resource languages.
https://arxiv.org/abs/2408.12780
Back translation, as a technique for extending a dataset, is widely used by researchers in low-resource language translation tasks. It typically translates from the target to the source language to ensure high-quality translation results. This paper proposes a novel way of utilizing a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings. We realize this concept by employing a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic monolingual translations with the generator. Additionally, this paper integrates Translation Memory (TM) with NMT, increasing the amount of data available to the generator. Moreover, we propose a novel procedure to filter the synthetic sentence pairs during the augmentation process, ensuring the high quality of the data.
https://arxiv.org/abs/2408.12079
Recent advancements in neural machine translation (NMT) have revolutionized the field, yet the dependency on extensive parallel corpora limits progress for low-resource languages. Cross-lingual transfer learning offers a promising solution by utilizing data from high-resource languages but often struggles with in-domain NMT. In this paper, we investigate three pivotal aspects: enhancing the domain-specific quality of NMT by fine-tuning domain-relevant data from different language pairs, identifying which domains are transferable in zero-shot scenarios, and assessing the impact of language-specific versus domain-specific factors on adaptation effectiveness. Using English as the source language and Spanish for fine-tuning, we evaluate multiple target languages including Portuguese, Italian, French, Czech, Polish, and Greek. Our findings reveal significant improvements in domain-specific translation quality, especially in specialized fields such as medical, legal, and IT, underscoring the importance of well-defined domain data and transparency of the experiment setup in in-domain transfer learning.
https://arxiv.org/abs/2408.11926
Standard Neural Machine Translation (NMT) models have traditionally been trained with Sinusoidal Positional Embeddings (PEs), which are inadequate for capturing long-range dependencies and are inefficient for long-context or document-level translation. In contrast, state-of-the-art large language models (LLMs) employ relative PEs, demonstrating superior length generalization. This work explores the potential for efficiently switching the Positional Embeddings of pre-trained NMT models from absolute sinusoidal PEs to relative approaches such as RoPE and ALiBi. Our findings reveal that sinusoidal PEs can be effectively replaced with RoPE and ALiBi with negligible or no performance loss, achieved by fine-tuning on a small fraction of high-quality data. Additionally, models trained without Positional Embeddings (NoPE) are not a viable solution for Encoder-Decoder architectures, as they consistently under-perform compared to models utilizing any form of Positional Embedding. Furthermore, even a model trained from scratch with these relative PEs slightly under-performs a fine-tuned model, underscoring the efficiency and validity of our hypothesis.
https://arxiv.org/abs/2408.11382
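Switching a model from sinusoidal positions to a relative scheme such as ALiBi mostly means adding a distance-proportional bias to the attention logits. The helper below builds that bias with the usual powers-of-two slopes; it is a generic illustration of ALiBi rather than the paper's fine-tuning setup.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Additive attention bias of shape (n_heads, seq_len, seq_len).

    Each head penalizes attention to distant past keys linearly, with the
    head-specific slopes 2**(-8*(h+1)/n_heads) used in the ALiBi paper.
    """
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(max=0).float()  # key_pos - query_pos, <= 0
    return slopes[:, None, None] * rel[None, :, :]

bias = alibi_bias(n_heads=4, seq_len=6)
print(bias.shape)  # torch.Size([4, 6, 6]); added to the attention logits before softmax
```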
The deep learning language of choice these days is Python; measured by factors such as available libraries and technical support, it is hard to beat. At the same time, software written in lower-level programming languages like C++ retain advantages in speed. We describe a Python interface to Marian NMT, a C++-based training and inference toolkit for sequence-to-sequence models, focusing on machine translation. This interface enables models trained with Marian to be connected to the rich, wide range of tools available in Python. A highlight of the interface is the ability to compute state-of-the-art COMET metrics from Python but using Marian's inference engine, with a speedup factor of up to 7.8$\times$ the existing implementations. We also briefly spotlight a number of other integrations, including Jupyter notebooks, connection with prebuilt models, and a web app interface provided with the package. PyMarian is available in PyPI via $\texttt{pip install pymarian}$.
https://arxiv.org/abs/2408.11853