While neural machine translation (NMT) models achieve success in our daily lives, they show vulnerability to adversarial attacks. Although harmful, these attacks also offer benefits for interpreting and enhancing NMT models, and have thus drawn increased research attention. However, existing studies of adversarial attacks are insufficient in both attacking ability and human imperceptibility because they focus solely on the language modality. This paper proposes a novel vision-fused attack (VFA) framework to acquire powerful adversarial text, i.e., text that is more aggressive and stealthy. Regarding attacking ability, we design the vision-merged solution space enhancement strategy to enlarge the limited semantic solution space, which enables us to search for adversarial candidates with higher attacking ability. For human imperceptibility, we propose the perception-retained adversarial text selection strategy to align with the human text-reading mechanism, so that the finally selected adversarial text can be more deceptive. Extensive experiments on various models, including large language models (LLMs) such as LLaMA and GPT-3.5, strongly support that VFA outperforms comparison methods by large margins (up to 81%/14% improvements in ASR/SSIM).
https://arxiv.org/abs/2409.05021
Causal language modeling (CLM) serves as the foundational framework underpinning the remarkable successes of recent large language models (LLMs). Despite this success, the next-word-prediction training objective poses a potential risk of causing the model to focus overly on local dependencies within a sentence. While prior studies have introduced methods that predict the next N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, on the basis of this N-gram prediction framework, we introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training. To further enhance the quality of next-word prediction, we propose an ensemble method that incorporates the prediction results for the future N words. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over conventional CLM.
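Multi-token prediction is straightforward to sketch. Below is a minimal, hypothetical PyTorch head that trains a causal LM to predict the next N words from every position; the layer names and shapes are illustrative, and the paper's WDR targets and ensemble decoding are not reproduced here.

```python
import torch.nn as nn


class NGramPredictionHead(nn.Module):
    """Minimal N-gram prediction sketch: one linear head per future offset."""

    def __init__(self, hidden_size: int, vocab_size: int, n: int = 3):
        super().__init__()
        self.n = n
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(n)
        )

    def forward(self, hidden, labels):
        # hidden: (batch, seq_len, hidden_size); labels: (batch, seq_len)
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        total = 0.0
        for k, head in enumerate(self.heads):
            if k + 1 >= labels.size(1):
                break
            logits = head(hidden)                      # predict the token at offset k + 1
            shifted = labels.new_full(labels.shape, -100)
            shifted[:, : -(k + 1)] = labels[:, k + 1 :]
            total = total + loss_fn(
                logits.reshape(-1, logits.size(-1)), shifted.reshape(-1)
            )
        return total / self.n
```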
https://arxiv.org/abs/2409.03295
The main task of a KGQA system (Knowledge Graph Question Answering) is to convert user input questions into query syntax (such as SPARQL). With the rise of modern popular encoders and decoders like Transformer and ConvS2S, many scholars have shifted the research direction of SPARQL generation to the Neural Machine Translation (NMT) architecture or the generative-AI field of Text-to-SPARQL. In NMT-based QA systems, the system treats knowledge base query syntax as a language and uses NMT-based translation models to translate natural language questions into query syntax. Scholars use popular architectures equipped with cross-attention, such as Transformer, ConvS2S, and BiLSTM, to train translation models for query syntax. To achieve better query results, this paper improves the ConvS2S encoder and adds multi-head attention from the Transformer, proposing a Multi-Head Conv encoder (MHC encoder) based on the n-gram language model. The principle is to use convolutional layers with different receptive fields to capture local hidden features in the input sequence, and multi-head attention to calculate the dependencies between them. Ultimately, we found that the translation model based on the Multi-Head Conv encoder achieved better performance than other encoders, obtaining 76.52\% and 83.37\% BLEU-1 (BiLingual Evaluation Understudy) on the QALD-9 and LC-QuAD-1.0 datasets, respectively. Additionally, in end-to-end system experiments on the QALD-9 and LC-QuAD-1.0 datasets, we achieved leading results over other KGQA systems, with Macro F1-measures reaching 52\% and 66\%, respectively. Moreover, the experimental results show that, even with limited computational resources, an excellent encoder-decoder architecture with cross-attention allows researchers to achieve performance comparable to large pre-trained models using only general embeddings.
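A rough sketch of the multi-head convolutional idea in PyTorch, assuming rather than reproducing the paper's exact design: parallel convolutions with different kernel sizes capture local n-gram features, and multi-head attention then models the dependencies between them.

```python
import torch.nn as nn


class MultiHeadConvEncoderLayer(nn.Module):
    """Illustrative MHC-style layer: multi-receptive-field convolutions
    followed by multi-head self-attention. Kernel sizes, dimensions and the
    way branches are merged are assumptions."""

    def __init__(self, d_model: int = 512, kernel_sizes=(1, 3, 5), num_heads: int = 8):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(d_model, d_model, k, padding=k // 2) for k in kernel_sizes
        )
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)                   # (batch, d_model, seq_len) for Conv1d
        local = sum(conv(h) for conv in self.convs).transpose(1, 2)
        attended, _ = self.attn(local, local, local)
        return self.norm(x + attended)          # residual connection + LayerNorm
```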
https://arxiv.org/abs/2408.13432
Neural Machine Translation (NMT) systems struggle when translating to and from low-resource languages, which lack large-scale data corpora for models to use for training. As manual data curation is expensive and time-consuming, we propose utilizing a generative-adversarial network (GAN) to augment low-resource language data. When training on a very small amount of language data (under 20,000 sentences) in a simulated low-resource setting, our model shows potential at data augmentation, generating monolingual language data with sentences such as "ask me that healthy lunch im cooking up," and "my grandfather work harder than your grandfather before." Our novel data augmentation approach takes the first step in investigating the capability of GANs in low-resource NMT, and our results suggest that there is promise for future extension of GANs to low-resource NMT.
https://arxiv.org/abs/2409.00071
Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource translation still lags significantly behind Neural Machine Translation (NMT) models. In this paper, we explore what it would take to adapt LLMs for low-resource settings. In particular, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has been shown to be less important for MT using LLMs than in previous MT research. Similarly, diversity during SFT has been shown to promote significant transfer in LLMs across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both of these considerations: a) parallel data is critical during both pretraining and SFT, and b) diversity tends to cause interference, not transfer. Our experiments, conducted with 3 LLMs across 2 low-resourced language groups - indigenous American and North-East Indian - reveal consistent patterns in both cases, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve lower-resource languages.
https://arxiv.org/abs/2408.12780
Back translation, as a technique for extending a dataset, is widely used by researchers in low-resource language translation tasks. It typically translates from the target to the source language to ensure high-quality translation results. This paper proposes a novel way of utilizing a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings. We realize this concept by employing a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic monolingual translations with the generator. Additionally, this paper integrates Translation Memory (TM) with NMT, increasing the amount of data available to the generator. Moreover, we propose a novel procedure to filter the synthetic sentence pairs during the augmentation process, ensuring the high quality of the data.
https://arxiv.org/abs/2408.12079
Recent advancements in neural machine translation (NMT) have revolutionized the field, yet the dependency on extensive parallel corpora limits progress for low-resource languages. Cross-lingual transfer learning offers a promising solution by utilizing data from high-resource languages but often struggles with in-domain NMT. In this paper, we investigate three pivotal aspects: enhancing the domain-specific quality of NMT by fine-tuning domain-relevant data from different language pairs, identifying which domains are transferable in zero-shot scenarios, and assessing the impact of language-specific versus domain-specific factors on adaptation effectiveness. Using English as the source language and Spanish for fine-tuning, we evaluate multiple target languages including Portuguese, Italian, French, Czech, Polish, and Greek. Our findings reveal significant improvements in domain-specific translation quality, especially in specialized fields such as medical, legal, and IT, underscoring the importance of well-defined domain data and transparency of the experiment setup in in-domain transfer learning.
https://arxiv.org/abs/2408.11926
Standard Neural Machine Translation (NMT) models have traditionally been trained with Sinusoidal Positional Embeddings (PEs), which are inadequate for capturing long-range dependencies and are inefficient for long-context or document-level translation. In contrast, state-of-the-art large language models (LLMs) employ relative PEs, demonstrating superior length generalization. This work explores the potential for efficiently switching the Positional Embeddings of pre-trained NMT models from absolute sinusoidal PEs to relative approaches such as RoPE and ALiBi. Our findings reveal that sinusoidal PEs can be effectively replaced with RoPE and ALiBi with negligible or no performance loss, achieved by fine-tuning on a small fraction of high-quality data. Additionally, models trained without Positional Embeddings (NoPE) are not a viable solution for Encoder-Decoder architectures, as they consistently under-perform compared to models utilizing any form of Positional Embedding. Furthermore, even a model trained from scratch with these relative PEs slightly under-performs a fine-tuned model, underscoring the efficiency and validity of our hypothesis.
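For reference, the relative scheme itself is simple. The sketch below builds the standard ALiBi bias (using Press et al.'s slope schedule for power-of-two head counts); a causal mask is still applied separately, and swapping this bias into a trained NMT model would still require the small fine-tuning step the abstract describes.

```python
import torch


def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear bias added to attention logits before the softmax."""
    # Standard ALiBi slopes for power-of-two head counts:
    # 2^(-8/num_heads), 2^(-16/num_heads), ..., 2^(-8).
    slopes = torch.tensor(
        [2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    )
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]           # j - i, negative for past keys
    return slopes[:, None, None] * distance[None]    # (num_heads, seq_len, seq_len)
```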
https://arxiv.org/abs/2408.11382
The deep learning language of choice these days is Python; measured by factors such as available libraries and technical support, it is hard to beat. At the same time, software written in lower-level programming languages like C++ retains advantages in speed. We describe a Python interface to Marian NMT, a C++-based training and inference toolkit for sequence-to-sequence models, focusing on machine translation. This interface enables models trained with Marian to be connected to the rich, wide range of tools available in Python. A highlight of the interface is the ability to compute state-of-the-art COMET metrics from Python while using Marian's inference engine, with a speedup of up to 7.8$\times$ over existing implementations. We also briefly spotlight a number of other integrations, including Jupyter notebooks, connection with prebuilt models, and a web app interface provided with the package. PyMarian is available on PyPI via $\texttt{pip install pymarian}$.
https://arxiv.org/abs/2408.11853
Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT'23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT'23 training dataset. We also find that performing self-distillation by fine-tuning the LLM which generated this dataset outperforms the LLM's strong few-shot baseline. These findings corroborate the quality of our dataset and demonstrate the value of high-quality machine-generated data in improving the performance of NMT models.
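MBR decoding itself is easy to sketch: sample several candidate translations, score each against the others with a utility metric, and keep the candidate with the highest expected utility. The `utility` callable below is a placeholder (e.g. a sentence-level metric); the paper's exact metric and its QE reranking stage are not shown.

```python
from typing import Callable, List


def mbr_decode(candidates: List[str], utility: Callable[[str, str], float]) -> str:
    """Pick the candidate with the highest total utility when the other
    samples are treated as pseudo-references."""
    best, best_score = candidates[0], float("-inf")
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        if score > best_score:
            best, best_score = hyp, score
    return best
```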
https://arxiv.org/abs/2408.06537
Beam search decoding is the de-facto method for decoding auto-regressive Neural Machine Translation (NMT) models, including multilingual NMT where the target language is specified as an input. However, decoding multilingual NMT models commonly produces ``off-target'' translations -- yielding translation outputs not in the intended language. In this paper, we first conduct an error analysis of off-target translations for a strong multilingual NMT model and identify how these decodings are produced during beam search. We then propose Language-informed Beam Search (LiBS), a general decoding algorithm incorporating an off-the-shelf Language Identification (LiD) model into beam search decoding to reduce off-target translations. LiBS is an inference-time procedure that is NMT-model agnostic and does not require any additional parallel data. Results show that our proposed LiBS algorithm on average improves +1.1 BLEU and +0.9 BLEU on WMT and OPUS datasets, and reduces off-target rates from 22.9\% to 7.7\% and 65.8\% to 25.3\% respectively.
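A simplified sketch of the idea: combine the NMT score of each hypothesis with an off-the-shelf LID model's log-probability that the hypothesis is in the intended target language. The real LiBS algorithm applies this inside beam search at each expansion step; the signature and interpolation weight below are illustrative assumptions.

```python
from typing import Callable, List, Tuple


def lid_rescore(
    hypotheses: List[Tuple[str, float]],          # (text, NMT log-probability)
    lid_logprob: Callable[[str, str], float],     # log P(lang | text) from a LID model
    target_lang: str,
    weight: float = 1.0,
) -> List[Tuple[str, float]]:
    """Re-rank beam hypotheses by a weighted sum of NMT and LID scores."""
    rescored = [
        (text, nmt_score + weight * lid_logprob(text, target_lang))
        for text, nmt_score in hypotheses
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```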
https://arxiv.org/abs/2408.05738
In recent years, neural machine translation (NMT) has been widely used in everyday life. However, current NMT lacks a mechanism to adjust the difficulty level of translations to match the user's language level. Additionally, due to bias in the NMT training data, translations of simple source sentences are often produced with complex words. In particular, this could pose a problem for children, who may not be able to understand the meaning of the translations correctly. In this study, we propose a method that replaces words with a high Age of Acquisition (AoA) in translations with simpler words, to match the translations to the user's level. We achieve this by using large language models (LLMs), providing them with a triple of a source sentence, a translation, and a target word to be replaced. We create a benchmark dataset using back-translation on Simple English Wikipedia. The experimental results obtained from the dataset show that our method effectively replaces high-AoA words with lower-AoA words and, moreover, can iteratively replace most of the high-AoA words while still maintaining high BLEU and COMET scores.
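The prompting setup can be illustrated with a hypothetical template for the (source sentence, translation, target word) triple; the wording below is an assumption, not the paper's actual prompt. Iterating this call over the remaining high-AoA words gives the iterative replacement the abstract describes.

```python
def build_simplification_prompt(source: str, translation: str, target_word: str) -> str:
    """Hypothetical prompt asking an LLM to replace one high-AoA word."""
    return (
        "You are simplifying a translation for a young reader.\n"
        f"Source sentence: {source}\n"
        f"Current translation: {translation}\n"
        f"Replace the word '{target_word}' with a simpler word of the same meaning, "
        "keeping the rest of the sentence unchanged. Return only the revised translation."
    )
```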
https://arxiv.org/abs/2408.04217
This practical experience report explores Neural Machine Translation (NMT) models' capability to generate offensive security code from natural language (NL) descriptions, highlighting the significance of contextual understanding and its impact on model performance. Our study employs a dataset comprising real shellcodes to evaluate the models across various scenarios, including missing information, necessary context, and unnecessary context. The experiments are designed to assess the models' resilience against incomplete descriptions, their proficiency in leveraging context for enhanced accuracy, and their ability to discern irrelevant information. The findings reveal that the introduction of contextual data significantly improves performance. However, the benefits of additional context diminish beyond a certain point, indicating an optimal level of contextual information for model training. Moreover, the models demonstrate an ability to filter out unnecessary context, maintaining high levels of accuracy in the generation of offensive security code. This study paves the way for future research on optimizing context use in AI-driven code generation, particularly for applications requiring a high degree of technical precision such as the generation of offensive code.
https://arxiv.org/abs/2408.02402
Representing symbolic music with compound tokens, where each token consists of several different sub-tokens representing a distinct musical feature or attribute, offers the advantage of reducing sequence length. While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results as it may not fully capture the interdependencies between them. We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage. The NMT consists of two transformers: the main decoder that models a sequence of compound tokens and the sub-decoder for modeling sub-tokens of each compound token. The experiment results showed that applying the NMT to compound tokens can enhance the performance in terms of better perplexity in processing various symbolic music datasets and discrete audio tokens from the MAESTRO dataset.
https://arxiv.org/abs/2408.01180
Nearest neighbor machine translation is a successful approach to fast domain adaptation, which interpolates a pre-trained transformer with domain-specific token-level k-nearest-neighbor (kNN) retrieval without retraining. Despite kNN MT's success, searching a large reference corpus and using a fixed interpolation between the kNN and pre-trained model distributions lead to computational-complexity and translation-quality challenges. Among other works, Dai et al. proposed methods to dynamically obtain a small number of reference samples, for which they introduced a distance-aware interpolation method using an equation with free parameters. This paper proposes a simply trainable nearest neighbor machine translation method and carries out inference experiments on GPU. Similar to Dai et al., we first adaptively construct a small datastore for each input sentence. Second, we train a single-layer network for the interpolation coefficient between the kNN-MT and pre-trained results to automatically interpolate across different domains. Experimental results on different domains show that our proposed method either improves on or maintains the translation quality of the methods of Dai et al. while being automatic. In addition, our GPU inference results demonstrate that kNN-MT can be integrated into GPUs with a drop of only 5% in speed.
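The trainable interpolation can be sketched as a single linear layer plus a sigmoid that mixes the kNN and NMT distributions per decoding step; the input features (e.g. retrieval distances) and dimensions are assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn


class InterpolationGate(nn.Module):
    """Single-layer gate producing the kNN/NMT mixing coefficient."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.proj = nn.Linear(feature_dim, 1)

    def forward(self, features, p_knn, p_nmt):
        # features: (batch, feature_dim), e.g. top-k retrieval distances
        # p_knn, p_nmt: (batch, vocab_size) probability distributions
        lam = torch.sigmoid(self.proj(features))       # (batch, 1), learned per step
        return lam * p_knn + (1.0 - lam) * p_nmt
```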
https://arxiv.org/abs/2407.19965
Applying differential privacy (DP) by means of the DP-SGD algorithm to protect individual data points during training is becoming increasingly popular in NLP. However, the choice of granularity at which DP is applied is often neglected. For example, neural machine translation (NMT) typically operates on the sentence-level granularity. From the perspective of DP, this setup assumes that each sentence belongs to a single person and any two sentences in the training dataset are independent. This assumption is however violated in many real-world NMT datasets, e.g. those including dialogues. For proper application of DP we thus must shift from sentences to entire documents. In this paper, we investigate NMT at both the sentence and document levels, analyzing the privacy/utility trade-off for both scenarios, and evaluating the risks of not using the appropriate privacy granularity in terms of leaking personally identifiable information (PII). Our findings indicate that the document-level NMT system is more resistant to membership inference attacks, emphasizing the significance of using the appropriate granularity when working with DP.
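The granularity shift amounts to changing what counts as one privacy unit. A minimal sketch, assuming each sentence pair carries a document id: group the pairs so that DP-SGD's per-example clipping and noising are applied once per document rather than once per sentence (the DP-SGD training loop itself is not shown).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_document(
    sentence_pairs: List[Tuple[str, str, str]],   # (doc_id, source, target)
) -> Dict[str, List[Tuple[str, str]]]:
    """Build document-level privacy units for DP-SGD training."""
    docs: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for doc_id, source, target in sentence_pairs:
        docs[doc_id].append((source, target))
    return dict(docs)                              # one entry = one DP unit
```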
https://arxiv.org/abs/2407.18789
This paper studies gender bias in machine translation through the lens of Large Language Models (LLMs). Four widely-used test sets are employed to benchmark various base LLMs, comparing their translation quality and gender bias against state-of-the-art Neural Machine Translation (NMT) models for English to Catalan (En $\rightarrow$ Ca) and English to Spanish (En $\rightarrow$ Es) translation directions. Our findings reveal pervasive gender bias across all models, with base LLMs exhibiting a higher degree of bias compared to NMT models. To combat this bias, we explore prompting engineering techniques applied to an instruction-tuned LLM. We identify a prompt structure that significantly reduces gender bias by up to 12% on the WinoMT evaluation dataset compared to more straightforward prompts. These results significantly reduce the gender bias accuracy gap between LLMs and traditional NMT systems.
https://arxiv.org/abs/2407.18786
Imposing constraints on machine translation systems presents a challenging issue because these systems are not trained to make use of constraints in generating adequate, fluent translations. In this paper, we leverage the capabilities of large language models (LLMs) for constrained translation, given that LLMs can easily adapt to this task by taking translation instructions and constraints as prompts. However, LLMs cannot always guarantee the adequacy of translations and, in some cases, ignore the given constraints. This is in part because LLMs might be overly confident in their predictions, overriding the influence of the constraints. To overcome this overriding behaviour, we propose to add a revision process that encourages LLMs to correct their outputs by prompting them about the constraints that have not yet been met. We evaluate our approach on four constrained translation tasks, encompassing both lexical and structural constraints in multiple constraint domains. Experiments show a 15\% improvement in constraint-based translation accuracy over standard LLMs, and the approach also significantly outperforms state-of-the-art neural machine translation (NMT) methods.
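The revision process can be sketched as a check-and-reprompt loop; the prompt wording and the `generate` callable (an LLM request) are illustrative assumptions, not the paper's exact prompts.

```python
from typing import Callable, List


def translate_with_revision(
    source: str,
    constraints: List[str],
    generate: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    """Ask for a constrained translation, then re-prompt about unmet constraints."""
    output = generate(
        f"Translate the following sentence so that the translation contains "
        f"the terms {constraints}:\n{source}"
    )
    for _ in range(max_rounds):
        missing = [c for c in constraints if c not in output]
        if not missing:
            break
        output = generate(
            f"The translation below is missing the required terms {missing}. "
            f"Revise it so that all of them appear, changing as little as possible.\n"
            f"Source: {source}\nTranslation: {output}"
        )
    return output
```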
https://arxiv.org/abs/2407.13164
Machine translation is indispensable in healthcare for enabling the global dissemination of medical knowledge across languages. However, complex medical terminology poses unique challenges to achieving adequate translation quality and accuracy. This study introduces a novel "LLMs-in-the-loop" approach to develop supervised neural machine translation models optimized specifically for medical texts. While large language models (LLMs) have demonstrated powerful capabilities, this research shows that small, specialized models trained on high-quality in-domain (mostly synthetic) data can outperform even vastly larger LLMs. Custom parallel corpora in six languages were compiled from scientific articles, synthetically generated clinical documents, and medical texts. Our LLMs-in-the-loop methodology employs synthetic data generation, rigorous evaluation, and agent orchestration to enhance performance. We developed small medical translation models using the MarianMT base model. We introduce a new medical translation test dataset to standardize evaluation in this domain. Assessed using BLEU, METEOR, ROUGE, and BERT scores on this test set, our MarianMT-based models outperform Google Translate, DeepL, and GPT-4-Turbo. Results demonstrate that our LLMs-in-the-loop approach, combined with fine-tuning high-quality, domain-specific data, enables specialized models to outperform general-purpose and some larger systems. This research, part of a broader series on expert small models, paves the way for future healthcare-related AI developments, including deidentification and bio-medical entity extraction models. Our study underscores the potential of tailored neural translation models and the LLMs-in-the-loop methodology to advance the field through improved data generation, evaluation, agent, and modeling techniques.
https://arxiv.org/abs/2407.12126
Pre-trained large language models (LLM) are starting to be widely used in many applications. In this work, we explore the use of these models in interactive machine translation (IMT) environments. In particular, we have chosen mBART (multilingual Bidirectional and Auto-Regressive Transformer) and mT5 (multilingual Text-to-Text Transfer Transformer) as the LLMs to perform our experiments. The system generates perfect translations interactively using the feedback provided by the user at each iteration. The Neural Machine Translation (NMT) model generates a preliminary hypothesis with the feedback, and the user validates new correct segments and performs a word correction--repeating the process until the sentence is correctly translated. We compared the performance of mBART, mT5, and a state-of-the-art (SoTA) machine translation model on a benchmark dataset regarding user effort, Word Stroke Ratio (WSR), Key Stroke Ratio (KSR), and Mouse Action Ratio (MAR). The experimental results indicate that mBART performed comparably with SoTA models, suggesting that it is a viable option for this field of IMT. The implications of this finding extend to the development of new machine translation models for interactive environments, as it indicates that some novel pre-trained models exhibit SoTA performance in this domain, highlighting the potential benefits of adapting these models to specific needs.
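The prefix-based interaction protocol behind these effort metrics can be sketched with a simulated user; `generate` stands in for the NMT/LLM system (assumed to respect the validated prefix), and WSR is computed here as word corrections divided by reference length.

```python
from typing import Callable, List, Tuple


def simulate_imt(
    source: str,
    reference: List[str],                                # simulated user's target words
    generate: Callable[[str, List[str]], List[str]],     # (source, prefix) -> full hypothesis
) -> Tuple[str, float]:
    """Prefix-based IMT loop: accept the longest correct prefix, correct one word,
    regenerate, and repeat until the hypothesis matches the reference."""
    prefix: List[str] = []
    corrections = 0
    while prefix != reference:
        hyp = generate(source, prefix)
        i = len(prefix)
        while i < len(reference) and i < len(hyp) and hyp[i] == reference[i]:
            i += 1                                       # extend the validated prefix
        prefix = reference[:i]
        if prefix != reference:
            prefix = reference[: i + 1]                  # user types the next correct word
            corrections += 1
    wsr = corrections / max(len(reference), 1)           # Word Stroke Ratio
    return " ".join(prefix), wsr
```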
https://arxiv.org/abs/2407.06990