Neural machine translation (NMT) has shown impressive performance when trained on large-scale corpora. However, generic NMT systems have demonstrated poor performance on out-of-domain translation. To mitigate this issue, several domain adaptation methods have recently been proposed, which often lead to better translation quality than generic NMT systems. While there has been some continuous progress in NMT for English and other European languages, domain adaptation in Arabic has received little attention in the literature. The current study, therefore, aims to explore the effectiveness of domain-specific adaptation for Arabic MT (AMT) in a yet unexplored domain: financial news articles. To this end, we carefully developed a parallel corpus for Arabic-English (AR-EN) translation in the financial domain to benchmark different domain adaptation methods. We then fine-tuned several pre-trained NMT and large language models, including ChatGPT-3.5 Turbo, on our dataset. The results showed that fine-tuning is successful using just a few well-aligned in-domain AR-EN segments. The quality of ChatGPT's translations was superior to that of the other models based on automatic and human evaluations. To the best of our knowledge, this is the first work on fine-tuning ChatGPT for financial-domain transfer learning. To contribute to research in domain translation, we made our datasets and fine-tuned models available at this https URL.
神经机器翻译(NMT)在大规模语料库上训练时表现出令人印象深刻的性能。然而,通用NMT系统在域外翻译上表现较差。为缓解这一问题,近年来提出了多种领域适应方法,这些方法通常能带来比通用NMT系统更好的翻译质量。虽然英语及其他欧洲语言的NMT取得了持续进展,但阿拉伯语的领域适应在文献中很少受到关注。因此,本研究旨在探索阿拉伯语机器翻译(AMT)在金融新闻文章这一尚未探索的领域中进行领域特定适应的有效性。为此,我们精心构建了金融领域的阿拉伯语-英语(AR-EN)平行语料库,用于对不同领域适应方法进行基准测试。随后,我们在该数据集上微调了多个预训练NMT模型和大型语言模型,包括 ChatGPT-3.5 Turbo。结果表明,仅使用少量对齐良好的领域内 AR-EN 句段即可成功完成微调。基于自动和人工评估,ChatGPT 的翻译质量优于其他模型。据我们所知,这是首个针对金融领域迁移学习微调 ChatGPT 的工作。为促进领域翻译研究,我们已在此 https URL 公开数据集和微调后的模型。
https://arxiv.org/abs/2309.12863
While resources for the English language are fairly sufficient to understand content on social media, similar resources in Arabic are still immature. The main reason the resources in Arabic are insufficient is that Arabic has many dialects in addition to the standard version (MSA). Arabs do not use MSA in their daily communications; rather, they use dialectal versions. Unfortunately, users carry this phenomenon into their use of social media platforms, which in turn has raised an urgent need for building suitable AI models for language-dependent applications. Existing machine translation (MT) systems designed for MSA fail to work well with Arabic dialects. In light of this, it is necessary to adapt to the informal nature of communication on social networks by developing MT systems that can effectively handle the various dialects of Arabic. Unlike MSA, which has seen advanced progress in MT systems, little effort has been exerted to utilize Arabic dialects for MT systems. While a few attempts have been made to build translation datasets for dialectal Arabic, they are domain dependent and are not OSN cultural-language friendly. In this work, we attempt to alleviate these limitations by proposing an online social network-based multidialect Arabic dataset that is crafted by contextually translating English tweets into four Arabic dialects: Gulf, Yemeni, Iraqi, and Levantine. To perform the translation, we followed our proposed guideline framework for content translation, which could be universally applicable for translation between foreign languages and local dialects. We validated the authenticity of our proposed dataset by developing neural MT models for four Arabic dialects. Our results have shown a superior performance of our NMT models trained using our dataset. We believe that our dataset can reliably serve as an Arabic multidialectal translation dataset for informal MT tasks.
虽然英语资源已足以理解社交媒体上的内容,但阿拉伯语的类似资源仍不成熟。阿拉伯语资源不足的主要原因是,除了标准语(MSA)之外,阿拉伯语还有许多方言。阿拉伯人在日常交流中并不使用MSA,而是使用方言。不幸的是,用户把这一现象带入了社交媒体平台,这使得为语言相关应用构建合适的人工智能模型成为迫切需求。为MSA设计的现有机器翻译(MT)系统无法很好地处理阿拉伯语方言。因此,有必要通过开发能够有效处理阿拉伯语各种方言的MT系统,来适应社交网络上非正式的交流方式。与MSA在MT系统方面的显著进展不同,利用阿拉伯语方言构建MT系统的工作很少。虽然已有少数构建阿拉伯语方言翻译数据集的尝试,但它们依赖特定领域,并且对在线社交网络(OSN)的文化语言特点不够友好。在本工作中,我们试图缓解这些限制,提出了一个基于在线社交网络的多方言阿拉伯语数据集,该数据集通过将英语推文按语境翻译成四种阿拉伯语方言构建:海湾方言、也门方言、伊拉克方言和黎凡特方言。为了进行翻译,我们遵循了我们提出的内容翻译指导框架,该框架可普遍适用于外语与当地方言之间的翻译。我们通过为四种阿拉伯语方言开发神经MT模型来验证所提数据集的可靠性。结果表明,使用我们数据集训练的NMT模型表现更优。我们相信,该数据集可以可靠地用作面向非正式MT任务的阿拉伯语多方言翻译数据集。
https://arxiv.org/abs/2309.12137
Resolving semantic ambiguity has long been recognised as a central challenge in the field of machine translation. Recent work on benchmarking translation performance on ambiguous sentences has exposed the limitations of conventional Neural Machine Translation (NMT) systems, which fail to capture many of these cases. Large language models (LLMs) have emerged as a promising alternative, demonstrating comparable performance to traditional NMT models while introducing new paradigms for controlling the target outputs. In this paper, we study the capabilities of LLMs to translate ambiguous sentences containing polysemous words and rare word senses. We also propose two ways to improve the handling of such ambiguity through in-context learning and fine-tuning on carefully curated ambiguous datasets. Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions. Our research provides valuable insights into effectively adapting LLMs for disambiguation during machine translation.
解决语义歧义一直被视为机器翻译领域的核心挑战。最近针对歧义句翻译性能的基准测试工作揭示了传统神经机器翻译(NMT)系统的局限性:它们无法处理许多此类情况。大型语言模型(LLM)已成为一种有前景的替代方案,在表现出与传统NMT模型相当的性能的同时,引入了控制目标输出的新范式。在本文中,我们研究了LLM翻译包含多义词和罕见词义的歧义句的能力。我们还提出了两种改进此类歧义处理的方法:上下文学习,以及在精心整理的歧义数据集上进行微调。实验表明,在五个语言方向中的四个上,我们的方法可以媲美或超越 DeepL 和 NLLB 等最先进的系统。我们的研究为在机器翻译中有效地让LLM进行消歧提供了宝贵的见解。
https://arxiv.org/abs/2309.11668
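To make the in-context learning idea above concrete, the following minimal Python sketch retrieves demonstrations that contain the same polysemous word as the input and assembles them into a translation prompt. The prompt wording, the substring-based retrieval rule, and the English-German direction are illustrative assumptions, not the paper's exact setup.

```python
def disambiguation_prompt(source: str, ambiguous_word: str, example_pool, k: int = 3) -> str:
    """Build an in-context-learning prompt from demonstrations that contain the
    same polysemous word as the input sentence (toy retrieval rule)."""
    demos = [(s, t) for s, t in example_pool if ambiguous_word in s.lower()][:k]
    lines = ["Translate the English sentences into German."]
    for s, t in demos:
        lines.append(f"English: {s}\nGerman: {t}")
    lines.append(f"English: {source}\nGerman:")
    return "\n\n".join(lines)

pool = [
    ("He sat on the bank of the river.", "Er saß am Ufer des Flusses."),
    ("She deposited money at the bank.", "Sie zahlte Geld bei der Bank ein."),
]
print(disambiguation_prompt("The bank was slippery after the rain.", "bank", pool))
```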
Recent research in decoding methods for Natural Language Generation (NLG) tasks has shown that the traditional beam search and greedy decoding algorithms are not optimal, because model probabilities do not always align with human preferences. Stronger decoding methods, including Quality Estimation (QE) reranking and Minimum Bayes' Risk (MBR) decoding, have since been proposed to mitigate the model-perplexity-vs-quality mismatch. While these decoding methods achieve state-of-the-art performance, they are prohibitively expensive to compute. In this work, we propose MBR finetuning and QE finetuning which distill the quality gains from these decoding methods at training time, while using an efficient decoding algorithm at inference time. Using the canonical NLG task of Neural Machine Translation (NMT), we show that even with self-training, these finetuning methods significantly outperform the base model. Moreover, when using an external LLM as a teacher model, these finetuning methods outperform finetuning on human-generated references. These findings suggest new ways to leverage monolingual data to achieve improvements in model quality that are on par with, or even exceed, improvements from human-curated data, while maintaining maximum efficiency during decoding.
最近针对自然语言生成(NLG)任务解码方法的研究表明,传统的束搜索和贪心解码算法并非最优,因为模型概率并不总是与人类偏好一致。此后,人们提出了更强的解码方法,包括质量估计(QE)重排序和最小贝叶斯风险(MBR)解码,以缓解模型困惑度与质量不匹配的问题。虽然这些解码方法取得了最先进的性能,但其计算代价过高。在本工作中,我们提出了 MBR 微调和 QE 微调,在训练阶段蒸馏这些解码方法带来的质量增益,而在推理阶段使用高效的解码算法。以神经机器翻译(NMT)这一典型的NLG任务为例,我们表明,即使采用自训练,这些微调方法也显著优于基础模型。此外,当使用外部LLM作为教师模型时,这些微调方法优于在人工生成的参考译文上进行微调。这些发现提示了利用单语数据的新途径,可在保持解码最高效率的同时,获得与人工整理数据相当甚至更高的模型质量提升。
https://arxiv.org/abs/2309.10966
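For readers unfamiliar with MBR decoding (the procedure whose gains the paper above distills into training), here is a self-contained Python sketch. The unigram-F1 utility is a toy stand-in for the learned quality metrics used in practice, and the candidate list is assumed to come from sampling any NMT model.

```python
from collections import Counter

def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility function: unigram F1 between two strings (a stand-in for the
    neural quality metrics typically used as the MBR utility)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(candidates):
    """Return the candidate with the highest expected utility against all other
    candidates (Monte Carlo approximation of Minimum Bayes Risk decoding)."""
    def expected_utility(c):
        return sum(unigram_f1(c, other) for other in candidates if other is not c)
    return max(candidates, key=expected_utility)

# Candidates would normally be sampled from an NMT model for one source sentence.
cands = ["the cat sat on the mat", "a cat sat on the mat", "the cat is on a mat"]
print(mbr_decode(cands))
```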
Pathology diagnosis based on EEG signals and decoding brain activity holds immense importance in understanding neurological disorders. With the advancement of artificial intelligence methods and machine learning techniques, the potential for accurate data-driven diagnoses and effective treatments has grown significantly. However, applying machine learning algorithms to real-world datasets presents diverse challenges at multiple levels. The scarcity of labelled data, especially in low-data regimes where real patient cohorts are of limited availability due to the high costs of recruitment, underscores the importance of deploying scaling and transfer learning techniques. In this study, we explore a real-world pathology classification task to highlight the effectiveness of data and model scaling and cross-dataset knowledge transfer. We observe varying performance improvements through data scaling, indicating the need for careful evaluation and labelling. Additionally, we identify the challenge of possible negative transfer and emphasize the significance of some key components in overcoming distribution shifts and potential spurious correlations to achieve positive transfer. We see improvement in the performance of the target model on the target (NMT) datasets when using knowledge from the source dataset (TUAB) with only a low amount of labelled data available. Our findings indicate that a small and generic model (e.g. ShallowNet) performs well on a single dataset, whereas a larger model (e.g. TCN) performs better when transferring from and learning on a larger and more diverse dataset.
基于EEG信号的病理诊断和脑活动解码对于理解神经系统疾病具有重要意义。随着人工智能方法和机器学习技术的进步,准确的数据驱动诊断和有效治疗的潜力显著增长。然而,将机器学习算法应用于真实世界数据集会在多个层面带来各种挑战。标注数据的稀缺,尤其是在因招募成本高昂导致真实患者队列有限的低数据场景下,凸显了部署规模化和迁移学习技术的重要性。在本研究中,我们探索了一个真实世界的病理分类任务,以突出数据与模型规模化以及跨数据集知识迁移的有效性。我们观察到数据规模化带来的性能提升程度不一,这表明需要仔细的评估与标注。此外,我们识别了可能出现负迁移的挑战,并强调了克服分布偏移和潜在伪相关、实现正迁移的若干关键因素的重要性。在标注数据较少的情况下,利用源数据集(TUAB)的知识,目标模型在目标(NMT)数据集上的性能得到了提升。我们的发现表明,小而通用的模型(如 ShallowNet)在单一数据集上表现良好,而更大的模型(如 TCN)在迁移以及从更大、更多样的数据集学习时表现更佳。
https://arxiv.org/abs/2309.10910
The text editing tasks, including sentence fusion, sentence splitting and rephrasing, text simplification, and Grammatical Error Correction (GEC), share a common trait of dealing with highly similar input and output sequences. This area of research lies at the intersection of two well-established fields: (i) fully autoregressive sequence-to-sequence approaches commonly used in tasks like Neural Machine Translation (NMT) and (ii) sequence tagging techniques commonly used to address tasks such as Part-of-Speech tagging, Named Entity Recognition (NER), and similar. In the pursuit of a balanced architecture, researchers have come up with numerous imaginative and unconventional solutions, which we discuss in the Related Works section. Our approach to addressing text editing tasks is called RedPenNet and is aimed at reducing the architectural and parametric redundancies present in specific Sequence-To-Edits models, while preserving their semi-autoregressive advantages. Our models achieve an $F_{0.5}$ score of 77.60 on the BEA-2019 (test) benchmark, which can be considered state-of-the-art with the only exception being system combinations, and 67.71 on the UAGEC+Fluency (test) benchmark. This research was conducted in the context of the UNLP 2023 workshop, where it was presented as a paper for the Shared Task in Grammatical Error Correction (GEC) for Ukrainian. This study aims to apply the RedPenNet approach to address the GEC problem in the Ukrainian language.
文本编辑任务,包括句子融合、句子拆分与改写、文本简化和语法错误纠正(GEC),具有处理高度相似的输入和输出序列的共同特征。这一研究方向位于两个成熟领域的交叉点:(i)常用于神经机器翻译(NMT)等任务的完全自回归序列到序列方法,以及(ii)常用于词性标注、命名实体识别(NER)等任务的序列标注技术。在寻求平衡架构的过程中,研究人员提出了许多富有想象力且非常规的解决方案,我们将在相关工作部分讨论。我们处理文本编辑任务的方法称为 RedPenNet,旨在减少特定序列到编辑(Sequence-To-Edits)模型中存在的结构和参数冗余,同时保留其半自回归的优势。我们的模型在 BEA-2019(测试集)上取得了 77.60 的 F_{0.5} 分数,若不计系统组合,可视为最先进水平;在 UAGEC+Fluency(测试集)基准上取得了 67.71 分。本研究是在 UNLP 2023 研讨会的背景下进行的,并作为乌克兰语语法错误纠正(GEC)共享任务的论文在该研讨会上发表。本研究旨在应用 RedPenNet 方法解决乌克兰语的GEC问题。
https://arxiv.org/abs/2309.10898
Decoder-only Large Language Models (LLMs) have demonstrated potential in machine translation (MT), albeit with performance slightly lagging behind traditional encoder-decoder Neural Machine Translation (NMT) systems. However, LLMs offer a unique advantage: the ability to control the properties of the output through prompts. In this study, we harness this flexibility to explore LLaMa's capability to produce gender-specific translations for languages with grammatical gender. Our results indicate that LLaMa can generate gender-specific translations with competitive accuracy and gender bias mitigation when compared to NLLB, a state-of-the-art multilingual NMT system. Furthermore, our experiments reveal that LLaMa's translations are robust, showing significant performance drops when evaluated against opposite-gender references in gender-ambiguous datasets but maintaining consistency in less ambiguous contexts. This research provides insights into the potential and challenges of using LLMs for gender-specific translations and highlights the importance of in-context learning to elicit new tasks in LLMs.
仅解码器的大型语言模型(LLM)已在机器翻译(MT)中展现出潜力,尽管其性能略落后于传统的编码器-解码器神经机器翻译(NMT)系统。然而,LLM 提供了独特的优势:能够通过提示控制输出的属性。在本研究中,我们利用这种灵活性,探索 LLaMa 为具有语法性别的语言生成性别特定翻译的能力。结果表明,与最先进的多语言NMT系统 NLLB 相比,LLaMa 能够以具有竞争力的准确率生成性别特定的翻译,并减轻性别偏见。此外,我们的实验表明 LLaMa 的翻译是稳健的:在性别歧义数据集上用相反性别的参考译文评估时性能显著下降,而在歧义较少的语境中保持一致。本研究为使用LLM进行性别特定翻译的潜力与挑战提供了见解,并强调了通过上下文学习引出LLM新任务的重要性。
https://arxiv.org/abs/2309.03175
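A minimal sketch of how a gender-specific translation could be requested from an LLM through prompting, the kind of output control the abstract above refers to. The prompt template, language pair, and examples are illustrative assumptions rather than the paper's actual prompts.

```python
def gendered_translation_prompt(source: str, gender: str, examples) -> str:
    """Assemble a few-shot prompt asking for a translation in which the speaker
    has the requested gender (illustrative wording, not the paper's prompts)."""
    lines = [f"Translate from English to Spanish, assuming the speaker is {gender}."]
    for src, tgt in examples:
        lines.append(f"English: {src}\nSpanish: {tgt}")
    lines.append(f"English: {source}\nSpanish:")
    return "\n\n".join(lines)

prompt = gendered_translation_prompt(
    "I am tired.",
    "female",
    [("I am happy.", "Estoy contenta.")],
)
print(prompt)  # this string would then be passed to the LLM's usual generation API
```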
Neural Machine Translation (NMT) models have become successful, but their performance remains poor when translating in new domains with a limited amount of data. In this paper, we present Epi-Curriculum, a novel approach to low-resource domain adaptation (DA) that combines a new episodic training framework with denoised curriculum learning. Our episodic training framework enhances the model's robustness to domain shift by episodically exposing the encoder/decoder to an inexperienced decoder/encoder. The denoised curriculum learning filters the noised data and further improves the model's adaptability by gradually guiding the learning process from easy to more difficult tasks. Experiments on English-German and English-Romanian translation show that: (i) Epi-Curriculum improves the model's robustness and adaptability in both seen and unseen domains; (ii) our episodic training framework enhances the encoder's and decoder's robustness to domain shift.
神经机器翻译(NMT)模型已经取得成功,但在数据量有限的新领域上翻译时,其性能仍然不佳。在本文中,我们提出了一种名为 Epi-Curriculum 的新方法来解决低资源领域适应(DA)问题,该方法包含一个新的情景式(episodic)训练框架以及去噪课程学习。我们的情景式训练框架通过周期性地让编码器/解码器面对缺乏经验的解码器/编码器,增强模型对领域偏移的鲁棒性。去噪课程学习过滤噪声数据,并通过从简单任务到更难任务逐步引导学习过程,进一步提升模型的适应能力。在英语-德语和英语-罗马尼亚语翻译上的实验表明:(i)Epi-Curriculum 在已见和未见领域上都提升了模型的鲁棒性和适应能力;(ii)我们的情景式训练框架增强了编码器和解码器对领域偏移的鲁棒性。
https://arxiv.org/abs/2309.02640
The study investigates the effectiveness of utilizing multimodal information in Neural Machine Translation (NMT). While prior research focused on using multimodal data in low-resource scenarios, this study examines how image features impact translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images might be redundant in this context. Additionally, the research introduces synthetic noise to assess whether images help the model deal with textual noise. Multimodal models slightly outperform text-only models in noisy settings, even with random images. The study's experiments translate from English to Hindi, Bengali, and Malayalam, outperforming state-of-the-art benchmarks significantly. Interestingly, the effect of visual context varies with source text noise: no visual context works best for non-noisy translations, cropped image features are optimal for low noise, and full image features work better in high-noise scenarios. This sheds light on the role of visual context, especially in noisy settings, opening up a new research direction for Noisy Neural Machine Translation in multimodal setups. The research emphasizes the importance of combining visual and textual information for improved translation in various environments.
本研究探讨了在神经机器翻译(NMT)中利用多模态信息的有效性。先前的研究侧重于在低资源场景中使用多模态数据,而本研究考察了将图像特征加入大规模预训练单模态NMT系统时对翻译的影响。令人惊讶的是,研究发现在这种情况下图像可能是冗余的。此外,研究引入合成噪声,以评估图像是否有助于模型处理文本噪声。在有噪声的设置下,即使使用随机图像,多模态模型也略优于纯文本模型。实验涵盖从英语到印地语、孟加拉语和马拉雅拉姆语的翻译,显著优于最先进的基准。有趣的是,视觉上下文的作用随源文本噪声而变化:无噪声翻译时不使用视觉上下文效果最好,低噪声时裁剪后的图像特征最优,而高噪声场景下完整图像特征效果更好。这揭示了视觉上下文尤其是在噪声环境中的作用,为多模态设置下的噪声神经机器翻译开辟了新的研究方向。该研究强调了结合视觉与文本信息以在各种环境中改进翻译的重要性。
https://arxiv.org/abs/2308.16075
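As a rough illustration of the synthetic noise setting studied above, the sketch below injects simple character-level noise (drop, repeat, substitute) into a source sentence. The exact noising scheme used in the paper may differ; this is only a generic example.

```python
import random

def add_char_noise(sentence: str, p: float = 0.1, seed: int = 0) -> str:
    """Inject simple synthetic character noise (drop, repeat, or substitute) into
    a source sentence; real noising schemes can be more elaborate."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for c in sentence:
        r = rng.random()
        if r < p / 3:            # drop the character
            continue
        elif r < 2 * p / 3:      # repeat the character
            out.append(c + c)
        elif r < p:              # substitute a random letter
            out.append(rng.choice(alphabet))
        else:                    # keep the character unchanged
            out.append(c)
    return "".join(out)

print(add_char_noise("the committee approved the new budget"))
```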
Neural Machine Translation (NMT) models have been shown to be vulnerable to adversarial attacks, wherein carefully crafted perturbations of the input can mislead the target model. In this paper, we introduce ACT, a novel adversarial attack framework against NMT systems guided by a classifier. In our attack, the adversary aims to craft meaning-preserving adversarial examples whose translations by the NMT model belong to a different class than the original translations in the target language. Unlike previous attacks, our new approach has a more substantial effect on the translation by altering the overall meaning, which leads to a different class determined by a classifier. To evaluate the robustness of NMT models to this attack, we propose enhancements to existing black-box word-replacement-based attacks by incorporating output translations of the target NMT model and the output logits of a classifier within the attack process. Extensive experiments in various settings, including a comparison with existing untargeted attacks, demonstrate that the proposed attack is considerably more successful in altering the class of the output translation and has more effect on the translation. This new paradigm can show the vulnerabilities of NMT systems by focusing on the class of translation rather than the mere translation quality as studied traditionally.
神经机器翻译(NMT)模型已被证明容易受到对抗攻击,精心构造的输入扰动可以误导目标模型。在本文中,我们提出了 ACT,一种由分类器引导的针对NMT系统的新型对抗攻击框架。在我们的攻击中,攻击者旨在构造保持原义的对抗样本,使其经NMT模型翻译后在目标语言中属于与原始译文不同的类别。与以往的攻击不同,我们的新方法通过改变整体含义对翻译产生更实质性的影响,从而导致分类器判定出不同的类别。为了评估NMT模型对这种攻击的鲁棒性,我们对现有的基于词替换的黑盒攻击进行了增强,将目标NMT模型的输出译文和分类器的输出 logits 纳入攻击过程。在多种设置下的大量实验(包括与现有非定向攻击的比较)表明,所提出的攻击在改变输出译文类别方面更为成功,并对翻译产生更大影响。这一新范式通过关注译文的类别而非传统研究中单纯的翻译质量,揭示了NMT系统的脆弱性。
https://arxiv.org/abs/2308.15246
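The following heavily simplified sketch conveys the flavour of a classifier-guided word-replacement attack. The translate and classify callables are hypothetical stand-ins for a black-box NMT system and a target-language classifier, and the real ACT attack additionally enforces meaning preservation and uses the classifier logits more carefully; none of this is the authors' implementation.

```python
def attack_sketch(source_tokens, synonyms, translate, classify, max_swaps=3):
    """Greedy word-replacement loop guided by a classifier on the translation:
    try substitutions from a synonym table and keep the one that flips the
    predicted class of the output translation. `translate` and `classify`
    (returning a (label, confidence) pair) are hypothetical stand-ins."""
    orig_label, _ = classify(translate(" ".join(source_tokens)))
    tokens = list(source_tokens)
    for _ in range(max_swaps):
        best = None
        for i, tok in enumerate(tokens):
            for cand in synonyms.get(tok, []):
                trial = tokens[:i] + [cand] + tokens[i + 1:]
                label, conf = classify(translate(" ".join(trial)))
                if label != orig_label and (best is None or conf > best[0]):
                    best = (conf, trial)
        if best is None:
            break                 # no substitution changed the translation's class
        tokens = best[1]
    return tokens
```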
There has been a growing interest in developing multimodal machine translation (MMT) systems that enhance neural machine translation (NMT) with visual knowledge. This problem setup involves using images as auxiliary information during training, and more recently, eliminating their use during inference. Towards this end, previous works face a challenge in training powerful MMT models from scratch due to the scarcity of annotated multilingual vision-language data, especially for low-resource languages. Simultaneously, there has been an influx of multilingual pre-trained models for NMT and multimodal pre-trained models for vision-language tasks, primarily in English, which have shown exceptional generalisation ability. However, these are not directly applicable to MMT since they do not provide aligned multimodal multilingual features for generative tasks. To alleviate this issue, instead of designing complex modules for MMT, we propose CLIPTrans, which simply adapts the independently pre-trained multimodal M-CLIP and the multilingual mBART. In order to align their embedding spaces, mBART is conditioned on the M-CLIP features by a prefix sequence generated through a lightweight mapping network. We train this in a two-stage pipeline which warms up the model with image captioning before the actual translation task. Through experiments, we demonstrate the merits of this framework and consequently push forward the state-of-the-art across standard benchmarks by an average of +2.67 BLEU. The code can be found at this http URL.
人们对开发利用视觉知识增强神经机器翻译(NMT)的多模态机器翻译(MMT)系统的兴趣日益增长。这一问题设定是在训练期间使用图像作为辅助信息,并且最近的工作进一步在推理期间不再使用图像。由于标注的多语言视觉-语言数据稀缺,尤其是低资源语言的数据,以往的工作在从零开始训练强大的MMT模型时面临挑战。与此同时,面向NMT的多语言预训练模型和面向视觉-语言任务(主要是英语)的多模态预训练模型大量涌现,并展现出卓越的泛化能力。然而,这些模型并不能直接用于MMT,因为它们没有为生成任务提供对齐的多模态多语言特征。为了解决这一问题,我们没有为MMT设计复杂模块,而是提出了 CLIPTrans,它简单地适配了各自独立预训练的多模态 M-CLIP 和多语言 mBART。为了对齐两者的嵌入空间,mBART 通过一个轻量级映射网络生成的前缀序列来接受 M-CLIP 特征的条件约束。我们采用两阶段流程进行训练:先用图像描述(image captioning)任务对模型进行预热,再进行实际的翻译任务。通过实验,我们证明了该框架的优点,并在标准基准上将最先进水平平均提升了 +2.67 BLEU。代码见此 http URL。
https://arxiv.org/abs/2308.15226
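A minimal sketch of the kind of lightweight mapping network described above: it projects a (frozen) M-CLIP embedding into a short prefix sequence in the language model's embedding space. Layer sizes, prefix length, and the module name are illustrative assumptions, not CLIPTrans's actual configuration.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Project a CLIP-style embedding into prefix_len vectors of dimension lm_dim,
    which can then condition the seq2seq LM (dimensions are illustrative)."""
    def __init__(self, clip_dim=512, lm_dim=1024, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.Tanh(),
            nn.Linear(lm_dim, lm_dim * prefix_len),
        )

    def forward(self, clip_features):             # (batch, clip_dim)
        prefix = self.proj(clip_features)          # (batch, prefix_len * lm_dim)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

# The resulting prefix would be prepended to the LM's input embeddings.
mapper = PrefixMapper()
dummy = torch.randn(2, 512)
print(mapper(dummy).shape)  # torch.Size([2, 10, 1024])
```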
Consistency regularization methods, such as R-Drop (Liang et al., 2021) and CrossConST (Gao et al., 2023), have achieved impressive supervised and zero-shot performance in the neural machine translation (NMT) field. Can we also boost end-to-end (E2E) speech-to-text translation (ST) by leveraging consistency regularization? In this paper, we conduct empirical studies on intra-modal and cross-modal consistency and propose two training strategies, SimRegCR and SimZeroCR, for E2E ST in regular and zero-shot scenarios. Experiments on the MuST-C benchmark show that our approaches achieve state-of-the-art (SOTA) performance in most translation directions. The analyses prove that regularization brought by the intra-modal consistency, instead of modality gap, is crucial for the regular E2E ST, and the cross-modal consistency could close the modality gap and boost the zero-shot E2E ST performance.
一致性正则化方法,如 R-Drop (Liang et al., 2021) 和 CrossConST (Gao et al., 2023),已在神经机器翻译(NMT)领域取得了令人印象深刻的有监督和零样本性能。我们能否也通过一致性正则化来提升端到端(E2E)语音到文本翻译(ST)?在本文中,我们对模态内和跨模态一致性进行了实证研究,并针对常规和零样本场景下的E2E ST提出了两种训练策略:SimRegCR 和 SimZeroCR。在 MuST-C 基准上的实验表明,我们的方法在大多数翻译方向上达到了最先进(SOTA)性能。分析证明,对常规E2E ST而言,关键在于模态内一致性带来的正则化而非模态差距;而跨模态一致性可以缩小模态差距,提升零样本E2E ST性能。
https://arxiv.org/abs/2308.14482
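For context, here is a generic R-Drop-style consistency objective of the kind the abstract builds on: two forward passes of the same batch with different dropout masks, cross-entropy on both, plus a symmetric KL term pulling the two output distributions together. The exact losses and weights used by SimRegCR/SimZeroCR may differ.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a, logits_b, labels, alpha=1.0, pad_id=-100):
    """Cross-entropy on two dropout passes plus a symmetric KL term that pulls
    the two output distributions together (generic R-Drop-style recipe)."""
    ce = F.cross_entropy(logits_a.transpose(1, 2), labels, ignore_index=pad_id) + \
         F.cross_entropy(logits_b.transpose(1, 2), labels, ignore_index=pad_id)
    p = F.log_softmax(logits_a, dim=-1)
    q = F.log_softmax(logits_b, dim=-1)
    kl = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean") +
                F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return ce + alpha * kl

# logits_a / logits_b: two forward passes of the same batch with different
# dropout masks, each of shape (batch, seq_len, vocab); labels: (batch, seq_len).
la, lb = torch.randn(2, 5, 100), torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))
print(consistency_loss(la, lb, labels).item())
```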
The Fon language, spoken by about 2 million people, is a truly low-resourced African language, with a limited online presence and few existing datasets, to name but a few of its limitations. Multitask learning is a learning paradigm that aims to improve the generalization capacity of a model by sharing knowledge across different but related tasks; this can be especially valuable in very data-scarce scenarios. In this paper, we present the first explorative approach to multitask learning for enhancing model capabilities in Natural Language Processing for the Fon language. Specifically, we explore the tasks of Named Entity Recognition (NER) and Part-of-Speech tagging (POS) for Fon. We leverage two language model heads as encoders to build shared representations for the inputs, and we use blocks of linear layers for classification relative to each task. Our results on the NER and POS tasks for Fon show competitive (or better) performance compared to several multilingual pretrained language models finetuned on single tasks. Additionally, we perform a few ablation studies to compare the efficiency of two different loss combination strategies and find that the equal loss weighting approach works best in our case. Our code is open-sourced at this https URL.
Fon语是约200万人使用的真正低资源非洲语言,其在线资源有限,现有数据集也很少,这只是其诸多限制中的几项。多任务学习是一种旨在通过在不同但相关的任务之间共享知识来提高模型泛化能力的学习范式,这在数据极度稀缺的场景中尤为适用。在本文中,我们首次探索利用多任务学习增强Fon语自然语言处理的模型能力。具体而言,我们研究了Fon语的命名实体识别(NER)和词性标注(POS)任务。我们利用两个语言模型头作为编码器为输入构建共享表示,并对每个任务使用线性层块进行分类。我们在Fon语NER和POS任务上的结果显示,与在单一任务上微调的多个多语言预训练语言模型相比,其性能具有竞争力(甚至更好)。此外,我们进行了若干消融研究,比较两种不同损失组合策略的效果,发现等权重损失方法在我们的场景中效果最好。我们的代码已在此 https URL 开源。
https://arxiv.org/abs/2308.14280
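A compact sketch of the multitask setup described above: a shared representation feeding one linear head per task, with the two losses summed using equal weights. The paper combines two language-model heads as encoders; here a single placeholder GRU encoder stands in, and all dimensions and label counts are illustrative.

```python
import torch
import torch.nn as nn

class SharedEncoderTagger(nn.Module):
    """One shared encoder, one linear head per task, equal-weight loss sum."""
    def __init__(self, hidden=768, n_ner=9, n_pos=17):
        super().__init__()
        # Placeholder encoder; the paper uses pretrained language-model encoders.
        self.encoder = nn.GRU(input_size=hidden, hidden_size=hidden, batch_first=True)
        self.ner_head = nn.Linear(hidden, n_ner)
        self.pos_head = nn.Linear(hidden, n_pos)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, embeddings, ner_labels, pos_labels):
        states, _ = self.encoder(embeddings)                  # (batch, seq, hidden)
        loss_ner = self.loss_fn(self.ner_head(states).transpose(1, 2), ner_labels)
        loss_pos = self.loss_fn(self.pos_head(states).transpose(1, 2), pos_labels)
        return loss_ner + loss_pos                            # equal loss weighting

model = SharedEncoderTagger()
x = torch.randn(2, 6, 768)
print(model(x, torch.randint(0, 9, (2, 6)), torch.randint(0, 17, (2, 6))).item())
```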
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers. NMT for low-resource languages is particularly compelling as it involves learning with limited labelled data. However, obtaining a well-aligned parallel corpus for low-resource languages can be challenging. The disparity between the technological advancement of a few global languages and the lack of research on NMT for local languages in Chad is striking. End-to-end NMT trials on low-resource Chadian languages have not been attempted. Additionally, unlike for some African languages, there is a dearth of online and well-structured data gathering for research in Natural Language Processing. However, a guided approach to data gathering can produce bitext data for many Chadian-language translation pairs with well-known languages that have ample data. In this project, we created the first sba-Fr dataset, a corpus of Ngambay-to-French translations, and fine-tuned three pre-trained models using this dataset. Our experiments show that the M2M100 model outperforms the other models, with high BLEU scores on both original and original+synthetic data. The publicly available bitext dataset can be used for research purposes.
在非洲乃至全世界,人们越来越重视开发神经机器翻译(NMT)系统以克服语言障碍。低资源语言的NMT尤其引人关注,因为它需要在有限的标注数据下进行学习。然而,为低资源语言获取对齐良好的平行语料库可能很困难。少数全球性语言的技术进步与乍得本地语言NMT研究的缺乏之间形成了鲜明反差。此前尚未有人尝试针对乍得低资源语言进行端到端NMT实验。此外,与一些非洲语言不同,用于自然语言处理研究的在线且结构良好的数据收集也很匮乏。不过,采用有指导的数据收集方法,可以为许多乍得语言与拥有充足数据的常见语言之间的翻译对生成双语文本数据。在本项目中,我们创建了首个 sba-Fr 数据集,即 Ngambay 到法语的翻译语料库,并使用该数据集微调了三个预训练模型。实验表明,无论在原始数据还是原始+合成数据上,M2M100 模型的 BLEU 得分都优于其他模型。该公开的双语文本数据集可用于研究目的。
https://arxiv.org/abs/2308.13497
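For reference, this is the standard off-the-shelf usage of an M2M100 checkpoint of the kind the authors fine-tune (the 418M checkpoint is an assumption here). Ngambay (sba) is not among M2M100's pretraining languages, so the paper's actual fine-tuning details, such as which language code is reused for sba, are not reflected in this sketch.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Off-the-shelf French-to-English generation with a pretrained M2M100 checkpoint;
# the paper further fine-tunes such a model on its sba-Fr parallel data.
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"
encoded = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```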
Machine Translation is one of the essential tasks in Natural Language Processing (NLP); it has massive applications in real life and also contributes to other tasks in the NLP research community. Recently, Transformer-based methods have attracted numerous researchers in this domain and achieved state-of-the-art results for most language pairs. In this paper, we report an effective method using a phrase mechanism, PhraseTransformer, to improve the strong baseline model Transformer in constructing a Neural Machine Translation (NMT) system for the Vietnamese-Chinese parallel corpus. Our experiments on the MT dataset of the VLSP 2022 competition achieved BLEU scores of 35.3 for Vietnamese to Chinese and 33.2 for Chinese to Vietnamese. Our code is available at this https URL.
机器翻译是自然语言处理(NLP)中的核心任务之一,在现实生活中应用广泛,同时也为NLP研究社区的其他任务作出贡献。近年来,基于Transformer的方法吸引了该领域的众多研究人员,并在大多数语言对上取得了最先进的结果。在本文中,我们报告了一种使用短语机制的有效方法 PhraseTransformer,用于改进强基线模型 Transformer,以构建面向越南语-汉语平行语料库的神经机器翻译(NMT)系统。我们在 VLSP 2022 评测的MT数据集上进行的实验取得了越南语到汉语 35.3、汉语到越南语 33.2 的 BLEU 分数。我们的代码可在此 https URL 获取。
https://arxiv.org/abs/2308.10482
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos, both of which have different grammar and word/gloss order. From a Neural Machine Translation (NMT) perspective, the straightforward way of training translation models is to use sign language phrase-spoken language sentence pairs. However, human interpreters heavily rely on the context to understand the conveyed information, especially for sign language interpretation, where the vocabulary size may be significantly smaller than their spoken language equivalent. Taking direct inspiration from how humans translate, we propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would. We use the context from previous sequences and confident predictions to disambiguate weaker visual cues. To achieve this we use complementary transformer encoders, namely: (1) A Video Encoder, that captures the low-level video features at the frame-level, (2) A Spotting Encoder, that models the recognized sign glosses in the video, and (3) A Context Encoder, which captures the context of the preceding sign sequences. We combine the information coming from these encoders in a final transformer decoder to generate spoken language translations. We evaluate our approach on the recently published large-scale BOBSL dataset, which contains ~1.2M sequences, and on the SRF dataset, which was part of the WMT-SLT 2022 challenge. We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
手语翻译(SLT)是一项具有挑战性的任务,旨在从手语视频生成口语句子,而两者在语法和词/注释(gloss)顺序上均不相同。从神经机器翻译(NMT)的角度看,训练翻译模型的直接方式是使用手语短语-口语句子对。然而,人类译员在理解所传达信息时高度依赖上下文,对手语翻译尤其如此,因为手语的词汇量可能远小于对应的口语。我们直接借鉴人类的翻译方式,提出了一种新颖的多模态Transformer架构,像人类一样以上下文感知的方式处理翻译任务。我们利用来自先前序列的上下文和置信度高的预测来消解较弱视觉线索的歧义。为此,我们使用互补的Transformer编码器:(1)视频编码器,在帧级别捕捉低层视频特征;(2)定位(Spotting)编码器,对视频中识别出的手语注释进行建模;(3)上下文编码器,捕捉此前手语序列的上下文。我们在最终的Transformer解码器中融合来自这些编码器的信息,生成口语翻译。我们在最近发布的大规模 BOBSL 数据集(包含约120万个序列)以及 WMT-SLT 2022 挑战赛的 SRF 数据集上评估了我们的方法。利用上下文信息,我们在最先进翻译性能上取得了显著提升,BLEU-4 分数几乎是基线方法的两倍。
https://arxiv.org/abs/2308.09622
The Transformer model has revolutionized Natural Language Processing tasks such as Neural Machine Translation, and many efforts have been made to study the Transformer architecture, which increased its efficiency and accuracy. One potential area for improvement is to address the computation of empty tokens that the Transformer computes only to discard them later, leading to an unnecessary computational burden. To tackle this, we propose an algorithm that sorts translation sentence pairs based on their length before batching, minimizing the waste of computing power. Since the amount of sorting could violate the independent and identically distributed (i.i.d) data assumption, we sort the data partially. In experiments, we apply the proposed method to English-Korean and English-Luganda language pairs for machine translation and show that there are gains in computational time while maintaining the performance. Our method is independent of architectures, so that it can be easily integrated into any training process with flexible data lengths.
Transformer模型已经彻底改变了神经机器翻译等自然语言处理任务,人们为研究Transformer架构付出了大量努力,提高了其效率和准确性。一个潜在的改进方向是解决空标记(empty tokens)的计算问题:Transformer对这些标记进行计算,之后却将其丢弃,造成不必要的计算负担。为此,我们提出了一种算法,在组批之前按长度对翻译句对进行排序,从而最大限度地减少算力浪费。由于排序程度过高可能违反独立同分布(i.i.d.)数据假设,我们只对数据进行部分排序。在实验中,我们将所提方法应用于英语-韩语和英语-卢干达语语言对的机器翻译,结果表明在保持性能的同时缩短了计算时间。我们的方法与架构无关,因此可以轻松集成到任何数据长度灵活的训练过程中。
https://arxiv.org/abs/2308.08153
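The core idea above (sorting only within local buckets so padding is reduced without fully breaking the i.i.d. assumption) can be sketched in a few lines of Python. Bucket size and the use of source-token count as the length key are illustrative choices, not necessarily the paper's.

```python
import random

def partially_sorted_batches(pairs, batch_size, bucket_factor=50, rng=None):
    """Sort sentence pairs by source length only inside local buckets, then batch:
    padding (wasted computation) shrinks while the data stays only partially sorted."""
    rng = rng or random.Random(0)
    bucket_size = batch_size * bucket_factor
    batches = []
    for start in range(0, len(pairs), bucket_size):
        bucket = sorted(pairs[start:start + bucket_size], key=lambda p: len(p[0].split()))
        batches.extend(bucket[i:i + batch_size] for i in range(0, len(bucket), batch_size))
    rng.shuffle(batches)          # shuffle batch order to limit ordering effects
    return batches

data = [("a b c", "x y z"), ("a", "x"), ("a b c d e f", "x y"), ("a b", "x y")]
for batch in partially_sorted_batches(data, batch_size=2, bucket_factor=1):
    print(batch)
```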
We explore the effectiveness of character-level neural machine translation using Transformer architecture for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish. We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level input segmentation, while for less related languages, character-level vanilla Transformer-base often lags behind subword-level segmentation. We confirm previous findings that it is possible to close the gap by finetuning the already trained subword-level models to character-level.
我们探索了使用Transformer架构的字符级神经机器翻译在不同语言相似度和训练数据规模下的有效性,研究对象为捷克语与克罗地亚语、德语、匈牙利语、斯洛伐克语和西班牙语之间的翻译。我们使用自动MT指标评估模型,结果表明相似语言之间的翻译受益于字符级输入切分,而对于关联性较低的语言,字符级的原始 Transformer-base 模型往往落后于子词级切分。我们证实了此前的发现:通过将已训练好的子词级模型微调到字符级,可以缩小这一差距。
https://arxiv.org/abs/2308.04398
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE), however, they are inefficient as they require parallel corpora, days to train and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle- and high-resource scenarios, where we compare the performance of using different segmentation methods. The experimental results demonstrate that on the low-resource ALT dataset, our method achieves more than 1.2 BLEU score improvement compared with BPE and SentencePiece, and a 1.1 score improvement over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT) on average. The regularization method achieves approximately a 4.3 BLEU score improvement over BPE and a 1.2 BLEU score improvement over BPE-dropout, the regularized version of BPE. We also observed significant improvements on IWSLT15 Vi->En, WMT16 Ro->En and WMT15 Fi->En datasets, and competitive results on the WMT14 De->En and WMT14 Fr->En datasets.
子词切分是神经机器翻译(NMT)的一个重要预处理步骤。已有工作表明,神经子词切分器优于字节对编码(BPE),但它们效率低下,因为需要平行语料,训练要数天,解码要数小时。本文提出 SelfSeg,一种自监督的神经子词切分方法,其训练/解码速度快得多,并且只需要单语词典而非平行语料。SelfSeg 以部分掩码的字符序列形式的单词作为输入,优化单词生成概率,并生成具有最大后验概率的切分,该后验概率通过动态规划算法计算。SelfSeg 的训练时间取决于词频,我们探索了若干词频归一化策略来加速训练阶段。此外,我们提出了一种正则化机制,允许切分器为同一个词生成多种切分。为展示方法的有效性,我们在低、中、高资源场景下进行了MT实验,比较了不同切分方法的性能。实验结果表明,在低资源 ALT 数据集上,我们的方法相比 BPE 和 SentencePiece 取得了超过 1.2 BLEU 的提升,相比动态规划编码(DPE)和基于最优传输的词表学习(VOLT)平均提升 1.1 分。正则化方法相比 BPE 提升约 4.3 BLEU,相比 BPE 的正则化版本 BPE-dropout 提升 1.2 BLEU。我们还在 IWSLT15 Vi->En、WMT16 Ro->En 和 WMT15 Fi->En 数据集上观察到显著提升,并在 WMT14 De->En 和 WMT14 Fr->En 数据集上取得了有竞争力的结果。
https://arxiv.org/abs/2307.16400
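To illustrate the dynamic-programming step mentioned above, the following sketch finds the maximum-scoring segmentation of a word given some per-subword log-probability function. SelfSeg obtains these scores from a self-supervised model over partially masked character sequences; the toy scorer here is purely illustrative.

```python
import math

def dp_segment(word, subword_logprob, max_len=8):
    """Dynamic programme returning the subword sequence of `word` with the
    highest total score under `subword_logprob` (the max-posterior search;
    the scoring model itself is not shown here)."""
    n = len(word)
    best = [(-math.inf, -1)] * (n + 1)      # (best score up to position i, backpointer)
    best[0] = (0.0, -1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j][0] + subword_logprob(word[j:i])
            if score > best[i][0]:
                best[i] = (score, j)
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(word[j:i])
        i = j
    return list(reversed(pieces))

# Toy scorer: favour a few known pieces, penalise everything else by length.
known = {"un": -1.0, "happi": -1.5, "ness": -1.0, "happy": -1.2}
print(dp_segment("unhappiness", lambda p: known.get(p, -4.0 * len(p))))  # ['un', 'happi', 'ness']
```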
Despite the tremendous success of Neural Machine Translation (NMT), its performance on low-resource language pairs still remains subpar, partly due to the limited ability to handle previously unseen inputs, i.e., generalization. In this paper, we propose a method called Joint Dropout that addresses the challenge of low-resource neural machine translation by substituting phrases with variables, resulting in a significant enhancement of compositionality, which is a key aspect of generalization. We observe a substantial improvement in translation quality for language pairs with minimal resources, as seen in BLEU and Direct Assessment scores. Furthermore, we conduct an error analysis and find that Joint Dropout also enhances the generalizability of low-resource NMT in terms of robustness and adaptability across different domains.
尽管神经机器翻译(NMT)取得了巨大成功,但其在低资源语言对上的性能仍然欠佳,部分原因是处理此前未见输入的能力有限,即泛化能力不足。在本文中,我们提出了一种名为 Joint Dropout 的方法,通过用变量替换短语来应对低资源神经机器翻译的挑战,从而显著增强组合性,而组合性是泛化的关键方面。从 BLEU 和 Direct Assessment 分数可以看出,资源极少的语言对的翻译质量有了显著提升。此外,我们进行了错误分析,发现 Joint Dropout 还能在鲁棒性和跨领域适应性方面提升低资源NMT的泛化能力。
https://arxiv.org/abs/2307.12835
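A toy sketch of the phrase-to-variable substitution idea behind Joint Dropout: aligned phrase pairs are replaced by shared placeholder variables on both the source and target side, encouraging compositional generalization. Phrase alignment extraction and the actual dropout schedule from the paper are not shown, and all names here are illustrative.

```python
import random

def joint_dropout(src_tokens, tgt_tokens, aligned_phrases, p=0.5, rng=None):
    """Replace aligned phrase pairs with shared placeholder variables (X1, X2, ...)
    on both the source and the target side, with probability p per phrase."""
    rng = rng or random.Random(1)
    src, tgt = " ".join(src_tokens), " ".join(tgt_tokens)
    var_id = 0
    for src_phrase, tgt_phrase in aligned_phrases:
        if src_phrase in src and tgt_phrase in tgt and rng.random() < p:
            var_id += 1
            var = f"X{var_id}"
            src = src.replace(src_phrase, var, 1)
            tgt = tgt.replace(tgt_phrase, var, 1)
    return src.split(), tgt.split()

src = "the old man reads a book".split()
tgt = "der alte Mann liest ein Buch".split()
print(joint_dropout(src, tgt, [("the old man", "der alte Mann"), ("a book", "ein Buch")]))
```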