Abstract
Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet it still struggles with low-resource language pairs such as Vietnamese-Japanese (Vi-Ja), where parallel data is sparse and linguistic and cultural nuances are difficult to capture. Recent progress in Large Language Models (LLMs) with strong reasoning abilities, often refined via Reinforcement Learning (RL), enables the generation of high-quality synthetic data. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy that applies advanced LLMs with Chain-of-Thought prompting to challenging segments identified through corpus analysis. We then employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B-parameter Sailor model, which is based on the Qwen architecture) to create a practical, high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.
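To make the fine-tuning recipe named in the abstract concrete, the sketch below shows how QLoRA training with Unsloth on a Sailor-class model might look. This is a minimal illustration under assumptions, not the authors' released code: the Hugging Face model id `sail/Sailor-1.8B`, the LoRA hyperparameters, and the instruction format for Vi-Ja pairs are all illustrative choices.

```python
# Minimal sketch (assumptions, not the paper's code): QLoRA fine-tuning of a
# Sailor 1.8B model with Unsloth for Vietnamese -> Japanese translation.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

# Load the base model with 4-bit quantized weights (the QLoRA setting).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sail/Sailor-1.8B",  # assumed model id for the Qwen-based Sailor
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only small low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # illustrative LoRA rank
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Toy parallel example in a simple instruction format; real training would use
# the LLM-augmented Vi-Ja corpus the paper describes.
pairs = [
    {"text": "Translate Vietnamese to Japanese.\n"
             "Vietnamese: Xin chào.\nJapanese: こんにちは。"},
]
dataset = Dataset.from_list(pairs)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="vnjp-sailor-qlora",
    ),
)
trainer.train()
```

Because only the low-rank adapters are updated over a 4-bit base model, this style of fine-tuning fits comfortably on a single consumer GPU, which matches the paper's emphasis on a practical, low-parameter system.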
URL
https://arxiv.org/abs/2504.00339