Paper Reading AI Learner

VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation

2025-04-01 01:38:25
Hoang Hai Phan, Nguyen Duc Minh Vu, Nam Dang Phuong

Abstract

Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.
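The augmentation step (Chain-of-Thought prompting over challenging segments found by corpus analysis) is only described at a high level in the abstract. A minimal illustrative sketch of the idea, assuming a simple token-rarity heuristic as a stand-in for the paper's corpus analysis; the scoring rule, prompt wording, and function names are all hypothetical, not the authors' implementation:

```python
# Sketch of targeted augmentation: rank Vietnamese segments by token
# rarity (a proxy for "challenging"), then build a Chain-of-Thought
# prompt for the hardest ones. The resulting prompt would be sent to a
# strong LLM to generate synthetic Vi-Ja training pairs.
from collections import Counter

def difficulty_scores(segments):
    """Rank segments by mean inverse token frequency (rarer tokens = harder)."""
    freq = Counter(tok for seg in segments for tok in seg.split())
    def score(seg):
        toks = seg.split()
        return sum(1.0 / freq[t] for t in toks) / max(len(toks), 1)
    return sorted(segments, key=score, reverse=True)

def build_cot_prompt(vi_segment):
    """Chain-of-Thought prompt asking the LLM to reason before translating."""
    return (
        "Translate the following Vietnamese sentence into Japanese.\n"
        "First, explain the key terms, honorifics, and cultural nuances "
        "step by step; then give the final Japanese translation on the "
        "last line after the marker 'Translation:'.\n\n"
        f"Vietnamese: {vi_segment}"
    )

segments = ["xin chào", "xin chào bạn", "cơ chế tự chú ý"]
hardest = difficulty_scores(segments)[0]   # segment with the rarest tokens
print(build_cot_prompt(hardest))
```

The synthetic pairs produced this way would then feed the fine-tuning stage (Unsloth with QLoRA on the 1.8B Sailor model), which is omitted here since it depends on GPU-specific tooling.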

URL

https://arxiv.org/abs/2504.00339

PDF

https://arxiv.org/pdf/2504.00339.pdf
