Abstract
Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs rely heavily on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR on 13K monolingual examples and Qwen2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX2-28-9B, as well as larger general LLMs such as Qwen2.5-32B-Instruct, on English $\leftrightarrow$ Chinese translation tasks from the WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared with the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings offer valuable insights into the potential of self-improving RL methods. We have publicly released our code, data, and models.
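To make the self-judging reward concrete, below is a minimal Python sketch of how a policy model could score its own translations and convert those scores into group-normalized advantages for online RL. It is illustrative rather than the paper's exact recipe: the `generate` callable, the judging prompt, the 0-100 scoring scale, and the GRPO-style normalization in `group_advantages` are assumptions, not details taken from SSR-Zero.

```python
import re
import statistics
from typing import Callable, List


def self_reward(
    generate: Callable[[str], str],
    source: str,
    candidate: str,
    src_lang: str = "English",
    tgt_lang: str = "Chinese",
) -> float:
    """Ask the *same* policy model to judge its own translation.

    `generate` is any text-in/text-out callable backed by the policy model;
    the prompt wording and 0-100 scale are illustrative assumptions.
    """
    judge_prompt = (
        f"Rate the following {src_lang}-to-{tgt_lang} translation for accuracy "
        f"and fluency on a scale from 0 to 100.\n"
        f"Source: {source}\n"
        f"Translation: {candidate}\n"
        f"Reply with the number only."
    )
    reply = generate(judge_prompt)
    match = re.search(r"\d+(?:\.\d+)?", reply)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 100.0) / 100.0  # clamp and normalize to [0, 1]


def group_advantages(rewards: List[float]) -> List[float]:
    """Turn a group of self-assigned rewards into zero-mean advantages.

    Group-relative normalization (GRPO-style) is one common way to use such
    rewards online without a learned value model.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # Stand-in for the policy model; a real setup would query the LLM itself.
    def fake_generate(prompt: str) -> str:
        return "87"

    src = "The weather is nice today."
    candidates = ["今天天气很好。", "今天的天气不错。", "天气今天。"]
    rewards = [self_reward(fake_generate, src, c) for c in candidates]
    print(group_advantages(rewards))
```

Because the judge and the policy share the same weights, no reference translations or external reward model are required, which is what makes this kind of pipeline reference-free and fully online.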
URL
https://arxiv.org/abs/2505.16637