Abstract
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model by discriminating between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and ordinary model-generated outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering, and post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
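The sketch below illustrates the core idea stated in the abstract: a discriminator is trained to separate a small pool of Preference Proxy Data from ordinary model generations, and its score is then reused as a reward for Best-of-N filtering. This is a minimal PyTorch sketch under stated assumptions, not the paper's implementation; the network architecture, image resolution, binary cross-entropy objective, optimizer settings, and helper names (`RewardDiscriminator`, `train_reward_model`, `best_of_n`) are all illustrative choices not specified in the abstract.

```python
# Minimal sketch of a GAN-RM-style reward model (assumptions noted above).
import torch
import torch.nn as nn


class RewardDiscriminator(nn.Module):
    """Scores an image; higher logits mean 'closer to the Preference Proxy Data'."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1),  # single reward logit
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


def train_reward_model(proxy_images, generated_images, epochs=5):
    """Train by discriminating proxy (target) samples from ordinary generations."""
    model = RewardDiscriminator()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        # Label proxy samples 1 ("preferred-like") and generated samples 0.
        logits = torch.cat([model(proxy_images), model(generated_images)])
        labels = torch.cat([torch.ones(len(proxy_images)),
                            torch.zeros(len(generated_images))])
        loss = bce(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def best_of_n(model, candidates):
    """Test-time scaling: keep the candidate with the highest reward score."""
    with torch.no_grad():
        scores = model(candidates)
    return candidates[scores.argmax()]


if __name__ == "__main__":
    # Toy stand-ins for a few hundred proxy images and a batch of model outputs.
    proxy = torch.rand(8, 3, 64, 64)      # Preference Proxy Data (placeholder)
    generated = torch.rand(8, 3, 64, 64)  # ordinary model outputs (placeholder)
    rm = train_reward_model(proxy, generated)
    best = best_of_n(rm, torch.rand(4, 3, 64, 64))
    print(best.shape)
```

The same reward score could, in principle, also rank candidate pairs to build preference data for SFT or DPO post-training, as the abstract mentions, though those pipelines are not shown here.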
URL
https://arxiv.org/abs/2506.13846