
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

2024-10-03 17:55:13
Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu

Abstract

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at this https URL.
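To make the idea concrete, the following is a minimal sketch of macro-action credit assignment, based only on the abstract above: tokens are grouped into contiguous macro actions, generalized advantage estimation (GAE) runs over the much shorter macro-action sequence, and each macro action's advantage is shared by its member tokens. The fixed n-gram grouping, the function name macro_advantages, and all parameter choices are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def macro_advantages(token_rewards, token_values, macro_len=5,
                     gamma=1.0, lam=0.95):
    # ASSUMPTION: fixed-length n-gram macro actions; the paper's
    # "higher-level language constructs" may use other groupings.
    T = len(token_rewards)
    bounds = list(range(0, T, macro_len)) + [T]
    spans = list(zip(bounds[:-1], bounds[1:]))

    # Aggregate the per-token reward over each macro action and take
    # the critic's value estimate at each macro-action boundary.
    rewards = [sum(token_rewards[s:e]) for s, e in spans]
    values = [token_values[s] for s, _ in spans]

    # Standard GAE, but over the shorter macro-action sequence, so the
    # reward signal travels fewer steps back to each decision.
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last

    # Every token inside a macro action shares that macro action's
    # advantage when the per-token PPO policy loss is computed.
    return np.repeat(adv, [e - s for s, e in spans])

In standard token-level RLHF, the GAE loop above would run over every token, so a terminal reward must be propagated back through the full sequence length; with macro_len = 5 the effective credit-assignment horizon shrinks by roughly 5x, which is consistent with the faster convergence the abstract reports.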

Abstract (translated)

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences: delayed rewards make it hard for the model to tell which actions led to successful outcomes, which hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions (sequences of tokens or higher-level language constructs) into the learning process. By operating at this higher level of abstraction, our approach shortens the temporal distance between actions and rewards, enabling faster and more accurate credit assignment. This yields more stable policy gradient estimates and improves learning efficiency within each episode, without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in training time and continues to outperform it with further training. We will release our code and data publicly at the link in this paper.

URL

https://arxiv.org/abs/2410.02743

PDF

https://arxiv.org/pdf/2410.02743.pdf

