Paper Reading AI Learner

Improving Language Models with Advantage-based Offline Policy Gradients

2023-05-24 04:42:17
Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl

Abstract

Improving language model generations according to some user-defined quality or style constraints is challenging. Typical approaches include learning on additional human-written data, filtering "low-quality" data using heuristics, and/or using reinforcement learning with human feedback (RLHF). However, filtering can remove valuable training signals, whereas data collection and RLHF constantly require additional human-written or LM exploration data, which can be costly to obtain. A natural question to ask is "Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?" To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients for learning language generation tasks as a 1-step RL game. LoL-RL can finetune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments with five different language generation tasks using models of varying sizes and multiple rewards show that models trained with LoL-RL can consistently outperform the best supervised learning models. We also release our experimental code. this https URL
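The abstract frames generation as a 1-step RL game optimized with offline policy gradients: each (input, reference output) pair from the fixed dataset is a single action, scored by a utility function, and the LM is updated by weighting its log-likelihood of the reference output with an advantage. The sketch below illustrates that idea in plain Python under stated assumptions: the function name `lol_rl_loss`, the scalar `baseline` used to form advantages, and the choice to drop negative-advantage examples (rather than, say, clip or reweight them) are illustrative assumptions, not details confirmed by the abstract.

```python
def lol_rl_loss(log_probs, rewards, baseline):
    """Sketch of a 1-step advantage-based offline policy-gradient loss.

    log_probs: per-example log-likelihoods of the dataset's reference
               outputs under the current LM (the "policy").
    rewards:   utility scores for those outputs, e.g. from a classifier
               or a human-defined utility function.
    baseline:  a scalar value estimate; reward minus baseline gives the
               advantage. (Assumed form; the paper may estimate this
               differently.)
    """
    losses = []
    for lp, r in zip(log_probs, rewards):
        advantage = r - baseline
        # Assumption: only positive-advantage ("left-over") examples
        # contribute; others are dropped instead of pushed down.
        advantage = max(advantage, 0.0)
        # Minimizing -advantage * log_prob raises the likelihood of
        # outputs that score above the baseline.
        losses.append(-advantage * lp)
    return sum(losses) / len(losses)
```

Because the data is fixed and each sequence is treated as one action, this needs no environment rollouts, which is what lets the method reuse existing crowd-sourced or internet data instead of collecting fresh exploration data.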


URL

https://arxiv.org/abs/2305.14718

PDF

https://arxiv.org/pdf/2305.14718.pdf

