Paper Reading AI Learner

Expressive Value Learning for Scalable Offline Reinforcement Learning

2025-10-09 13:42:20
Nicolas Espinosa-Dice, Kiante Brantley, Wen Sun

Abstract

Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference-time, EVOR performs inference-time policy extraction via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
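The abstract describes inference-time policy extraction by rejection sampling candidate actions against a learned Q-function. The sketch below is purely illustrative and not the paper's implementation: the base policy interface `base_policy.sample`, the value function `q_fn`, and the knobs `num_candidates` and `beta` are all hypothetical stand-ins used to show the general best-of-N / rejection-style selection idea.

```python
# Minimal sketch (assumptions noted above), not EVOR's actual algorithm.
import numpy as np

def extract_action(obs, base_policy, q_fn, num_candidates=64, beta=1.0, rng=None):
    """Pick an action by sampling candidates and reweighting them with a Q-function.

    Candidates come from an expressive generative policy (e.g. a flow-matching
    model); each is scored by the learned Q-function, and one is drawn with
    probability proportional to exp(Q / beta). Raising `num_candidates` scales
    inference-time search without retraining either model.
    """
    rng = rng or np.random.default_rng()
    candidates = base_policy.sample(obs, num_candidates)   # (N, action_dim) candidate actions
    scores = q_fn(obs, candidates)                         # (N,) Q-values for each candidate
    weights = np.exp((scores - scores.max()) / beta)       # numerically stabilized exponential weights
    probs = weights / weights.sum()
    idx = rng.choice(num_candidates, p=probs)
    return candidates[idx]
```

One design note implied by the abstract: because selection happens only at inference time, more compute (more candidates) can be spent on harder states without touching the trained policy or value function.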

Abstract (translated)

Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, due to its limited scalability, RL has yet to be fully leveraged in robotics. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex tasks and datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically rely either on backpropagation through time (BPTT), which is computationally prohibitive, or on policy distillation, which introduces compounding errors and limits scalability to larger base policies. This paper studies how to develop a scalable offline RL method that relies on neither distillation nor backpropagation through time. We introduce EVOR (Expressive Value Learning for Offline Reinforcement Learning), a scalable offline RL approach that combines expressive policies with expressive value functions. During training, EVOR learns an optimal, regularized Q-function via flow matching. At inference time, EVOR extracts a policy by rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Experiments show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.

URL

https://arxiv.org/abs/2510.08218

PDF

https://arxiv.org/pdf/2510.08218.pdf
