Abstract
A broad use case of large language models (LLMs) is in goal-directed decision-making tasks (or "agent" tasks), where an LLM needs not just to generate completions for a given prompt, but rather to make intelligent decisions over a multi-turn interaction to accomplish a task (e.g., when interacting with the web, using tools, or providing customer support). Reinforcement learning (RL) provides a general paradigm to address such agent tasks, but current RL methods for LLMs largely focus on optimizing single-turn rewards. By construction, most single-turn RL methods cannot endow LLMs with the ability to intelligently seek information over multiple turns, perform credit assignment, or reason about their past actions -- all of which are critical in agent tasks. This raises the question: how can we design effective and efficient multi-turn RL algorithms for LLMs? In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization), while effectively accommodating multiple turns, long horizons, and delayed rewards. To do this, our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel: a high-level off-policy value-based RL algorithm to aggregate reward over utterances, and a low-level RL algorithm that utilizes this high-level value function to train a token-level policy within each utterance or turn. Our hierarchical framework, the Actor-Critic Framework with a Hierarchical Structure (ArCHer), can also give rise to other RL methods. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks, attaining about 100x the sample efficiency of existing methods, while also improving with larger model capacity (up to the 7-billion-parameter scale that we tested on).
Abstract (translated)
A broad use case of large language models (LLMs) is in goal-directed decision-making tasks (or "agent" tasks), where an LLM must not only generate completions for a given prompt but also make intelligent decisions over a multi-turn interaction to accomplish a task (e.g., interacting with the web, using tools, or providing customer support). Reinforcement learning (RL) provides a general paradigm for addressing such agent tasks, but current RL methods for LLMs mainly focus on optimizing single-turn rewards. By construction, most single-turn RL methods cannot give LLMs the ability to intelligently seek information over multiple turns, perform credit assignment, or reason about their past actions, all of which are critical in agent tasks. This raises the question: how can we design effective and efficient multi-turn RL algorithms for LLMs? In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs while effectively handling multiple turns, long horizons, and delayed rewards. To do so, the framework adopts a hierarchical RL approach and runs two RL algorithms in parallel: a high-level, value-based RL algorithm that aggregates reward over utterances, and a low-level RL algorithm that uses this high-level value function to train a token-level policy within each utterance or turn. Our hierarchical framework, the Actor-Critic Framework with a Hierarchical Structure (ArCHer), can also give rise to other RL methods. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks, attaining roughly 100x the sample efficiency of existing methods, with performance also improving as model capacity grows (up to the 7-billion-parameter scale we tested).
URL
https://arxiv.org/abs/2402.19446
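To make the two-level structure described in the abstract concrete, here is a minimal, self-contained PyTorch sketch. It is an illustration based only on what the abstract states, not the paper's actual implementation: the names (UtteranceCritic, critic_td_loss, token_policy_loss), the specific loss forms (a target-network TD backup for an utterance-level Q, a V head regressed toward Q, and an advantage-weighted token-level policy gradient), and the random tensors standing in for LLM embeddings and log-probabilities are all assumptions made for illustration.

```python
# Illustrative sketch only: the class/function names, loss forms, and random
# stand-in tensors below are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 64  # assumed embedding size for state/utterance representations

class UtteranceCritic(nn.Module):
    """High-level critic over whole utterances: Q(state, utterance) and V(state)."""
    def __init__(self):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(2 * EMB, 128), nn.ReLU(), nn.Linear(128, 1))
        self.v = nn.Sequential(nn.Linear(EMB, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, state_emb, utt_emb):
        return self.q(torch.cat([state_emb, utt_emb], dim=-1)).squeeze(-1)

    def value(self, state_emb):
        return self.v(state_emb).squeeze(-1)

def critic_td_loss(critic, target_critic, batch, gamma=0.99):
    """Off-policy TD backup at the utterance level: one transition per turn,
    so reward is aggregated across turns rather than across tokens."""
    s, a, r, s_next, done = batch          # embeddings plus per-turn reward
    q = critic(s, a)
    with torch.no_grad():                  # bootstrap from a frozen target network
        target = r + gamma * (1.0 - done) * target_critic.value(s_next)
    v = critic.value(s)
    # Regressing V toward Q(s, a) for buffer actions is a simplification here.
    return F.mse_loss(q, target) + F.mse_loss(v, q.detach())

def token_policy_loss(token_logps, advantage):
    """Low-level update: policy gradient over the tokens of one utterance,
    weighted by the utterance-level advantage A = Q(s, a) - V(s)."""
    # token_logps: (B, T) log-probs of the sampled tokens; advantage: (B,)
    return -(advantage.detach().unsqueeze(-1) * token_logps).mean()

# Toy usage with random tensors standing in for LLM embeddings and log-probs.
critic, target_critic = UtteranceCritic(), UtteranceCritic()
target_critic.load_state_dict(critic.state_dict())
B, T = 8, 16
s, a, s_next = (torch.randn(B, EMB) for _ in range(3))
r, done = torch.randn(B), torch.zeros(B)
token_logps = torch.randn(B, T, requires_grad=True)  # placeholder for policy log-probs

critic_loss = critic_td_loss(critic, target_critic, (s, a, r, s_next, done))
advantage = critic(s, a) - critic.value(s)
policy_loss = token_policy_loss(token_logps, advantage)
(critic_loss + policy_loss).backward()
```

In a real agent setting, the state and utterance embeddings would come from the LLM's hidden states, the token log-probabilities from the policy being fine-tuned, and the transitions from a replay buffer of multi-turn interactions, which is what makes the high-level update off-policy.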