Abstract
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
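The abstract only names the two early-experience strategies; as a rough illustration of how reward-free supervision of this kind might be assembled, the sketch below builds supervised fine-tuning pairs from agent-proposed actions and the future states they produce. This is a minimal reading of the idea, not the paper's actual pipeline: the helper names (`propose_actions`, `env_step`, `reflect`) and the prompt formats are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SFTExample:
    """One supervised fine-tuning pair built from early-experience data."""
    prompt: str
    target: str


def build_world_modeling_examples(
    expert_states: List[str],
    propose_actions: Callable[[str], List[str]],  # hypothetical: current policy proposes actions
    env_step: Callable[[str, str], str],          # hypothetical: environment returns next state
) -> List[SFTExample]:
    """Implicit world modeling (sketch): at each expert-visited state, let the
    policy propose alternative actions, execute them, and use the observed next
    state as the supervision target -- no reward signal is required."""
    examples: List[SFTExample] = []
    for state in expert_states:
        for action in propose_actions(state):
            next_state = env_step(state, action)
            examples.append(
                SFTExample(
                    prompt=(
                        f"Current state:\n{state}\n\nAction taken:\n{action}\n\n"
                        "Predict the resulting next state."
                    ),
                    target=next_state,
                )
            )
    return examples


def build_self_reflection_examples(
    expert_trajectory: List[Dict[str, str]],      # [{"state": ..., "expert_action": ...}, ...]
    propose_actions: Callable[[str], List[str]],
    env_step: Callable[[str, str], str],
    reflect: Callable[[str, str, List[str], List[str]], str],  # hypothetical rationale generator
) -> List[SFTExample]:
    """Self-reflection (sketch): branch off each expert state with the agent's
    own alternative actions, observe where they lead, and have the model write a
    rationale for why the demonstrated action is preferable; the rationale plus
    the expert action becomes the training target."""
    examples: List[SFTExample] = []
    for record in expert_trajectory:
        state, expert_action = record["state"], record["expert_action"]
        alt_actions = propose_actions(state)
        alt_outcomes = [env_step(state, a) for a in alt_actions]
        rationale = reflect(state, expert_action, alt_actions, alt_outcomes)
        examples.append(
            SFTExample(
                prompt=f"Current state:\n{state}\n\nChoose the next action and explain why.",
                target=f"{rationale}\n\nAction: {expert_action}",
            )
        )
    return examples
```

In both sketches the supervision comes entirely from states the agent itself reaches after acting, which is what the abstract positions between expert-only imitation and reward-driven reinforcement learning.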
URL
https://arxiv.org/abs/2510.08558