Abstract
Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in domains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e., combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective, $\text{BYOL-}\gamma$ augmented GCBC, which not only theoretically approximates the successor representation in the finite MDP case without contrastive samples or TD learning, but also results in competitive empirical performance across a suite of challenging tasks requiring combinatorial generalization.
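As a concrete illustration of the successor-representation claim above (not code from the paper), the sketch below computes the successor representation of a small finite MDP under a fixed policy in NumPy; the transition matrix `P` and discount `gamma` are hypothetical placeholders, and the closed form $M = (I - \gamma P)^{-1}$ is the standard definition the abstract alludes to.

```python
import numpy as np

# Minimal sketch (not from the paper): the successor representation (SR) of a
# finite MDP under a fixed policy. P is a hypothetical policy-induced
# state-to-state transition matrix; gamma is the discount factor.
n_states, gamma = 4, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)  # make each row a valid distribution

# Closed form: M = sum_t gamma^t P^t = (I - gamma * P)^{-1}.
# Row M[s] holds the expected discounted visitation of every state when
# starting from s -- the "distribution of future states" in the abstract.
M = np.linalg.inv(np.eye(n_states) - gamma * P)

# Sanity check: a truncated power series converges to the same matrix.
M_series = sum(gamma**t * np.linalg.matrix_power(P, t) for t in range(500))
assert np.allclose(M, M_series, atol=1e-8)
print(M)
```

The proposed $\text{BYOL-}\gamma$ objective approximates this quantity in latent space, presumably via BYOL-style bootstrapped prediction rather than by materializing $P$; the matrix form here only serves to make the target of that approximation concrete.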
URL
https://arxiv.org/abs/2506.10137