Abstract
Despite the success of Random Network Distillation (RND) in various domains, it was shown to be insufficiently discriminative to serve as an uncertainty estimator for penalizing out-of-distribution actions in offline reinforcement learning. In this paper, we revisit these results and show that, with a naive choice of conditioning for the RND prior, it becomes infeasible for the actor to effectively minimize the anti-exploration bonus, and that discriminativity is not an issue. We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM), resulting in a simple and efficient ensemble-free algorithm based on Soft Actor-Critic. We evaluate it on the D4RL benchmark, showing that it achieves performance comparable to ensemble-based methods and outperforms ensemble-free approaches by a wide margin.
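To make the FiLM mechanism mentioned above concrete, here is a minimal NumPy sketch of Feature-wise Linear Modulation: a conditioning input (here, an action batch) is mapped to per-feature scales and shifts that modulate a feature vector (here, a state embedding). All names, shapes, and the linear conditioning maps are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, gamma, beta):
    # Feature-wise Linear Modulation: per-feature scale and shift,
    # with gamma and beta produced from the conditioning input.
    return gamma * features + beta

# Hypothetical shapes: batch of 4, 8 state features, 3 action dims.
h = rng.normal(size=(4, 8))            # state embeddings (to be modulated)
a = rng.normal(size=(4, 3))            # actions (conditioning input)
W_gamma = rng.normal(size=(3, 8))      # linear map: action -> scales
W_beta = rng.normal(size=(3, 8))       # linear map: action -> shifts

gamma, beta = a @ W_gamma, a @ W_beta  # FiLM parameters from the action
out = film(h, gamma, beta)
print(out.shape)  # (4, 8): same shape as the modulated features
```

In an RND prior, such a FiLM layer lets the action rescale and shift the prior network's intermediate features rather than simply being concatenated to the state, which is the kind of conditioning change the abstract credits with restoring discriminativity.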
URL
https://arxiv.org/abs/2301.13616