Abstract
Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but this self-bootstrapping mechanism is prone to bootstrapping bias, where errors in the value targets accumulate across steps and yield biased value estimates. Recent work has proposed chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backups. However, extracting policies from chunked critics is challenging: the policy must output the entire action chunk open-loop, which can be sub-optimal in environments that require reactivity, and which becomes difficult to model as the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action-chunking policies over long chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: this http URL.
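The optimistic backup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `chunked_critic` is a hypothetical stand-in for a learned critic Q(s, a_{1:H}) over full chunks, and the partial-chunk target is approximated by sampling candidate completions and taking the best one. The names, shapes, and sampling scheme are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

H, K, ACT_DIM = 4, 2, 3  # critic chunk length, policy chunk length, action dim

def chunked_critic(state, chunk):
    # Stand-in for a learned critic Q(s, a_{1:H}); here a fixed negative
    # quadratic, so values are <= 0 and maximized when actions match the state.
    return -float(np.sum((chunk - state) ** 2))

def optimistic_partial_value(state, partial, n_samples=256):
    # Optimistic backup: approximate max over completions a_{K+1:H} of
    # Q(s, [partial; completion]) by sampling completions and keeping the best.
    # This value serves as the distillation target for the partial-chunk critic.
    best = -np.inf
    for _ in range(n_samples):
        completion = rng.normal(size=(H - K, ACT_DIM))
        full = np.concatenate([partial, completion], axis=0)
        best = max(best, chunked_critic(state, full))
    return best

# The policy over K-step chunks would then be trained against this target.
state = np.zeros(ACT_DIM)
partial = np.zeros((K, ACT_DIM))
v = optimistic_partial_value(state, partial)
```

In the paper the distilled critic is presumably a learned network regressed onto such optimistic targets; the sampling-based maximization here is only one possible way to realize the "max over completions" that the abstract describes.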
URL
https://arxiv.org/abs/2512.10926