Paper Reading AI Learner

Decoupled Q-Chunking

2025-12-11 18:52:51
Qiyang Li, Seohong Park, Sergey Levine

Abstract

Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: the policy must output the entire action chunk open-loop, which can be sub-optimal in environments that require reactive policies, and which becomes difficult to model as the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action-chunking policies for long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: this http URL.
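The abstract only sketches the mechanism, but the core idea — approximating the maximum value achievable when a partial action chunk is completed, by optimistically backing up from the full chunked critic — can be illustrated with a toy. The snippet below is a loose sketch under our own assumptions (the toy critic, the suffix proposal distribution, and all names are ours, not the paper's code): it estimates a partial-chunk target by sampling suffix completions and taking the max of the chunked critic over them.

```python
import numpy as np

rng = np.random.default_rng(0)

H, K = 4, 2   # critic chunk length H > policy chunk length K (assumed values)
N = 64        # sampled suffix completions used to approximate the max

def q_chunked(state, chunk):
    """Toy stand-in for a learned chunked critic Q_H(s, a_1..a_H):
    highest when every action in the chunk matches the (scalar) state."""
    return -np.sum((chunk - state) ** 2, axis=-1)

def optimistic_partial_target(state, prefix):
    """Optimistic backup: approximate max over suffixes of
    Q_H(s, prefix ++ suffix) by sampling completions from a proposal
    (here, a Gaussian around the state) and taking the max."""
    suffixes = rng.normal(state, 1.0, size=(N, H - K))
    full = np.concatenate([np.tile(prefix, (N, 1)), suffixes], axis=1)
    return q_chunked(state, full).max()

s = 0.5
good_prefix = np.full(K, 0.5)  # matches the state; some completion scores well
bad_prefix = np.full(K, 3.0)   # poor prefix: no suffix can undo its penalty
print(optimistic_partial_target(s, good_prefix)
      > optimistic_partial_target(s, bad_prefix))  # -> True
```

A distilled partial-chunk critic would then be regressed toward such targets, so the K-step policy can be optimized against it without ever modeling full H-step chunks; in practice the max would be approximated by a learned backup (e.g. expectile regression) rather than explicit sampling.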


URL

https://arxiv.org/abs/2512.10926

PDF

https://arxiv.org/pdf/2512.10926.pdf

