Paper Reading AI Learner

ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

2025-12-11 18:59:46
Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv, Yang Jin, Shirun Tang, Chuan Wen, Cewu Lu

Abstract

Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental disparities in frequency and information content. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism that uses causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse, where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at this https URL.
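The abstract describes the slow-fast mechanism only at a high level. As a hedged illustration of the general idea — not the paper's implementation — one way to let asynchronous visual and force tokens share a single causal attention is to interleave them by timestamp and allow each token to attend only to tokens at or before its own time. The sampling rates, token layout, and function below are all our assumptions for the sketch:

```python
import numpy as np

def interleaved_causal_mask(vision_times, force_times):
    """Interleave slow (vision) and fast (force) tokens by timestamp.

    Returns sorted timestamps, per-token modality labels, and a boolean
    mask where mask[i, j] is True iff token i may attend to token j,
    i.e. token j's timestamp does not exceed token i's (causal attention
    across both modalities).
    """
    times = np.concatenate([vision_times, force_times])
    labels = ["v"] * len(vision_times) + ["f"] * len(force_times)
    order = np.argsort(times, kind="stable")  # stable: ties keep v-before-f order
    times = times[order]
    labels = [labels[k] for k in order]
    mask = times[None, :] <= times[:, None]   # attend to past/present only
    return times, labels, mask

# Illustrative rates (our choice): vision every 100 ms, force every 10 ms.
v_t = np.arange(0, 300, 100, dtype=float)   # 3 visual tokens
f_t = np.arange(0, 300, 10, dtype=float)    # 30 force tokens
times, labels, mask = interleaved_causal_mask(v_t, f_t)
```

Under such a mask, each fresh force token can condition on all earlier tokens of both modalities, which is one plausible way to support closed-loop adjustment at the force frequency while the sparser visual tokens anchor the action chunk.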

Abstract (translated)

Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental disparities in frequency and information content. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. To handle asynchronous visual and force signals, and to allow the policy to make closed-loop adjustments at the force frequency while preserving the temporal coherence of action chunks, we introduce Structural Slow-Fast Learning, a mechanism that uses causal attention to process both asynchronous streams simultaneously. Furthermore, to address modality collapse, where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, more physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks show that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at this https URL.
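The abstract states only that the virtual-target objective maps force feedback into the action space. A toy sketch can make that idea concrete; everything below (the admittance-style target formula, the compliance gain, the MSE loss) is a hypothetical stand-in for illustration, not the paper's actual formulation:

```python
import numpy as np

def virtual_target(commanded_pose, force, compliance=1e-3):
    """Hypothetical admittance-style mapping of a force reading into the
    action (pose) space: shift the commanded pose along the sensed force
    by a compliance gain. This construction is our assumption."""
    return commanded_pose + compliance * force

def representation_regularizer(force_prediction, commanded_pose, force):
    """Auxiliary MSE between a force-derived prediction from the policy
    and the physics-grounded virtual target, instead of raw force values."""
    target = virtual_target(commanded_pose, force)
    return float(np.mean((force_prediction - target) ** 2))

pose = np.zeros(3)
force = np.array([0.0, 0.0, -5.0])            # 5 N push along -z
loss = representation_regularizer(np.zeros(3), pose, force)
```

The point of such an objective, as the abstract argues, is that supervising in action space ties the force signal to what the robot should do about the contact, rather than merely asking the network to reproduce raw force readings.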

URL

https://arxiv.org/abs/2512.10946

PDF

https://arxiv.org/pdf/2512.10946.pdf

