Paper Reading AI Learner

DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos

2026-02-10 18:59:02
Juncheng Mu, Sizhe Yang, Yiming Bao, Hojin Bae, Tianming Wei, Linning Xu, Boyi Li, Huazhe Xu, Jiangmiao Pang

Abstract

Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos, as a direct carrier of manipulation knowledge, offer significant potential for scaling up robot learning. However, the substantial embodiment gap between human hands and robotic dexterous hands makes direct pretraining from human videos extremely challenging. To bridge this gap and unleash the potential of large-scale human manipulation video data, we propose DexImit, an automated framework that converts monocular human manipulation videos into physically plausible robot data, without any additional information. DexImit employs a four-stage generation pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale; (2) performing subtask decomposition and bimanual scheduling; (3) synthesizing robot trajectories consistent with the demonstrated interactions; (4) comprehensive data augmentation for zero-shot real-world deployment. Building on these designs, DexImit can generate large-scale robot data based on human videos, either from the Internet or video generation models. DexImit is capable of handling diverse manipulation tasks, including tool use (e.g., cutting an apple), long-horizon tasks (e.g., making a beverage), and fine-grained manipulations (e.g., stacking cups).
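The four-stage pipeline described above can be sketched as a chain of processing steps. This is a minimal illustrative outline only: the paper's abstract names the stages but not an API, so every function name, field name, and return shape below is an assumption, not the authors' actual implementation.

```python
# Hypothetical sketch of DexImit's four-stage pipeline as described in the
# abstract. All identifiers and data shapes are illustrative assumptions.

def reconstruct_interactions(video_frames):
    # Stage 1: recover hand-object interactions at near-metric scale
    # from a monocular video (placeholder: one pose record per frame).
    return [{"frame": i, "hand_pose": None, "object_pose": None}
            for i, _ in enumerate(video_frames)]

def decompose_subtasks(interactions):
    # Stage 2: split the demonstration into subtasks and assign each
    # to a hand (bimanual scheduling); a naive halfway split here.
    mid = len(interactions) // 2
    return [{"hand": "left", "segment": interactions[:mid]},
            {"hand": "right", "segment": interactions[mid:]}]

def synthesize_trajectories(subtasks):
    # Stage 3: retarget each subtask into a robot trajectory that stays
    # consistent with the demonstrated interaction.
    return [{"hand": st["hand"], "waypoints": len(st["segment"])}
            for st in subtasks]

def augment(trajectories, num_variants=3):
    # Stage 4: augment the synthesized data (e.g. perturbed variants)
    # to support zero-shot real-world deployment.
    return [dict(traj, variant=v)
            for traj in trajectories for v in range(num_variants)]

def deximit_pipeline(video_frames):
    # End-to-end: monocular human video in, augmented robot data out.
    return augment(synthesize_trajectories(
        decompose_subtasks(reconstruct_interactions(video_frames))))
```

The chained structure reflects the abstract's claim that the framework is fully automated: each stage consumes only the previous stage's output, with no extra annotations required.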

URL

https://arxiv.org/abs/2602.10105

PDF

https://arxiv.org/pdf/2602.10105.pdf
