Abstract
Dexterous grasp synthesis remains a central challenge: the high dimensionality and kinematic diversity of multi-fingered hands prevent the direct transfer of algorithms developed for parallel-jaw grippers. Existing approaches typically depend on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials, which hinders scalability as new dexterous hand designs emerge. To address this, we propose a data-efficient framework that bypasses robot grasp data collection by exploiting the rich, object-centric semantic priors latent in pretrained generative diffusion models. Temporally aligned, fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. A kinematics-aware retargeting module then maps these affordance representations to diverse dexterous hands without per-hand retraining. The resulting system produces stable, functionally appropriate multi-contact grasps that succeed reliably on common objects and tools, and it generalizes strongly across previously unseen object instances within a category, pose variations, and multiple hand embodiments. This work (i) introduces a semantic affordance extraction pipeline that leverages vision-language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without constructing hardware-specific grasp datasets, and (iii) establishes that a single depth modality suffices for high-performance grasp synthesis when coupled with foundation-model semantics. Our results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.
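To make the three-stage structure the abstract describes concrete (affordance extraction from video, fusion with depth geometry, kinematic retargeting), the following Python sketch shows how the stages might compose. It is a minimal hypothetical skeleton only: the names and signatures (ContactTarget, extract_affordances, fuse_with_geometry, retarget) are our own illustrative assumptions, not the paper's API, and the function bodies stand in for the learned components.

```python
# Hypothetical sketch of the pipeline stages described in the abstract.
# All names, signatures, and logic are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class ContactTarget:
    """A semantically grounded contact point on the object surface."""
    position: np.ndarray   # (3,) point in the camera frame
    semantic_label: str    # affordance region, e.g. "handle", "rim"

def extract_affordances(video_frames: list) -> list:
    """Stage 1 (placeholder): per-frame affordance labels that a pretrained
    generative / vision-language model would extract from a human demo."""
    return ["handle"] * len(video_frames)

def fuse_with_geometry(labels: list, depth: np.ndarray,
                       intrinsics: np.ndarray) -> list:
    """Stage 2 (placeholder): back-project a depth pixel to 3D and attach
    the semantic label to form a candidate contact target."""
    h, w = depth.shape
    v, u = h // 2, w // 2                      # one pixel as a stand-in
    z = float(depth[v, u])
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    p = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
    return [ContactTarget(position=p, semantic_label=labels[-1])]

def retarget(contacts: list, fingertip_count: int) -> np.ndarray:
    """Stage 3 (placeholder): map contact targets to fingertip goals for a
    specific hand; a real system would solve hand IK under joint limits."""
    goals = [c.position for c in contacts] * fingertip_count
    return np.stack(goals[:fingertip_count])

if __name__ == "__main__":
    frames = [np.zeros((64, 64, 3))]           # one dummy video frame
    depth = np.full((64, 64), 0.5)             # flat scene 0.5 m away
    K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
    targets = fuse_with_geometry(extract_affordances(frames), depth, K)
    print(retarget(targets, fingertip_count=4))
```

Note how the hand-specific step is confined to retarget: this mirrors the abstract's claim that only the final mapping depends on the embodiment, which is what allows new hands to be served without collecting new grasp data.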
URL
https://arxiv.org/abs/2601.08246