Paper Reading AI Learner

FSAG: Enhancing Human-to-Dexterous-Hand Finger-Specific Affordance Grounding via Diffusion Models

2026-01-13 06:08:56
Yifan Han, Pengfei Yi, Junyan Li, Hanqing Wang, Gaojing Zhang, Qi Peng Liu, Wenzhao Lian

Abstract

Dexterous grasp synthesis remains a central challenge: the high dimensionality and kinematic diversity of multi-fingered hands prevent direct transfer of algorithms developed for parallel-jaw grippers. Existing approaches typically depend on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials, which hinders scalability as new dexterous hand designs emerge. To this end, we propose a data-efficient framework that bypasses robot grasp data collection by exploiting the rich, object-centric semantic priors latent in pretrained generative diffusion models. Temporally aligned, fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. A kinematics-aware retargeting module then maps these affordance representations to diverse dexterous hands without per-hand retraining. The resulting system produces stable, functionally appropriate multi-contact grasps that remain reliably successful across common objects and tools, while exhibiting strong generalization across previously unseen object instances within a category, pose variations, and multiple hand embodiments. This work (i) introduces a semantic affordance extraction pipeline leveraging vision-language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without constructing hardware-specific grasp datasets, and (iii) establishes that a single depth modality suffices for high-performance grasp synthesis when coupled with foundation-model semantics. Our results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.
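The abstract describes fusing per-pixel grasp affordances with 3D scene geometry from depth images to obtain contact targets. The paper does not release code, so the sketch below is only a minimal illustration of one plausible version of that step, assuming a pinhole camera model: the top-k affordance pixels are back-projected through the camera intrinsics to 3D points in the camera frame. The function name `backproject_affordance` and the idea that the affordance extractor outputs a dense heatmap are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def backproject_affordance(heatmap, depth, K, k=3):
    """Back-project the k highest-affordance pixels to 3D contact targets.

    heatmap: (H, W) per-pixel affordance scores (assumed output format of
             the diffusion-based affordance extractor; hypothetical).
    depth:   (H, W) metric depth in meters, aligned with the heatmap.
    K:       (3, 3) pinhole camera intrinsics matrix.
    Returns an (k, 3) array of 3D points in the camera frame.
    """
    H, W = heatmap.shape
    # Indices of the k largest affordance scores in the flattened heatmap.
    flat = np.argsort(heatmap, axis=None)[::-1][:k]
    vs, us = np.unravel_index(flat, (H, W))
    z = depth[vs, us]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Standard pinhole back-projection: pixel (u, v) at depth z -> (x, y, z).
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

These camera-frame points would then be handed to the kinematics-aware retargeting module, which assigns them to fingertips of a specific hand; that mapping is hand-dependent and not sketched here.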


URL

https://arxiv.org/abs/2601.08246

PDF

https://arxiv.org/pdf/2601.08246.pdf

