ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation

2025-04-17 17:59:56
Hongyu Li, James Akl, Srinath Sridhar, Tye Brady, Taskin Padir

Abstract

Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to purely visual models, our approach overcomes drastic failure modes that arise while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
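
The spring-mass analogy in the abstract can be made concrete with a small numerical sketch. The Python snippet below is illustrative only, not the authors' implementation: it assumes the object is approximated by a sphere of known radius, that tactile sensing yields 3D contact points which should lie on the object surface (attractive springs), and that proprioception yields points on the gripper body which the estimated object must not penetrate (repulsive springs). The function name refine_position and all constants are hypothetical.

```python
# Minimal sketch under stated assumptions (not the authors' implementation):
# the object is approximated by a sphere of known radius; tactile contacts
# should lie on the sphere surface (attractive springs); gripper-body points
# from proprioception must stay outside the sphere (repulsive springs).
# Gradient descent on the resulting potential energy refines a visually
# estimated object position.
import numpy as np

def refine_position(p0, contacts, gripper_pts, radius,
                    k_attract=1.0, k_repulse=5.0, lr=0.05, iters=300):
    """Refine an object-center estimate p0 (shape (3,)) by descending a
    spring-like energy built from tactile and proprioceptive constraints."""
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(iters):
        grad = np.zeros(3)
        # Attractive springs: each contact c should satisfy ||p - c|| == radius.
        # Energy term: k_attract * (||p - c|| - radius)^2.
        for c in contacts:
            d = p - c
            dist = np.linalg.norm(d) + 1e-9
            grad += 2.0 * k_attract * (dist - radius) * d / dist
        # Repulsive springs: each gripper point g must satisfy ||p - g|| >= radius.
        # Penalized only when penetrating: k_repulse * max(0, radius - ||p - g||)^2.
        for g in gripper_pts:
            d = p - g
            dist = np.linalg.norm(d) + 1e-9
            penetration = radius - dist
            if penetration > 0.0:
                grad += -2.0 * k_repulse * penetration * d / dist
        p -= lr * grad
    return p

# Toy usage: two antipodal fingertip contacts on a 5 cm-radius sphere pull a
# slightly-off visual estimate back toward the true center at the origin.
contacts = [np.array([0.05, 0.0, 0.0]), np.array([-0.05, 0.0, 0.0])]
gripper_pts = [np.array([0.0, 0.08, 0.0])]  # e.g. a point on the palm
print(refine_position([0.01, 0.015, 0.0], contacts, gripper_pts, radius=0.05))
```

The actual framework performs feasibility checking and test-time optimization over the full 6D pose produced by a visual backbone such as FoundationPose; this sketch restricts itself to 3D position and a spherical object only to keep the example short.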

URL

https://arxiv.org/abs/2504.13179

PDF

https://arxiv.org/pdf/2504.13179.pdf

