Paper Reading AI Learner

GRIM: Task-Oriented Grasping with Conditioning on Generative Examples

2025-06-18 16:35:47
Shailesh, Alok Raj, Nayan Kumar, Priya Shukla, Andrew Melnik, Micheal Beetz, Gora Chand Nandi

Abstract

Task-Oriented Grasping (TOG) presents a significant challenge, requiring a nuanced understanding of task semantics, object affordances, and the functional constraints dictating how an object should be grasped for a specific task. To address these challenges, we introduce GRIM (Grasp Re-alignment via Iterative Matching), a novel training-free framework for task-oriented grasping. Initially, a coarse alignment strategy is developed using a combination of geometric cues and principal component analysis (PCA)-reduced DINO features for similarity scoring. Subsequently, the full grasp pose associated with the retrieved memory instance is transferred to the aligned scene object and further refined against a set of task-agnostic, geometrically stable grasps generated for the scene object, prioritizing task compatibility. In contrast to existing learning-based methods, GRIM demonstrates strong generalization capabilities, achieving robust performance with only a small number of conditioning examples.

Abstract (translated)

任务导向抓取(TOG)提出了一个重大挑战,需要对任务语义、物体功效以及决定特定任务中如何抓取物体的功能约束有细微的理解。为了解决这些挑战,我们引入了GRIM(通过迭代匹配进行抓取再定位),这是一种新颖的无需训练的任务导向抓取框架。最初,使用几何线索和主成分分析(PCA)降维后的DINO特征相结合的方法开发了一种粗略对齐策略来进行相似度评分。随后,将检索到的记忆实例相关的完整抓取姿态转移到与场景对象对齐的对象上,并进一步针对为该场景对象生成的一组任务无关、几何稳定的抓取进行细化,优先考虑任务兼容性。与现有的基于学习的方法不同,GRIM展示了强大的泛化能力,在仅有少量条件示例的情况下仍能实现稳健的性能。

URL

https://arxiv.org/abs/2506.15607

PDF

https://arxiv.org/pdf/2506.15607.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot