Paper Reading AI Learner

What does CLIP know about peeling a banana?

2024-04-18 09:06:05
Claudia Cuttano, Gabriele Rosi, Gabriele Trivigno, Giuseppe Averta

Abstract

Humans show an innate capability to identify tools that support specific actions. The association between object parts and the actions they facilitate is usually called affordance. Being able to segment object parts according to the tasks they afford is crucial for enabling intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support only a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP, which overcomes these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning in models.
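To make the zero-shot idea above concrete, the sketch below probes how much affordance signal CLIP's pre-trained features already carry: it scores a free-form action prompt (e.g. "peel") against CLIP's patch-level image features and upsamples the similarities into a coarse heat map. This is only a minimal illustration, not the AffordanceCLIP architecture from the paper; the backbone choice, the image path, and the trick of projecting every patch token through CLIP's visual projection are assumptions of this sketch.

```python
# Minimal sketch: zero-shot affordance heat map from CLIP patch features.
# NOT the paper's method; it only shows that CLIP features respond to
# open-vocabulary action prompts without any affordance-specific training.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"          # assumed backbone choice
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("banana.jpg")                     # placeholder image path
prompt = "peel"                                      # any free-form action prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Patch-level visual tokens (CLS token dropped), mapped into the shared
    # image-text space. Projecting every patch token is a post-hoc trick,
    # not something CLIP was trained to support.
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patches = vision_out.last_hidden_state[:, 1:, :]            # [1, N, hidden]
    patches = model.vision_model.post_layernorm(patches)
    patches = model.visual_projection(patches)                  # [1, N, dim]
    patches = F.normalize(patches, dim=-1)

    # Text embedding of the action prompt in the same space.
    text = model.get_text_features(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"])
    text = F.normalize(text, dim=-1)                            # [1, dim]

    # Cosine similarity per patch -> coarse spatial heat map.
    sim = (patches @ text.unsqueeze(-1)).squeeze(-1)            # [1, N]
    side = int(sim.shape[1] ** 0.5)                             # 7 for ViT-B/32 at 224 px
    heatmap = sim.reshape(1, 1, side, side)
    heatmap = F.interpolate(heatmap, size=image.size[::-1],
                            mode="bilinear", align_corners=False)[0, 0]

print("heat map shape:", tuple(heatmap.shape), "max response:", heatmap.max().item())
```

The resulting map is noisy at ViT patch resolution, which is consistent with the paper's point: the raw model retains useful affordance cues, and a small number of trained parameters (as in AffordanceCLIP) is what turns them into sharp segmentations.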

URL

https://arxiv.org/abs/2404.12015

PDF

https://arxiv.org/pdf/2404.12015.pdf

