CREPE: Learnable Prompting With CLIP Improves Visual Relationship Prediction

2023-07-10 18:15:03
Rakshith Subramanyam, T. S. Jayram, Rushil Anirudh, Jayaraman J. Thiagarajan

Abstract

In this paper, we explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships, which involves interpreting visual features from images as language-based relations. Current state-of-the-art methods use complex graphical models that combine language cues and visual features to address this challenge. We hypothesize that the strong language priors in CLIP embeddings can simplify these graphical models, paving the way for a simpler approach. We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding over subject, object, and union-box embeddings from a scene. We systematically explore the design of CLIP-based subject, object, and union-box representations within the UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate Estimation). CREPE utilizes text-based representations for all three bounding boxes and introduces a novel contrastive training strategy to automatically infer the text prompt for the union box. Our approach achieves state-of-the-art performance in predicate estimation on the Visual Genome benchmark, with mR@5 of 27.79 and mR@20 of 31.95, a 15.3% gain over the recent state of the art at mR@20. This work demonstrates CLIP's effectiveness in object relation prediction and encourages further research on VLMs in this challenging domain.
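For readers unfamiliar with the UVTransE formulation the abstract builds on, the sketch below illustrates the core translational idea under stated assumptions: the predicate embedding is recovered as the union-box embedding minus the subject and object embeddings, and candidate predicates are scored by similarity to text embeddings. All names here (the toy tensors, the predicate list) are illustrative stand-ins for CLIP features, not the authors' code; CREPE's actual contribution (text-based box representations and contrastively learned union-box prompts) goes beyond this minimal picture.

    # Minimal sketch of a UVTransE-style translational relation embedding,
    # as referenced in the abstract. Random tensors stand in for CLIP
    # embeddings of the subject, object, and union boxes; in CREPE these
    # come from CLIP-based (text) representations. Illustrative only.
    import torch
    import torch.nn.functional as F

    dim = 512  # CLIP ViT-B/32 embedding size

    # Placeholder embeddings for the three boxes of one (subject, object) pair.
    f_subject = F.normalize(torch.randn(dim), dim=0)
    f_object = F.normalize(torch.randn(dim), dim=0)
    f_union = F.normalize(torch.randn(dim), dim=0)

    # UVTransE models the predicate as a translation in embedding space,
    # roughly f_union ≈ f_subject + predicate + f_object, so:
    relation = f_union - f_subject - f_object

    # Score candidate predicates against text embeddings (random stand-ins;
    # CREPE would embed learned prompts with CLIP's text encoder).
    predicates = ["on", "holding", "riding", "next to"]
    text_embeds = F.normalize(torch.randn(len(predicates), dim), dim=1)

    scores = text_embeds @ F.normalize(relation, dim=0)
    print(predicates[scores.argmax().item()], scores.softmax(dim=0))

A real pipeline would replace the random tensors with CLIP image or text embeddings of the cropped subject, object, and union regions, and train the predicate text embeddings (or their prompts, as in CREPE) contrastively so that the correct predicate scores highest.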

URL

https://arxiv.org/abs/2307.04838

PDF

https://arxiv.org/pdf/2307.04838.pdf
