Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models

2023-05-21 14:40:48
Lin Li, Jun Xiao, Guikun Chen, Jian Shao, Yueting Zhuang, Long Chen

Abstract

Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools for zero-shot visual recognition. Visual relation detection (VRD) is a typical such task: it identifies relationship (or interaction) types between object pairs within an image. However, naively applying CLIP with the prevalent class-based prompts to zero-shot VRD has several weaknesses: it struggles to distinguish between fine-grained relation types, and it neglects the essential spatial information of the two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. It then leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse the different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for the different visual cues. Extensive experiments on four VRD benchmarks demonstrate the effectiveness and interpretability of RECODE.
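As a rough illustration of the idea (not the authors' released code), the sketch below scores candidate predicates with off-the-shelf CLIP using composite description prompts: each predicate gets descriptions for its subject, object, and spatial components, each component is matched against the corresponding image region, and the per-component similarities are fused with weights. All cue texts and weights here are hypothetical placeholders standing in for the LLM-generated cues and chain-of-thought weights described in the abstract, and the helper names are assumptions for this example.

```python
# Minimal sketch of composite-cue zero-shot VRD scoring with CLIP.
# Assumes subject/object crops and a union-box crop are already extracted;
# cue texts and weights below are illustrative, not the paper's outputs.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated visual cues: component -> {predicate: [descriptions]}
cues = {
    "subject": {
        "riding":   ["a person sitting astride something, legs on both sides"],
        "carrying": ["a person holding something up with their arms"],
    },
    "object": {
        "riding":   ["an object with a seat or saddle"],
        "carrying": ["an object small enough to be held"],
    },
    "spatial": {
        "riding":   ["the person is directly above the object"],
        "carrying": ["the object is beside or in front of the person"],
    },
}
# Hypothetical per-component weights (RECODE obtains these by
# chain-of-thought prompting of an LLM; hard-coded here).
weights = {"subject": 0.4, "object": 0.3, "spatial": 0.3}

def component_score(region: Image.Image, cue_texts) -> float:
    """Average CLIP similarity between one image region and its cue texts."""
    image = preprocess(region).unsqueeze(0).to(device)
    tokens = clip.tokenize(cue_texts).to(device)
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(tokens)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).squeeze(0).mean().item()

def score_predicates(subject_crop, object_crop, union_crop):
    """Fuse per-component similarities into one score per predicate."""
    regions = {"subject": subject_crop, "object": object_crop, "spatial": union_crop}
    scores = {}
    for pred in cues["subject"]:
        scores[pred] = sum(
            weights[comp] * component_score(regions[comp], cues[comp][pred])
            for comp in ("subject", "object", "spatial")
        )
    return max(scores, key=scores.get), scores
```

Averaging over several descriptions per component follows the general description-prompt recipe; details such as how the spatial region is constructed and how the LLM is prompted for cues and weights follow the paper itself.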


URL

https://arxiv.org/abs/2305.12476

PDF

https://arxiv.org/pdf/2305.12476.pdf

