Abstract
In this paper, we explore the potential of Vision-Language Models (VLMs), specifically CLIP, for predicting visual object relationships, which involves translating visual features from images into language-based relations. Current state-of-the-art methods use complex graphical models that combine language cues and visual features to address this challenge. We hypothesize that the strong language priors in CLIP embeddings can simplify these graphical models, paving the way for a simpler approach. We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding over subject, object, and union-box embeddings from a scene. We systematically explore the design of CLIP-based subject, object, and union-box representations within the UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate Estimation). CREPE uses text-based representations for all three bounding boxes and introduces a novel contrastive training strategy to automatically infer the text prompt for the union box. Our approach achieves state-of-the-art performance in predicate estimation on the Visual Genome benchmark (mR@5 of 27.79 and mR@20 of 31.95), a 15.3\% gain over the recent state of the art at mR@20. This work demonstrates CLIP's effectiveness in object relation prediction and encourages further research on VLMs in this challenging domain.
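The translational-embedding idea behind UVTransE can be sketched in a few lines: the predicate is modeled as the vector that remains when the subject and object features are subtracted from the union-box feature. The toy vectors below are placeholders; the actual method would use CLIP-derived representations of each box.

```python
import numpy as np

def uvtranse_relation_embedding(subj: np.ndarray,
                                obj: np.ndarray,
                                union: np.ndarray) -> np.ndarray:
    """Translational relation embedding: r = u - s - o.

    The predicate vector is what the union-box feature contains
    beyond the subject and object features themselves.
    """
    return union - subj - obj

# Hypothetical 2-D toy features; a real system would use
# high-dimensional CLIP embeddings of each bounding box.
s = np.array([1.0, 0.0])   # subject-box feature
o = np.array([0.0, 1.0])   # object-box feature
u = np.array([2.0, 2.0])   # union-box feature

r = uvtranse_relation_embedding(s, o, u)
print(r)  # [1. 1.]
```

The predicted relation vector `r` would then be compared (e.g. by nearest neighbor) against embeddings of candidate predicate phrases.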
URL
https://arxiv.org/abs/2307.04838