Unified Visual Relationship Detection with Vision and Language Models

2023-03-16 00:06:28
Long Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

Abstract

This work focuses on training a single visual relationship detector that predicts over the union of label spaces from multiple datasets. Merging labels spanning different datasets can be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection, where second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection that leverages vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, in which similar relationships are optimized to lie close to each other for semantic unification. Our bottom-up design lets the model benefit from training on both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by a relative margin of 60%. More importantly, we show that our unified detector matches dataset-specific models in mAP and improves further as we scale up the model.
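The core idea the abstract describes, unifying label spaces from different datasets by scoring visual relationship embeddings against VLM text embeddings, can be illustrated with a minimal sketch. This is not the authors' code: it uses an off-the-shelf CLIP checkpoint as a stand-in for the paper's VLM, a few toy relationship labels, and a random vector in place of the visual embedding that UniVRD's decoder would produce for a detected object pair.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Union of relationship labels from two hypothetical datasets. Near-synonyms
# ("riding" vs. "rides") land close together in the text-embedding space,
# which is what makes merging inconsistent taxonomies tractable.
labels = [
    "person riding a horse",   # e.g., from an HOI dataset
    "person rides horse",      # e.g., from a scene-graph dataset
    "person holding a cup",
]

inputs = processor(text=labels, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)

# Stand-in for the visual embedding of a detected (subject, predicate, object)
# pair; a real detector would compute this from the image.
rel_emb = torch.nn.functional.normalize(torch.randn(1, text_emb.shape[-1]), dim=-1)

# Classify over the unified label space via cosine similarity.
scores = rel_emb @ text_emb.T
print(labels[scores.argmax().item()])
```

Because every dataset's labels are scored in the same embedding space, a single similarity-based head can cover the union of taxonomies instead of one classifier per dataset.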

URL

https://arxiv.org/abs/2303.08998

PDF

https://arxiv.org/pdf/2303.08998.pdf

