Unified Visual Relationship Detection with Vision and Language Models

2023-03-16 00:06:28
Long Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

Abstract

This work focuses on training a single visual relationship detector that predicts over the union of label spaces from multiple datasets. Merging labels spanning different datasets can be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection, where second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection that leverages vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, in which similar relationships are optimized to lie close to each other for semantic unification. Our bottom-up design lets the model benefit from training on both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by a relative margin of 60%. More importantly, we show that our unified detector matches dataset-specific models in mAP and improves further as we scale up the model.
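The core idea the abstract describes, unifying label spaces from different datasets by scoring visual relationship embeddings against VLM text embeddings, can be illustrated with a minimal sketch. This is not the authors' code: it uses an off-the-shelf CLIP checkpoint as a stand-in for the paper's VLM, a few toy relationship labels, and a random vector in place of the visual embedding that UniVRD's decoder would produce for a detected object pair.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Union of relationship labels from two hypothetical datasets. Near-synonyms
# ("riding" vs. "rides") land close together in the text-embedding space,
# which is what makes merging inconsistent taxonomies tractable.
labels = [
    "person riding a horse",   # e.g., from an HOI dataset
    "person rides horse",      # e.g., from a scene-graph dataset
    "person holding a cup",
]

inputs = processor(text=labels, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)

# Stand-in for the visual embedding of a detected (subject, predicate, object)
# pair; a real detector would compute this from the image.
rel_emb = torch.nn.functional.normalize(torch.randn(1, text_emb.shape[-1]), dim=-1)

# Classify over the unified label space via cosine similarity.
scores = rel_emb @ text_emb.T
print(labels[scores.argmax().item()])
```

Because every dataset's labels are scored in the same embedding space, a single similarity-based head can cover the union of taxonomies instead of one classifier per dataset.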

URL

https://arxiv.org/abs/2303.08998

PDF

https://arxiv.org/pdf/2303.08998.pdf

