Paper Reading AI Learner

Single-Stage Visual Relationship Learning using Conditional Queries

2023-06-09 06:02:01
Alakh Desai, Tz-Ying Wu, Subarna Tripathi, Nuno Vasconcelos

Abstract

Research in scene graph generation (SGG) usually considers two-stage models, that is, detecting a set of entities, followed by combining them and labeling all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead, and typically hinders end-to-end optimization. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR, a set-based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely TraCQ, with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well, which leads to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset, while remaining capable of end-to-end training and faster inference.
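The abstract describes a DETR-style encoder-decoder in which conditional queries derived from one decoder's output drive a second decoder, so that predicate and entity predictions are decoupled rather than learned as one multi-task head. The sketch below illustrates that general idea only; it is not the authors' implementation, and the dimensions, layer counts, head names, and the particular predicate-then-entity conditioning order are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConditionalQuerySGGSketch(nn.Module):
    """Minimal sketch of a DETR-style decoder pair with conditional queries.

    Hypothetical illustration: a predicate decoder attends to image features
    using learned queries (as in DETR), and an entity decoder is conditioned
    on the predicate representations to predict subject/object labels.
    """

    def __init__(self, d_model=64, num_queries=10,
                 num_predicates=50, num_entities=150):
        super().__init__()
        # Learned queries, analogous to DETR's object queries.
        self.pred_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        # First decoder: predicate representations from image features.
        self.pred_decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Conditional queries for the second decoder are projected from
        # the first decoder's output, tying entities to predicates.
        self.cond_proj = nn.Linear(d_model, d_model)
        self.ent_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.predicate_head = nn.Linear(d_model, num_predicates)
        self.subject_head = nn.Linear(d_model, num_entities)
        self.object_head = nn.Linear(d_model, num_entities)

    def forward(self, img_feats):
        # img_feats: (batch, num_tokens, d_model) encoder output.
        b = img_feats.size(0)
        q = self.pred_queries.weight.unsqueeze(0).expand(b, -1, -1)
        pred_repr = self.pred_decoder(q, img_feats)
        cond_q = self.cond_proj(pred_repr)          # conditional queries
        ent_repr = self.ent_decoder(cond_q, img_feats)
        # One triplet (subject, predicate, object) per query.
        return (self.subject_head(ent_repr),
                self.predicate_head(pred_repr),
                self.object_head(ent_repr))

model = ConditionalQuerySGGSketch()
feats = torch.randn(2, 49, 64)  # e.g. a 7x7 feature map flattened to tokens
subj_logits, pred_logits, obj_logits = model(feats)
```

Each of the ten queries yields one candidate triplet, so the model predicts a set of relationships in a single shot rather than scoring all entity pairs, which is the combinatorial distribution the paper says it avoids.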

URL

https://arxiv.org/abs/2306.05689

PDF

https://arxiv.org/pdf/2306.05689.pdf
