EGTR: Extracting Graph from Transformer for Scene Graph Generation

2024-04-02 16:20:02
Jinbae Im, JeongYeon Nam, Nokyung Park, Hyungmin Lee, Seunghyun Park

Abstract

Scene Graph Generation (SGG) is the challenging task of detecting objects and predicting the relationships between them. Since the introduction of DETR, one-stage SGG models built on a one-stage object detector have been actively studied. However, these models rely on complex modeling to predict relationships between objects, and they neglect the inherent relationships between object queries that are learned in the multi-head self-attention of the object detector. We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder. By fully utilizing these self-attention by-products, the relation graph can be extracted effectively with a shallow relation extraction head. Because relation extraction depends on object detection, we propose a novel relation smoothing technique that adjusts the relation label adaptively according to the quality of the detected objects. With relation smoothing, the model is trained on a continuous curriculum that focuses on the object detection task at the beginning of training and shifts to multi-task learning as object detection performance gradually improves. Furthermore, we propose a connectivity prediction task, which predicts whether a relation exists between an object pair, as an auxiliary task for relation extraction. We demonstrate the effectiveness and efficiency of our method on the Visual Genome and Open Images V6 datasets. Our code is publicly available.
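
The three ideas in the abstract (pairwise scores from self-attention queries and keys, relation labels softened by detection quality, and a connectivity target) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the plain scaled dot product used in place of a learned relation head, and the rule that multiplies the subject and object quality scores are all assumptions made for illustration only.

```python
import math


def pairwise_scores(queries, keys):
    """Scaled dot product between every subject query and object key.

    Assumption: a simple stand-in for the paper's shallow relation head,
    which the abstract does not specify. Returns an N x N list.
    """
    d = len(queries[0])
    return [
        [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d) for k_j in keys]
        for q_i in queries
    ]


def smooth_relation_labels(labels, quality):
    """Scale each relation label by the quality of both detected objects.

    labels[i][j][p] is 1 if predicate p holds from object i to object j.
    quality[i] in [0, 1] is a hypothetical per-object detection quality.
    The product rule is an assumption, not the paper's formula.
    """
    n = len(labels)
    return [
        [[label * quality[i] * quality[j] for label in labels[i][j]]
         for j in range(n)]
        for i in range(n)
    ]


def connectivity_targets(labels):
    """1.0 where at least one predicate links the pair, else 0.0."""
    return [[1.0 if any(pair) else 0.0 for pair in row] for row in labels]


if __name__ == "__main__":
    # Two objects with 2-dimensional query/key vectors (toy values).
    queries = [[1.0, 0.0], [0.0, 1.0]]
    keys = [[1.0, 0.0], [0.0, 1.0]]
    print(pairwise_scores(queries, keys))

    # Two predicates; only predicate 0 holds from object 0 to object 1.
    labels = [[[0, 0], [1, 0]], [[0, 0], [0, 0]]]
    quality = [0.5, 0.8]
    print(smooth_relation_labels(labels, quality))
    print(connectivity_targets(labels))
```

Under these assumptions, poorly detected objects (quality near 0) contribute almost no relation signal, so early training is dominated by the detection loss, which matches the curriculum behavior the abstract describes.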

URL

https://arxiv.org/abs/2404.02072

PDF

https://arxiv.org/pdf/2404.02072.pdf

