Paper Reading AI Learner

Informative Scene Graph Generation via Debiasing

2023-08-10 02:04:01
Lianli Gao, Xinyu Lyu, Yuyu Guo, Yuxuan Hu, Yuan-Fang Li, Lu Xu, Heng Tao Shen, Jingkuan Song

Abstract

Scene graph generation aims to detect visual relationship triplets (subject, predicate, object). Due to biases in the data, current models tend to predict common predicates, e.g. "on" and "at", instead of informative ones, e.g. "standing on" and "looking at". This tendency causes a loss of precise information and degrades overall performance. If a model describes an image only as "stone on road" rather than "stone blocking road", the result may be a grave misunderstanding. We argue that this phenomenon stems from two imbalances: semantic-space-level imbalance and training-sample-level imbalance. To address them, we propose DB-SGG, an effective framework based on debiasing rather than conventional distribution fitting. It integrates two components, Semantic Debiasing (SD) and Balanced Predicate Learning (BPL), one for each imbalance. SD uses a confusion matrix and a bipartite graph to construct predicate relationships. BPL adopts a random undersampling strategy and an ambiguity-removal strategy to focus on informative predicates. Because the process is model-agnostic, our method can be easily applied to existing SGG models; it outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 across the three SGG sub-tasks on the SGG-VG dataset. The method is further verified on another complex SGG dataset (SGG-GQA) and two downstream tasks (sentence-to-graph retrieval and image captioning).
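The random undersampling in BPL can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the function name, the per-class cap (here, the median predicate-class size), and the toy data are all hypothetical.

```python
import random
from collections import defaultdict

def undersample_predicates(triplets, cap=None, seed=0):
    """Randomly undersample (subject, predicate, object) triplets so that
    no predicate class exceeds `cap` samples.

    If `cap` is None, use the median class size as the cap (an assumption
    for illustration; the paper does not specify this choice).
    """
    rng = random.Random(seed)
    by_pred = defaultdict(list)
    for t in triplets:
        by_pred[t[1]].append(t)  # group by predicate
    if cap is None:
        sizes = sorted(len(v) for v in by_pred.values())
        cap = sizes[len(sizes) // 2]  # median class size
    balanced = []
    for items in by_pred.values():
        if len(items) > cap:
            items = rng.sample(items, cap)  # drop surplus samples at random
        balanced.extend(items)
    return balanced

# Toy long-tailed distribution: the common predicate "on" dominates.
data = ([("stone", "on", "road")] * 8
        + [("stone", "blocking", "road")] * 2
        + [("man", "standing on", "beach")] * 3)
balanced = undersample_predicates(data)
```

After undersampling, the head predicate "on" is capped at the median class size, so informative tail predicates such as "blocking" make up a larger share of the training samples.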


URL

https://arxiv.org/abs/2308.05286

PDF

https://arxiv.org/pdf/2308.05286.pdf

