Paper Reading AI Learner

Visual Relationships as Functions: Enabling Few-Shot Scene Graph Prediction

2019-06-12 01:27:15
Apoorva Dornadula, Austin Narcomey, Ranjay Krishna, Michael Bernstein, Li Fei-Fei

Abstract

Scene graph prediction --- classifying the set of objects and predicates in a visual scene --- requires substantial training data. The long-tailed distribution of relationships can be an obstacle for such approaches, however, as they can only be trained on the small set of predicates that carry sufficient labels. We introduce the first scene graph prediction model that supports few-shot learning of predicates, enabling scene graph approaches to generalize to a set of new predicates. First, we introduce a new model of predicates as functions that operate on object features or image locations. Next, we define a scene graph model where these functions are trained as message passing protocols within a new graph convolution framework. We train the framework with a frequently occurring set of predicates and show that our approach outperforms those that use the same amount of supervision by 1.78 at recall@50 and performs on par with other scene graph models. Next, we extract object representations generated by the trained predicate functions to train few-shot predicate classifiers on rare predicates with as few as 1 labeled example. When compared to strong baselines like transfer learning from existing state-of-the-art representations, we show improved 5-shot performance by 4.16 recall@1. Finally, we show that our predicate functions generate interpretable visualizations, enabling the first interpretable scene graph model.
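The abstract's central idea — predicates as functions that transform object representations during graph-convolution message passing — can be illustrated with a minimal sketch. Everything below is hypothetical: the random linear maps stand in for the trained predicate weights, and the feature dimension, objects, and edge list are invented for demonstration; the paper's actual architecture is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # feature dimension (hypothetical)

def make_predicate_fn(dim, rng):
    # Each predicate is modeled as a learned function on object features;
    # a random linear map + nonlinearity stands in for trained weights.
    W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return lambda feat: np.tanh(feat @ W)

# Two example predicates, each its own function.
predicates = {name: make_predicate_fn(dim, rng) for name in ["above", "holding"]}

# Object features, e.g. pooled from a detector backbone (here random).
objects = rng.standard_normal((3, dim))

# Edges of a candidate scene graph: (subject, object, predicate).
edges = [(0, 1, "above"), (1, 2, "holding")]

def message_pass(objects, edges, predicates):
    """One message-passing step: each object's representation is updated
    with messages produced by applying predicate functions to the
    features of objects it is related to."""
    messages = np.zeros_like(objects)
    counts = np.zeros(len(objects))
    for subj, obj, pred in edges:
        messages[obj] += predicates[pred](objects[subj])
        counts[obj] += 1
    counts = np.maximum(counts, 1)  # avoid dividing by zero for isolated nodes
    return objects + messages / counts[:, None]

updated = message_pass(objects, edges, predicates)
print(updated.shape)  # (3, 8)
```

The updated object representations are what the paper then reuses for few-shot learning: because the predicate functions were trained on frequent predicates, the representations they produce can serve as features for a lightweight classifier over rare predicates with as few as one labeled example.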

URL

https://arxiv.org/abs/1906.04876

PDF

https://arxiv.org/pdf/1906.04876.pdf

