Abstract
Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding. Most existing VRD methods rely on thousands of training samples for each relationship to achieve satisfactory performance. Some recent papers tackle this problem via few-shot learning with elaborately designed pipelines and pre-trained word vectors. However, the performance of existing few-shot VRD models is severely hampered by poor generalization, as they struggle to handle the vast semantic diversity of visual relationships. In contrast, humans can learn new relationships from just a few examples by drawing on their prior knowledge. Inspired by this, we devise a knowledge-augmented few-shot VRD framework that leverages both textual knowledge and visual relation knowledge to improve the generalization ability of few-shot VRD. The textual knowledge and visual relation knowledge are acquired from a pre-trained language model and an automatically constructed visual relation knowledge graph, respectively. We extensively validate the effectiveness of our framework. Experiments on three benchmarks derived from the widely used Visual Genome dataset show that our framework surpasses existing state-of-the-art models by a large margin.
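To make the setup concrete, below is a minimal, hypothetical sketch of how such knowledge augmentation could look in a prototypical-network-style few-shot relation classifier: class prototypes built from visual relation features are enriched with fixed embeddings standing in for the textual knowledge (language-model predicate embeddings) and the visual relation knowledge (knowledge-graph node features). All module names, dimensions, and the fusion-by-addition choice are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAugmentedFewShotVRD(nn.Module):
    """Hypothetical sketch: knowledge-augmented prototypes for few-shot VRD."""
    def __init__(self, vis_dim=1024, txt_dim=768, kg_dim=128, hid=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid)  # project visual relation features
        self.txt_proj = nn.Linear(txt_dim, hid)  # project LM predicate embeddings
        self.kg_proj = nn.Linear(kg_dim, hid)    # project knowledge-graph features

    def forward(self, support, support_txt, support_kg, query):
        # support:     (n_way, k_shot, vis_dim) visual features of support pairs
        # support_txt: (n_way, txt_dim) LM embedding of each predicate name
        # support_kg:  (n_way, kg_dim) knowledge-graph embedding per predicate
        # query:       (n_query, vis_dim) visual features of query pairs
        proto = self.vis_proj(support).mean(dim=1)  # (n_way, hid) visual prototypes
        # augment visual prototypes with the two knowledge sources (assumed fusion)
        proto = proto + self.txt_proj(support_txt) + self.kg_proj(support_kg)
        q = self.vis_proj(query)                    # (n_query, hid)
        # negative squared Euclidean distance to each prototype as class logits
        logits = -torch.cdist(q, proto).pow(2)      # (n_query, n_way)
        return F.log_softmax(logits, dim=-1)

# Toy 5-way 1-shot episode with random tensors, just to show the shapes:
model = KnowledgeAugmentedFewShotVRD()
log_p = model(torch.randn(5, 1, 1024), torch.randn(5, 768),
              torch.randn(5, 128), torch.randn(10, 1024))
print(log_p.shape)  # torch.Size([10, 5])
```

Under this reading, the knowledge embeddings shift each class prototype toward semantically related predicates even when only one visual example is available, which is one plausible way the reported generalization gain could arise.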
URL
https://arxiv.org/abs/2303.05342