Optimising the Input Image to Improve Visual Relationship Detection

Abstract

Visual Relationship Detection is the task of predicting the correct relation between a subject and an object in a given image. To improve the visual part of this difficult problem, ten preprocessing methods were tested to determine whether the widely used Union method yields optimal results. To focus solely on predicate prediction, neither object detection nor linguistic knowledge was used, so that they could not affect the comparison. Once fine-tuned, the Visual Geometry Group (VGG) models were evaluated using Recall@1, per-predicate recall, activation maximisations, class activation maps, and error analysis. The research found that preprocessing methods such as Union-Without-Background-and-with-Binary-mask (Union-WB-and-B) yield significantly better results than the widely used Union method because, as designed, they enable the Convolutional Neural Network to identify the subject and object in the convolutional layers as well, instead of solely in the fully-connected layers.
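
To make the difference between the two preprocessing styles concrete, the following is a minimal sketch and not the authors' implementation. It assumes that Union means cropping the image to the smallest box enclosing the subject and object boxes, and that Union-WB-and-B additionally zeroes every pixel outside the two boxes and attaches one binary mask per entity. The function names, the `(x1, y1, x2, y2)` box format, and the choice to append the masks as extra channels are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def union_box(subj_box, obj_box):
    """Smallest (x1, y1, x2, y2) box enclosing both boxes (assumed format)."""
    return (min(subj_box[0], obj_box[0]), min(subj_box[1], obj_box[1]),
            max(subj_box[2], obj_box[2]), max(subj_box[3], obj_box[3]))


def box_mask(height, width, box):
    """Boolean mask that is True inside the box, exclusive of x2 and y2."""
    mask = np.zeros((height, width), dtype=bool)
    x1, y1, x2, y2 = box
    mask[y1:y2, x1:x2] = True
    return mask


def union_crop(image, subj_box, obj_box):
    """Plain Union input: crop the image to the union box, background kept."""
    x1, y1, x2, y2 = union_box(subj_box, obj_box)
    return image[y1:y2, x1:x2]


def union_wb_and_b(image, subj_box, obj_box):
    """Hypothetical Union-WB-and-B input.

    Assumption: background pixels (outside both boxes) are set to zero and
    the subject and object masks are appended as two extra 0/1 channels.
    """
    height, width = image.shape[:2]
    subj = box_mask(height, width, subj_box)
    obj = box_mask(height, width, obj_box)

    # Remove the background: keep pixels that lie in either box.
    keep = (subj | obj)[..., None]
    no_background = np.where(keep, image, 0).astype(image.dtype)

    # Append the binary masks so the network can tell subject from object.
    masks = np.stack([subj, obj], axis=-1).astype(image.dtype)
    stacked = np.concatenate([no_background, masks], axis=-1)

    x1, y1, x2, y2 = union_box(subj_box, obj_box)
    return stacked[y1:y2, x1:x2]
```

Under these assumptions the Union-WB-and-B input has five channels, so a standard three-channel VGG would need its first convolution adapted; the original work may combine the masks with the image differently.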

URL

https://arxiv.org/abs/1903.11029

PDF

https://arxiv.org/pdf/1903.11029.pdf