Improving Visual Reasoning by Exploiting The Knowledge in Texts

2021-02-09 11:21:44

Sahand Sharifzadeh, Sina Moayed Baharlou, Martin Schmitt, Hinrich Schütze, Volker Tresp

arXiv_AI

arXiv_AI Classification Relation Knowledge Transformer Self-Supervised

Abstract

This paper presents a new framework for training image-based classifiers from a combination of texts and images with very few labels. We consider a classification framework with three modules: a backbone, a relational reasoning component, and a classification component. While the backbone can be trained from unlabeled images by self-supervised learning, we can fine-tune the relational reasoning and the classification components from external sources of knowledge instead of annotated images. By proposing a transformer-based model that creates structured knowledge from textual input, we enable the utilization of the knowledge in texts. We show that, compared to the supervised baselines with 1% of the annotated images, we can achieve ~8x more accurate results in scene graph classification, ~3x in object classification, and ~1.5x in predicate classification.
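
The three-module layout named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the random weights, and the mean-based context step are hypothetical stand-ins chosen only to show how a backbone, a relational reasoning component, and a classification component could compose. Training (self-supervised for the backbone, knowledge-based fine-tuning for the other two) is omitted.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def softmax(scores):
    """Turn raw scores into probabilities, shifting by the max for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical placeholder for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class Backbone:
    """Stand-in for a feature extractor; in the paper's setting this part
    would be trained on unlabeled images by self-supervised learning."""

    def __init__(self, weights):
        self.weights = weights

    def __call__(self, region):
        return [math.tanh(v) for v in matvec(self.weights, region)]


class RelationalReasoner:
    """Toy context step (an assumption, not the paper's model): blend each
    object's features with the mean features of the other objects."""

    def __init__(self, mix=0.5):
        self.mix = mix

    def __call__(self, features):
        if len(features) < 2:
            return [list(f) for f in features]
        refined = []
        for i, own in enumerate(features):
            others = [f for j, f in enumerate(features) if j != i]
            context = [sum(col) / len(others) for col in zip(*others)]
            refined.append(
                [(1 - self.mix) * a + self.mix * c for a, c in zip(own, context)]
            )
        return refined


class Classifier:
    """Linear layer plus softmax over a hypothetical label set."""

    def __init__(self, weights):
        self.weights = weights

    def __call__(self, feature):
        return softmax(matvec(self.weights, feature))


def classify_scene(regions, backbone, reasoner, classifier):
    """Run the three modules in order: features, context, class probabilities."""
    features = [backbone(r) for r in regions]
    refined = reasoner(features)
    return [classifier(f) for f in refined]


if __name__ == "__main__":
    rng = random.Random(0)
    backbone = Backbone(random_matrix(4, 6, rng))      # 6-dim input, 4-dim features
    classifier = Classifier(random_matrix(3, 4, rng))  # 3 hypothetical classes
    regions = [[rng.random() for _ in range(6)] for _ in range(2)]
    for probs in classify_scene(regions, backbone, RelationalReasoner(), classifier):
        print([round(p, 3) for p in probs])
```

Keeping the modules separate mirrors the point of the abstract: each one can, in principle, be trained from a different source, so only a small share of annotated images is needed.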

URL

https://arxiv.org/abs/2102.04760

PDF

https://arxiv.org/pdf/2102.04760.pdf