VQA-LOL: Visual Question Answering under the Lens of Logic

2020-02-19 17:57:46

Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang

arXiv_CV

Abstract

Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate visual question answering (VQA) through the lens of logical transformation and posit that systems that seek to answer questions about images must be robust to these transformations of the question. If a VQA system is able to answer a question, it should also be able to answer logical compositions of such questions. We analyze the performance of state-of-the-art models on the VQA task under these logical operations and show that they have difficulty answering such questions correctly. We then construct an augmentation of the VQA dataset with questions containing logical operations and retrain the same models to establish a baseline. We further propose a novel methodology to train models to learn negation, conjunction, and disjunction, and show improvement in learning logical composition while retaining performance on VQA. We suggest this work as a move towards embedding logical connectives in visual understanding, along with the benefits of robustness and generalizability. Our code and dataset are available online.
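
To make the idea of logical composition concrete, here is a minimal Python sketch of how yes/no questions about one image could be combined with negation, conjunction, and disjunction, with the ground-truth answer of the composed question derived from the answers of its parts. It is an illustration only, not the authors' dataset construction or training method: the `YesNoQuestion` class, the function names, and the question templates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YesNoQuestion:
    """A yes/no question about an image with its ground-truth answer (hypothetical type)."""
    text: str
    answer: bool  # True means "yes", False means "no"


def _stem(q: YesNoQuestion) -> str:
    # Question text without the trailing question mark.
    return q.text.strip().rstrip("?").strip()


def _lower_first(s: str) -> str:
    # Lowercase the first character so the clause can follow a connective.
    return s[0].lower() + s[1:]


def negate(q: YesNoQuestion) -> YesNoQuestion:
    # Assumed template: ask whether the answer to the original question is "no".
    return YesNoQuestion(f"Is the answer to '{q.text}' no?", not q.answer)


def conjoin(q1: YesNoQuestion, q2: YesNoQuestion) -> YesNoQuestion:
    # "yes" only if both component questions are answered "yes".
    text = f"{_stem(q1)} and {_lower_first(_stem(q2))}?"
    return YesNoQuestion(text, q1.answer and q2.answer)


def disjoin(q1: YesNoQuestion, q2: YesNoQuestion) -> YesNoQuestion:
    # "yes" if at least one component question is answered "yes".
    text = f"{_stem(q1)} or {_lower_first(_stem(q2))}?"
    return YesNoQuestion(text, q1.answer or q2.answer)


if __name__ == "__main__":
    # Made-up example questions, not taken from the VQA dataset.
    dog = YesNoQuestion("Is there a dog?", True)
    sky = YesNoQuestion("Is the sky green?", False)
    for q in (negate(sky), conjoin(dog, sky), disjoin(dog, sky)):
        print(q.text, "->", "yes" if q.answer else "no")
```

A model that is robust in the sense described above should answer each composed question consistently with its answers to the component questions, for example agreeing with De Morgan's laws when a conjunction is negated.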

URL

https://arxiv.org/abs/2002.08325

PDF

https://arxiv.org/pdf/2002.08325.pdf