Abstract
Visual Question Answering (VQA) is an increasingly popular topic in deep learning research, requiring the coordination of natural language processing and computer vision modules into a single architecture. We build upon the model that placed first in the VQA Challenge by developing thirteen new attention mechanisms and introducing a simplified classifier. We performed extensive hyperparameter and architecture searches totaling 300 GPU hours and achieved an evaluation score of 64.78%, outperforming the existing state-of-the-art single model's validation score of 63.15%.
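The abstract does not spell out the attention mechanisms themselves. As a hedged illustration only (not the paper's method), a minimal question-guided attention over image region features, of the kind commonly used in VQA models, can be sketched as follows; all names, dimensions, and parameters here are hypothetical:

```python
import numpy as np

def question_guided_attention(image_feats, question_emb, W_img, W_q, w_a):
    """Minimal question-guided attention over image region features.

    image_feats:  (num_regions, d_img)  -- one feature vector per image region
    question_emb: (d_q,)                -- encoded question vector
    W_img, W_q, w_a are learned parameters in a real model; here they are
    passed in explicitly so the sketch stays self-contained.
    Returns the attended image feature and the attention weights.
    """
    # Project both modalities into a joint space and fuse them additively.
    joint = np.tanh(image_feats @ W_img + question_emb @ W_q)  # (num_regions, d_joint)
    scores = joint @ w_a                                       # (num_regions,)
    # Softmax over regions gives the attention distribution.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of region features conditioned on the question.
    attended = weights @ image_feats                           # (d_img,)
    return attended, weights

# Toy example with random inputs and parameters (illustrative only).
rng = np.random.default_rng(0)
K, d_img, d_q, d_joint = 36, 2048, 512, 256
feats = rng.standard_normal((K, d_img))
q = rng.standard_normal(d_q)
attended, weights = question_guided_attention(
    feats, q,
    rng.standard_normal((d_img, d_joint)) * 0.01,
    rng.standard_normal((d_q, d_joint)) * 0.01,
    rng.standard_normal(d_joint),
)
```

In a full VQA model, the attended image feature would be fused with the question embedding and fed to the answer classifier; the attention weights sum to one over the image regions.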
URL
https://arxiv.org/abs/1803.07724