Abstract
Visual Question Answering (VQA) is an increasingly popular topic in deep learning research, requiring the coordination of natural language processing and computer vision modules into a single architecture. We build upon the model that placed first in the VQA Challenge by developing thirteen new attention mechanisms and introducing a simplified classifier. We performed extensive hyperparameter and architecture searches totaling 300 GPU hours and achieved an evaluation score of 64.78%, outperforming the existing state-of-the-art single model's validation score of 63.15%.
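The abstract does not spell out the attention mechanisms themselves. As a hedged illustration only (not the paper's method), a minimal question-guided attention over image region features, of the kind commonly used in VQA models, can be sketched as follows; all names, dimensions, and parameters here are hypothetical:

```python
import numpy as np

def question_guided_attention(image_feats, question_emb, W_img, W_q, w_a):
    """Minimal question-guided attention over image region features.

    image_feats:  (num_regions, d_img)  -- one feature vector per image region
    question_emb: (d_q,)                -- encoded question vector
    W_img, W_q, w_a are learned parameters in a real model; here they are
    passed in explicitly so the sketch stays self-contained.
    Returns the attended image feature and the attention weights.
    """
    # Project both modalities into a joint space and fuse them additively.
    joint = np.tanh(image_feats @ W_img + question_emb @ W_q)  # (num_regions, d_joint)
    scores = joint @ w_a                                       # (num_regions,)
    # Softmax over regions gives the attention distribution.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of region features conditioned on the question.
    attended = weights @ image_feats                           # (d_img,)
    return attended, weights

# Toy example with random inputs and parameters (illustrative only).
rng = np.random.default_rng(0)
K, d_img, d_q, d_joint = 36, 2048, 512, 256
feats = rng.standard_normal((K, d_img))
q = rng.standard_normal(d_q)
attended, weights = question_guided_attention(
    feats, q,
    rng.standard_normal((d_img, d_joint)) * 0.01,
    rng.standard_normal((d_q, d_joint)) * 0.01,
    rng.standard_normal(d_joint),
)
```

In a full VQA model, the attended image feature would be fused with the question embedding and fed to the answer classifier; the attention weights sum to one over the image regions.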
URL
https://arxiv.org/abs/1803.07724