Abstract
The study of algorithms that automatically answer visual questions is currently motivated by visual question answering (VQA) datasets constructed in artificial VQA settings. We propose VizWiz, the first goal-oriented VQA dataset arising from a natural VQA setting. VizWiz consists of over 31,000 visual questions originating from blind people who each took a picture using a mobile phone and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. VizWiz differs from the many existing VQA datasets because (1) images are captured by blind photographers and so are often of poor quality, (2) questions are spoken and so are more conversational, and (3) visual questions often cannot be answered. Evaluation of modern algorithms for answering visual questions and for deciding whether a visual question is answerable reveals that VizWiz is a challenging dataset. We introduce this dataset to encourage a larger community to develop more generalized algorithms that can assist blind people.
URL
https://arxiv.org/abs/1802.08218