Paper Reading AI Learner

Robustness Analysis of Visual QA Models by Basic Questions

2018-05-26 05:14:02
Jia-Hong Huang, Cuong Duc Dao, Modar Alfadly, C. Huck Yang, Bernard Ghanem

Abstract

Visual Question Answering (VQA) models should have both high robustness and high accuracy. Unfortunately, most current VQA research focuses only on accuracy, because proper methods for measuring the robustness of VQA models are lacking. Our algorithm consists of two main modules. Given a natural language question about an image, the first module takes the question as input and outputs a ranked list of basic questions, with similarity scores, for the given main question. The second module takes the main question, the image, and these basic questions as input and outputs a text-based answer to the main question about the given image. We claim that a robust VQA model is one whose performance does not change much when related basic questions are also made available to it as input. We formulate the basic question generation problem as a LASSO optimization, and we also propose a large-scale Basic Question Dataset (BQD) and Rscore, a novel robustness measure, for analyzing the robustness of VQA models. We hope that our BQD will be used as a benchmark to evaluate the robustness of VQA models, so as to help the community build more robust and accurate VQA models.
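The abstract formulates basic-question ranking as a LASSO optimization: candidate basic-question embeddings form the columns of a matrix, and sparse coefficients recovered against the main question's embedding serve as similarity scores. A minimal sketch of this idea, assuming toy 3-dimensional embeddings and a pure-Python coordinate-descent solver (the paper's actual embeddings and solver are not given here; all vectors below are made up for illustration):

```python
def lasso_cd(A, b, lam, iters=100):
    """Coordinate-descent solver for min_x 0.5*||A x - b||^2 + lam*||x||_1.
    A is a list of rows; returns the sparse coefficient vector x."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    col_sq = [sum(A[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(iters):
        for j in range(n_cols):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(
                A[i][j] * (b[i] - sum(A[i][k] * x[k] for k in range(n_cols) if k != j))
                for i in range(n_rows)
            )
            # Soft-thresholding update drives weak candidates exactly to zero.
            if rho > lam:
                x[j] = (rho - lam) / col_sq[j]
            elif rho < -lam:
                x[j] = (rho + lam) / col_sq[j]
            else:
                x[j] = 0.0
    return x

# Toy "question embeddings" (illustrative only, not the paper's features).
main_q = [1.0, 0.0, 1.0]          # embedding of the main question
candidates = [
    [1.0, 0.0, 1.0],              # candidate 0: identical to the main question
    [0.0, 1.0, 0.0],              # candidate 1: unrelated
    [1.0, 0.0, 0.9],              # candidate 2: closely related
]
# Stack candidate embeddings as the columns of A.
A = [[c[i] for c in candidates] for i in range(3)]
scores = lasso_cd(A, main_q, lam=0.1)
ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
```

The sparsity induced by the L1 penalty is the point of the formulation: unrelated candidates receive a weight of exactly zero, so the surviving nonzero coefficients directly give the ranked basic questions with their similarity scores.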
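The abstract defines robustness as performance not changing much when related basic questions are appended to the input, but does not reproduce the Rscore formula. Purely as an illustration of that idea (this is NOT the paper's Rscore definition), one could compare a model's accuracy with and without the appended basic questions:

```python
def robustness_proxy(acc_plain, acc_with_bq):
    """Illustrative robustness proxy, NOT the paper's Rscore formula:
    relative change in accuracy when basic questions are appended.
    Values near 1.0 indicate a robust model."""
    return 1.0 - abs(acc_plain - acc_with_bq) / acc_plain

# A model dropping from 60% to 57% accuracy keeps 95% of its performance.
score = robustness_proxy(0.60, 0.57)
```

A proxy like this captures the stated intuition that a robust VQA model's accuracy should be nearly unchanged under the extra basic-question input; the paper's actual Rscore should be taken from the full text.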

URL

https://arxiv.org/abs/1709.04625

PDF

https://arxiv.org/pdf/1709.04625.pdf

