Modularized Textual Grounding for Counterfactual Resilience

2019-04-07 05:59:04
Zhiyuan Fang, Shu Kong, Charless Fowlkes, Yezhou Yang

Abstract

Computer vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries. To achieve high grounding precision, current textual grounding methods rely heavily on large-scale training data with manual annotations at the pixel level. Such annotations are expensive to obtain and thus severely narrow the model's scope of real-world applications. Moreover, most of these methods sacrifice interpretability and generalizability, and they neglect the importance of resilience to counterfactual inputs. To address these issues, we propose a visual grounding system which is 1) end-to-end trainable in a weakly supervised fashion with only image-level annotations, and 2) counterfactually resilient owing to its modular design. Specifically, we decompose textual descriptions into three levels, namely entity, semantic attribute, and color information, and perform compositional grounding progressively. We validate our model through a series of experiments and demonstrate its improvement over state-of-the-art methods. In particular, our model not only surpasses other weakly/un-supervised methods and even approaches the strongly supervised ones, but also offers interpretable decision making and performs much better than all the others in the face of counterfactual classes.
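
The abstract describes a modular design (entity, attribute, and color modules whose responses are composed progressively) but does not spell out the fusion rule. As a minimal illustrative sketch only, the PyTorch snippet below shows one plausible way such per-module heatmaps could be combined; the function name `compose_grounding` and the sigmoid-gated product are our assumptions, not the paper's actual method:

```python
# Illustrative sketch only: a hypothetical fusion of per-module grounding
# heatmaps (entity / attribute / color), NOT the paper's released code.
import torch

def compose_grounding(entity_map, attr_map=None, color_map=None):
    """Progressively compose (B, H, W) heatmaps from the three modules.

    Start from the entity response; when the query mentions an attribute
    or a color, gate the running map by that module's (sigmoid-squashed)
    response, i.e. take a soft intersection of the evidence.
    """
    out = entity_map
    for module_map in (attr_map, color_map):
        if module_map is not None:
            out = out * torch.sigmoid(module_map)
    # Min-max normalize each image's map to [0, 1] for thresholding/visualization.
    flat = out.flatten(1)
    lo = flat.min(dim=1).values.view(-1, 1, 1)
    hi = flat.max(dim=1).values.view(-1, 1, 1)
    return (out - lo) / (hi - lo + 1e-8)

# Query "red car": the entity ("car") and color ("red") modules fire;
# the attribute module is skipped because the query names no attribute.
entity = torch.rand(2, 7, 7)   # batch of 2, 7x7 spatial grid
color = torch.rand(2, 7, 7)
heat = compose_grounding(entity, color_map=color)  # shape (2, 7, 7)
```

A scheme along these lines would also illustrate why modularity helps with counterfactual queries: a module for a concept absent from (or contradicted by) the image produces a weak response, suppressing the composed map instead of forcing a spurious grounding.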

URL

https://arxiv.org/abs/1904.03589

PDF

https://arxiv.org/pdf/1904.03589.pdf

