Abstract
Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries. To achieve high grounding precision, current textual grounding methods rely heavily on large-scale training data with manual annotations at the pixel level. Such annotations are expensive to obtain and thus severely narrow the model's scope of real-world applications. Moreover, most of these methods sacrifice interpretability and generalizability, and neglect the importance of being resilient to counterfactual inputs. To address these issues, we propose a visual grounding system which is 1) end-to-end trainable in a weakly supervised fashion with only image-level annotations, and 2) counterfactually resilient owing to its modular design. Specifically, we decompose textual descriptions into three levels: entity, semantic attribute, and color information, and perform compositional grounding progressively. We validate our model through a series of experiments and demonstrate its improvement over the state-of-the-art methods. In particular, our model's performance not only surpasses other weakly/un-supervised methods and even approaches the strongly supervised ones, but is also interpretable in its decision making and performs much better in the face of counterfactual classes than all the others.
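The compositional idea above can be illustrated with a minimal sketch. This is not the paper's code: the vocabularies, the `decompose` split, and the multiplicative fusion of per-level score maps are all simplifying assumptions made only to show how a query might be decomposed into entity, semantic attribute, and color levels and grounded progressively.

```python
# Hypothetical illustration of compositional grounding (not the paper's
# implementation). A query is split into three levels -- entity, semantic
# attribute, color -- and per-level score maps over a spatial grid are
# combined by element-wise product, so each level progressively refines
# the grounding.

# Toy vocabularies (assumption; a real system would learn these).
COLORS = {"red", "green", "blue", "white", "black", "yellow"}
ATTRIBUTES = {"striped", "wooden", "small", "large", "furry"}

def decompose(query):
    """Split a query like 'small red dog' into (entity, attributes, colors)."""
    tokens = query.lower().split()
    colors = [t for t in tokens if t in COLORS]
    attrs = [t for t in tokens if t in ATTRIBUTES]
    entity = [t for t in tokens if t not in COLORS and t not in ATTRIBUTES]
    return " ".join(entity), attrs, colors

def ground(query, score_map_fn, grid=(4, 4)):
    """Fuse per-level score maps into one grounding map.

    score_map_fn(term, grid) is assumed to return an h x w list of
    per-cell scores in [0, 1] for a single term.
    """
    entity, attrs, colors = decompose(query)
    h, w = grid
    combined = [[1.0] * w for _ in range(h)]
    for term in [entity] + attrs + colors:
        m = score_map_fn(term, grid)
        combined = [[combined[i][j] * m[i][j] for j in range(w)]
                    for i in range(h)]
    return combined
```

With a stub scorer that returns a uniform 0.5 map, the query "small red dog" multiplies three level maps, so every cell of the result is 0.125; the modular structure also makes counterfactual queries easy to reject, since any level scoring near zero suppresses the whole map.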
URL
https://arxiv.org/abs/1904.03589