Abstract
We revisit a particular visual grounding method: the "Image Retrieval Using Scene Graphs" (IRSG) system of Johnson et al. (2015). Our experiments indicate that the system does not effectively use its learned object-relationship models. We also look closely at the IRSG dataset, as well as the widely used Visual Relationship Dataset (VRD) that is adapted from it. We find that these datasets exhibit biases that allow methods that ignore relationships to perform relatively well. We also describe several other problems with the IRSG dataset, and report on experiments using a subset of the dataset in which the biases and other problems are removed. Our studies contribute to a more general effort: that of better understanding what machine learning methods that combine language and vision actually learn and what popular datasets actually test.
URL
https://arxiv.org/abs/1904.02225