Abstract
Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step toward bridging vision and language models. However, current state-of-the-art computer vision models still lack the ability to perform spatial reasoning well. Existing datasets mostly cover a relatively small number of spatial relations, all of which are static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD) -- a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images) covering 30 distinct spatial prepositional senses, in the form of object-interaction simulations generated synthetically using Unity3D. In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions. To our knowledge, no existing dataset represents temporal relations through visual settings. We also provide 3D information about object interactions, such as frame-wise coordinates and descriptions of the objects used. The goal of this synthetic dataset is to help models perform better at visual relationship detection in real-world settings. We demonstrate an increase in the performance of various models on two real-world datasets (ImageNet-VidVRD and Spatial Senses) when pretrained on the STUPD dataset, in comparison to other pretraining datasets.
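The frame-wise 3D coordinates mentioned above could, for instance, support simple geometric checks on object interactions, such as distinguishing a dynamic relation ("toward") from a static one. The sketch below uses a hypothetical annotation layout; the field names and values are illustrative assumptions, not STUPD's actual schema.

```python
import math

# Hypothetical STUPD-style annotation: per-frame 3D coordinates for two
# interacting objects (field names are illustrative, not the real format).
annotation = {
    "preposition": "toward",
    "objects": ["ball", "box"],
    "frames": [
        {"ball": (0.0, 0.0, 5.0), "box": (0.0, 0.0, 0.0)},
        {"ball": (0.0, 0.0, 3.0), "box": (0.0, 0.0, 0.0)},
        {"ball": (0.0, 0.0, 1.0), "box": (0.0, 0.0, 0.0)},
    ],
}

def distances(ann, a, b):
    """Euclidean distance between objects a and b at each frame."""
    return [math.dist(frame[a], frame[b]) for frame in ann["frames"]]

def is_approaching(ann, a, b):
    """A dynamic relation like 'toward' implies monotonically shrinking distance."""
    d = distances(ann, a, b)
    return all(later < earlier for earlier, later in zip(d, d[1:]))

print(distances(annotation, "ball", "box"))      # [5.0, 3.0, 1.0]
print(is_approaching(annotation, "ball", "box"))  # True
```

This kind of geometric supervision is exactly what static image datasets cannot offer, since motion-dependent relations only emerge across frames.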
URL
https://arxiv.org/abs/2309.06680