Paper Reading AI Learner

STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning

2023-09-13 02:35:59
Palaash Agrawal, Haidi Azaman, Cheston Tan

Abstract

Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step toward bridging visual and language models. However, current state-of-the-art computer vision models still lack the ability to perform spatial reasoning well. Existing datasets mostly cover a relatively small number of spatial relations, all of them static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD) -- a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images) covering 30 distinct spatial prepositional senses, in the form of object-interaction simulations generated synthetically using Unity3D. In addition to spatial relations, we also provide 50K visual depictions across 10 temporal relations, in the form of videos depicting event/time-point interactions. To our knowledge, no existing dataset represents temporal relations through visual settings. The dataset also provides 3D information about object interactions, such as frame-wise coordinates and descriptions of the objects used. The goal of this synthetic dataset is to help models perform better at visual relationship detection in real-world settings. We demonstrate an increase in the performance of various models on two real-world datasets (ImageNet-VidVRD and Spatial Senses) when pretrained on the STUPD dataset, in comparison to other pretraining datasets.
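The per-sample annotations the abstract describes (a relation label, the interacting objects, and frame-wise 3D coordinates) might be organized along the following lines. This is a minimal illustrative sketch; the class and field names are assumptions for exposition, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord3D = Tuple[float, float, float]


@dataclass
class ObjectTrack:
    """Hypothetical per-object annotation: a description plus frame-wise 3D coordinates."""
    name: str
    coords: List[Coord3D]  # one (x, y, z) position per frame


@dataclass
class StupdSample:
    """Hypothetical per-sample record: one relation and its two interacting objects."""
    relation: str          # e.g. one of the 30 spatial prepositional senses, such as "above"
    subject: ObjectTrack
    obj: ObjectTrack

    def num_frames(self) -> int:
        # A static (image) depiction would have a single frame;
        # a dynamic (video) depiction would have many.
        return len(self.subject.coords)


# Illustrative two-frame example of a dynamic "above" interaction.
sample = StupdSample(
    relation="above",
    subject=ObjectTrack("ball", [(0.0, 2.0, 0.0), (0.0, 1.5, 0.0)]),
    obj=ObjectTrack("table", [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]),
)
print(sample.relation, sample.num_frames())
```

Frame-wise coordinates like these are what allow a model pretrained on the dataset to learn dynamic relations (motion over time) rather than only static spatial configurations.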

URL

https://arxiv.org/abs/2309.06680

PDF

https://arxiv.org/pdf/2309.06680.pdf
