Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph

2019-03-25 18:41:06
Yao-Hung Hubert Tsai, Santosh Divvala, Louis-Philippe Morency, Ruslan Salakhutdinov, Ali Farhadi

Abstract

Visual relationship reasoning is a crucial yet challenging task for understanding rich interactions across visual concepts. For example, a relationship 'man, open, door' involves a complex relation 'open' between concrete entities 'man, door'. While much of the existing work has studied this problem in the context of still images, understanding visual relationships in videos has received limited attention. Due to their temporal nature, videos enable us to model and reason about a more comprehensive set of visual relationships, such as those requiring multiple (temporal) observations (e.g., 'man, lift up, box' vs. 'man, put down, box'), as well as relationships that are often correlated through time (e.g., 'woman, pay, money' followed by 'woman, buy, coffee'). In this paper, we construct a Conditional Random Field on a fully-connected spatio-temporal graph that exploits the statistical dependency between relational entities spatially and temporally. We introduce a novel gated energy function parametrization that learns adaptive relations conditioned on visual observations. Our model optimization is computationally efficient, and its space computation complexity is significantly amortized through our proposed parameterization. Experimental results on benchmark video datasets (ImageNet Video and Charades) demonstrate state-of-the-art performance across three standard relationship reasoning tasks: Detection, Tagging, and Recognition.
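As a rough illustration of the idea of an observation-gated energy on a fully connected graph, the toy sketch below scores a joint labeling as a sum of unary costs plus pairwise costs whose strength is scaled by a gate computed from per-node features. It is not the paper's parameterization: the sigmoid gate, the names `gate_weight`, `features`, and `compat`, and the brute-force search are all assumptions made for this example, and the exhaustive search is exponential in the number of nodes, unlike the efficient optimization the abstract describes.

```python
import itertools
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def total_energy(labels, unary, compat, gate_weight, features):
    """Energy of one joint labeling on a fully connected graph.

    labels:      tuple of label indices, one per node
    unary:       unary[i][l] is the cost of giving node i label l
    compat:      compat[a][b] is the base cost of labels a and b co-occurring
    gate_weight: hypothetical scalar gate parameter (an assumption)
    features:    hypothetical scalar observation per node (an assumption)
    """
    n = len(labels)
    energy = sum(unary[i][labels[i]] for i in range(n))
    # Fully connected: every unordered pair of nodes contributes a term.
    for i in range(n):
        for j in range(i + 1, n):
            # The gate depends on the observations, not on the labels.
            gate = sigmoid(gate_weight * (features[i] + features[j]))
            energy += gate * compat[labels[i]][labels[j]]
    return energy


def map_labeling(unary, compat, gate_weight, features):
    """Exhaustively search for the lowest-energy labeling (toy sizes only)."""
    num_nodes = len(unary)
    num_labels = len(unary[0])
    return min(
        itertools.product(range(num_labels), repeat=num_nodes),
        key=lambda labels: total_energy(
            labels, unary, compat, gate_weight, features
        ),
    )


# Three nodes (for example, the same entity in three frames), two labels.
unary = [[0.0, 1.0], [0.6, 0.4], [0.0, 1.0]]
compat = [[0.0, 1.0], [1.0, 0.0]]  # disagreeing labels cost 1 before gating
features = [1.0, 1.0, 1.0]

# Gate close to 1: pairwise terms dominate and the middle node is smoothed.
print(map_labeling(unary, compat, 20.0, features))   # (0, 0, 0)
# Gate close to 0: pairwise terms vanish and each node follows its unary cost.
print(map_labeling(unary, compat, -20.0, features))  # (0, 1, 0)
```

With the gate open, the middle node's weak preference for label 1 is overridden by its neighbours; with the gate closed, the nodes are labeled independently. This is the sense in which a gate lets the observations decide how strongly related nodes influence each other.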

URL

https://arxiv.org/abs/1903.10547

PDF

https://arxiv.org/pdf/1903.10547.pdf

