Paper Reading AI Learner

Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding

2019-05-17 17:33:48
Effrosyni Mavroudi, Benjamín Béjar Haro, René Vidal

Abstract

Many problems in video understanding require labeling multiple activities occurring concurrently in different parts of a video, including the objects and actors participating in such activities. However, state-of-the-art methods in computer vision focus primarily on tasks such as action classification, action detection, or action segmentation, where typically only one action label needs to be predicted. In this work, we propose a generic approach to classifying one or more nodes of a spatio-temporal graph grounded on spatially localized semantic entities in a video, such as actors and objects. In particular, we combine an attributed spatio-temporal visual graph, which captures visual context and interactions, with an attributed symbolic graph grounded on the semantic label space, which captures relationships between multiple labels. We further propose a neural message passing framework for jointly refining the representations of the nodes and edges of the hybrid visual-symbolic graph. Our framework features a) node-type and edge-type conditioned filters and adaptive graph connectivity, b) a soft-assignment module for connecting visual nodes to symbolic nodes and vice versa, c) a symbolic graph reasoning module that enforces semantic coherence, and d) a pooling module for aggregating the refined node and edge representations for downstream classification tasks. We demonstrate the generality of our approach on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multilabel temporal action localization on the large-scale Charades dataset, where we outperform existing deep learning approaches, using only raw RGB frames.
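The message-passing scheme the abstract outlines can be illustrated with a minimal NumPy sketch. All dimensions, weight names, and the attention-based soft assignment below are illustrative assumptions for a toy hybrid graph, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hybrid graph: 4 visual nodes (e.g. actor/object regions) and
# 3 symbolic nodes (label embeddings). Dimensions are illustrative.
D = 8
visual = rng.standard_normal((4, D))
symbolic = rng.standard_normal((3, D))

# Edge-type-conditioned filters: one weight matrix per edge type.
W = {etype: rng.standard_normal((D, D)) * 0.1
     for etype in ("vis-vis", "vis-sym", "sym-sym")}

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def message_pass(visual, symbolic):
    """One round of message passing over the hybrid visual-symbolic graph."""
    # Visual-to-visual messages over a fully connected graph (a stand-in
    # for the spatio-temporal visual graph).
    vis_msg = (visual @ W["vis-vis"]).mean(axis=0, keepdims=True)

    # Soft assignment of visual nodes to symbolic nodes via attention on
    # feature similarity (a stand-in for the soft-assignment module).
    assign = softmax(visual @ symbolic.T, axis=1)        # (4, 3)
    sym_to_vis = assign @ (symbolic @ W["vis-sym"])      # messages to visual
    vis_to_sym = assign.T @ (visual @ W["vis-sym"])      # messages to symbolic

    # Symbolic graph reasoning: symbolic nodes exchange messages as well.
    sym_msg = (symbolic @ W["sym-sym"]).mean(axis=0, keepdims=True)

    visual_new = np.tanh(visual + vis_msg + sym_to_vis)
    symbolic_new = np.tanh(symbolic + sym_msg + vis_to_sym)
    return visual_new, symbolic_new

visual, symbolic = message_pass(visual, symbolic)

# Pooling module: aggregate refined node features into a single
# graph-level representation for a downstream classifier.
graph_repr = np.concatenate([visual.mean(axis=0), symbolic.mean(axis=0)])
```

After one round, each visual node has absorbed context from other visual nodes and from semantically related symbolic nodes, and vice versa; stacking several such rounds would propagate information across the whole hybrid graph.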

URL

https://arxiv.org/abs/1905.07385

PDF

https://arxiv.org/pdf/1905.07385.pdf

