Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding

Abstract

Many problems in video understanding require labeling multiple activities that occur concurrently in different parts of a video, along with the objects and actors participating in those activities. However, state-of-the-art methods in computer vision focus primarily on tasks such as action classification, action detection, or action segmentation, where typically only one action label needs to be predicted. In this work, we propose a generic approach to classifying one or more nodes of a spatio-temporal graph grounded on spatially localized semantic entities in a video, such as actors and objects. In particular, we combine an attributed spatio-temporal visual graph, which captures visual context and interactions, with an attributed symbolic graph grounded on the semantic label space, which captures relationships between multiple labels. We further propose a neural message passing framework for jointly refining the representations of the nodes and edges of the hybrid visual-symbolic graph. Our framework features (a) node-type and edge-type conditioned filters and adaptive graph connectivity, (b) a soft-assignment module for connecting visual nodes to symbolic nodes and vice versa, (c) a symbolic graph reasoning module that enforces semantic coherence, and (d) a pooling module for aggregating the refined node and edge representations for downstream classification tasks. We demonstrate the generality of our approach on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multi-label temporal action localization on the large-scale Charades dataset, where we outperform existing deep learning approaches using only raw RGB frames.
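
To make the flow between the visual and symbolic graphs concrete, the following is a minimal sketch of one refinement step, not the authors' implementation. It assumes dot-product similarity for the soft assignment, mean aggregation over the symbolic graph, and residual updates with no learned parameters; the abstract describes learned node-type and edge-type conditioned filters, which are omitted here. All function names (`soft_assign`, `hybrid_step`, `mean_pool`) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def row_normalize(adj):
    """Scale each adjacency row to sum to 1 (all-zero rows stay zero)."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def soft_assign(visual, symbolic):
    """Assumed form: softmax over dot-product similarity.
    Row i holds visual node i's weights over the symbolic nodes."""
    scores = matmul(visual, transpose(symbolic))
    return [softmax(row) for row in scores]


def hybrid_step(visual, symbolic, sym_adj):
    """One illustrative round: visual -> symbolic -> reasoning -> visual."""
    assign = soft_assign(visual, symbolic)            # V x S
    # Visual -> symbolic: each label node gathers weighted visual features.
    symbolic = add(symbolic, matmul(transpose(assign), visual))
    # Symbolic reasoning: average features over the label graph.
    symbolic = matmul(row_normalize(sym_adj), symbolic)
    # Symbolic -> visual: send refined label features back (residual update).
    visual = add(visual, matmul(assign, symbolic))
    return visual, symbolic, assign


def mean_pool(nodes):
    """Pool node features into a single vector for a downstream classifier."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]
```

In this sketch the same assignment matrix is reused in both directions, which is one simple way to connect visual nodes to symbolic nodes "and vice versa"; a learned model could use separate assignments for each direction.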

URL

https://arxiv.org/abs/1905.07385

PDF

https://arxiv.org/pdf/1905.07385.pdf