Abstract
Visual relationship reasoning is a crucial yet challenging task for understanding rich interactions across visual concepts. For example, a relationship 'man, open, door' involves a complex relation 'open' between concrete entities 'man, door'. While much of the existing work has studied this problem in the context of still images, understanding visual relationships in videos has received limited attention. Due to their temporal nature, videos enable us to model and reason about a more comprehensive set of visual relationships, such as those requiring multiple (temporal) observations (e.g., 'man, lift up, box' vs. 'man, put down, box'), as well as relationships that are often correlated through time (e.g., 'woman, pay, money' followed by 'woman, buy, coffee'). In this paper, we construct a Conditional Random Field on a fully-connected spatio-temporal graph that exploits the statistical dependency between relational entities spatially and temporally. We introduce a novel gated energy function parameterization that learns adaptive relations conditioned on visual observations. Our model optimization is computationally efficient, and its space computation complexity is significantly amortized through our proposed parameterization. Experimental results on benchmark video datasets (ImageNet Video and Charades) demonstrate state-of-the-art performance across three standard relationship reasoning tasks: Detection, Tagging, and Recognition.
URL
https://arxiv.org/abs/1903.10547