Abstract
We introduce a hierarchical architecture for video understanding that exploits the structure of real-world actions by capturing targets at different levels of granularity. The model is designed to first learn simpler, coarse-grained tasks and then move on to more fine-grained targets; it is trained with a joint loss over the different granularity levels. We report empirical results on the recent release of the Something-Something dataset, which provides a hierarchy of targets, namely coarse-grained action groups, fine-grained action categories, and captions. Experiments suggest that models exploiting targets at multiple levels of granularity achieve better performance at every level.
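The abstract does not specify how the joint loss is formed. A minimal sketch of one plausible formulation, assuming a shared video representation feeding two classification heads (coarse action group and fine-grained category) whose cross-entropy losses are combined with a hypothetical weighting `alpha` (the head shapes, class counts, and weighting are illustrative assumptions, not details from the paper):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    # negative log-likelihood of the true class
    return -np.log(probs[label] + 1e-12)

def joint_loss(features, W_coarse, W_fine, y_coarse, y_fine, alpha=0.5):
    """Weighted sum of per-level losses; alpha is a hypothetical mixing weight."""
    p_coarse = softmax(features @ W_coarse)  # coarse action-group head
    p_fine = softmax(features @ W_fine)      # fine-grained category head
    return (alpha * cross_entropy(p_coarse, y_coarse)
            + (1 - alpha) * cross_entropy(p_fine, y_fine))

rng = np.random.default_rng(0)
feats = rng.normal(size=8)        # shared video representation (toy size)
W_c = rng.normal(size=(8, 5))     # e.g. 5 coarse action groups
W_f = rng.normal(size=(8, 50))    # e.g. 50 fine-grained categories
loss = joint_loss(feats, W_c, W_f, y_coarse=2, y_fine=17)
print(loss)
```

Minimizing such a combined objective optimizes both heads jointly through the shared representation; a curriculum variant could start with a larger `alpha` (emphasizing the coarse task) and anneal it toward the fine-grained one.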
URL
https://arxiv.org/abs/1809.03316