Abstract
Action recognition has advanced in recent years thanks to benchmarks with rich annotations. However, research is still largely limited to human action or sports recognition, focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting the large-scale "Holistic Video Understanding Dataset" (HVU). HVU is organized hierarchically in a semantic taxonomy that frames multi-label and multi-task video understanding as a comprehensive problem encompassing the recognition of multiple semantic aspects in a dynamic scene. HVU contains approx. 577k videos in total with 13M annotations for the training and validation sets, spanning 4378 classes. HVU covers semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts, which naturally capture real-world scenarios. Further, we introduce a new spatio-temporal deep neural network architecture called "Holistic Appearance and Temporal Network" (HATNet) that fuses 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues. HATNet targets the multi-label and multi-task learning problem and is trained end-to-end. Our experiments show that HATNet trained on HVU outperforms current state-of-the-art methods on challenging human action datasets: HMDB51, UCF101, and Kinetics. The dataset and code will be made publicly available.
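The multi-label, multi-task setup described in the abstract — independent label sets for scenes, objects, actions, etc., predicted jointly from a shared video representation — can be illustrated with a minimal sketch. This is not the HATNet implementation: the task names, class counts, and feature dimension below are hypothetical placeholders, and the shared feature vector stands in for the fused 2D/3D appearance and temporal representation. The key idea shown is one sigmoid head per task with a summed binary cross-entropy loss, the standard objective for multi-label classification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task class counts (illustrative only; the real HVU
# taxonomy spans 4378 classes across six semantic categories).
TASKS = {"scene": 5, "object": 8, "action": 6}
FEAT_DIM = 16  # assumed dimension of the shared video feature

# One linear head per task on top of the shared feature vector.
heads = {t: rng.standard_normal((FEAT_DIM, c)) * 0.1 for t, c in TASKS.items()}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_task_bce(feat, labels):
    """Sum of per-task binary cross-entropies.

    Each class gets an independent sigmoid, so a video can carry
    several labels per task (multi-label), and the losses of all
    tasks are summed (multi-task)."""
    eps = 1e-7
    total = 0.0
    for task, W in heads.items():
        p = sigmoid(feat @ W)          # per-class probabilities
        y = labels[task]               # multi-hot ground truth
        bce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean()
        total += bce
    return total

# Toy example: one video feature with random multi-hot labels per task.
feat = rng.standard_normal(FEAT_DIM)
labels = {t: (rng.random(c) < 0.3).astype(float) for t, c in TASKS.items()}
loss = multi_task_bce(feat, labels)
print(loss)
```

In an end-to-end model the gradient of this summed loss would flow back through all heads into the shared backbone, which is what lets the appearance and temporal cues be learned jointly across tasks.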
URL
https://arxiv.org/abs/1904.11451