Abstract
In this dissertation, I present my work on exploiting temporal information for better video understanding. Specifically, I have worked on two problems: action recognition and semantic segmentation. For action recognition, I propose a framework, termed hidden two-stream networks, that learns an optimal motion representation without requiring the computation of optical flow. The framework addresses several challenges in video classification, including learning effective motion representations, performing real-time inference, handling multiple frame rates, and generalizing to unseen actions. For semantic segmentation, I introduce a general framework that uses video prediction models to synthesize new training samples. By scaling up the training dataset, the trained models become more accurate and robust than previous models, even without modifications to the network architectures or objective functions. I believe videos hold much more potential to be mined, and that temporal information is one of the most important cues for machines to perceive the visual world better.
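To make the first idea concrete, below is a minimal PyTorch sketch of the hidden two-stream concept: a small "MotionNet" maps stacked RGB frames to flow-like motion features, and a classifier consumes those features, with both parts trained end-to-end so that no precomputed optical flow is needed. All module names, layer sizes, the 11-frame input, and the 101-class output are illustrative assumptions, not the dissertation's actual architecture.

    # Hedged sketch of the hidden two-stream idea; shapes and layers are assumptions.
    import torch
    import torch.nn as nn

    class MotionNet(nn.Module):
        """Predicts flow-like 2-channel maps from stacked consecutive frames."""
        def __init__(self, num_frames=11):
            super().__init__()
            in_ch = 3 * num_frames            # stacked RGB frames
            out_ch = 2 * (num_frames - 1)     # one (u, v) field per frame pair
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, out_ch, 3, padding=1),
            )

        def forward(self, frames):            # frames: (B, 3*num_frames, H, W)
            return self.net(frames)           # (B, 2*(num_frames-1), H, W)

    class HiddenTwoStream(nn.Module):
        """Temporal stream whose motion input is generated on the fly."""
        def __init__(self, num_frames=11, num_classes=101):
            super().__init__()
            self.motion = MotionNet(num_frames)
            self.classifier = nn.Sequential(
                nn.Conv2d(2 * (num_frames - 1), 64, 3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, num_classes),
            )

        def forward(self, frames):
            motion = self.motion(frames)      # learned motion representation
            return self.classifier(motion)    # action logits

    model = HiddenTwoStream()
    logits = model(torch.randn(2, 33, 224, 224))  # 2 clips of 11 RGB frames
    print(logits.shape)                           # torch.Size([2, 101])

The second idea can be sketched in a similarly hedged way: a video prediction model estimates the motion between a labeled frame at time t and the frame at t+1, and the same motion warps the label map, yielding a new (image, label) training pair. In the toy code below the zero-filled flow tensor merely stands in for the motion a real video prediction network would produce, and the warping function is a generic backward warp, not the dissertation's implementation.

    # Hedged sketch of synthesizing segmentation training pairs via motion warping.
    import torch
    import torch.nn.functional as F

    def warp(x, flow, mode="nearest"):
        """Backward-warp x (B, C, H, W) with a pixel-unit flow (B, 2, H, W)."""
        B, _, H, W = x.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        grid = torch.stack((xs, ys), dim=0).float().expand(B, -1, -1, -1)
        coords = grid + flow
        # normalize sampling coordinates to [-1, 1] for grid_sample
        coords_x = 2 * coords[:, 0] / (W - 1) - 1
        coords_y = 2 * coords[:, 1] / (H - 1) - 1
        grid_n = torch.stack((coords_x, coords_y), dim=-1)
        return F.grid_sample(x, grid_n, mode=mode, align_corners=True)

    B, H, W = 1, 64, 64
    image_t = torch.rand(B, 3, H, W)
    label_t = torch.randint(0, 19, (B, 1, H, W)).float()  # e.g. 19 classes
    flow = torch.zeros(B, 2, H, W)          # stand-in for predicted motion
    image_t1 = warp(image_t, flow, mode="bilinear")  # synthesized future frame
    label_t1 = warp(label_t, flow)          # propagated label: a new training pair

Nearest-neighbor sampling is used for the label map so that warped labels remain valid class indices, while bilinear sampling suits the image; this design choice is mine, made for the sketch.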
URL
https://arxiv.org/abs/1905.10654