Paper Reading AI Learner

Exploring Temporal Information for Improved Video Understanding


Abstract

In this dissertation, I present my work towards exploring temporal information for better video understanding. Specifically, I have worked on two problems: action recognition and semantic segmentation. For action recognition, I have proposed a framework, termed hidden two-stream networks, to learn an optimal motion representation that does not require the computation of optical flow. My framework alleviates several challenges faced in video classification, such as learning motion representations, real-time inference, multi-framerate handling, generalizability to unseen actions, etc. For semantic segmentation, I have introduced a general framework that uses video prediction models to synthesize new training samples. By scaling up the training dataset, my trained models are more accurate and robust than previous models even without modifications to the network architectures or objective functions. I believe videos have much more potential to be mined, and temporal information is one of the most important cues for machines to perceive the visual world better.

Abstract (translated)

在本文中,我介绍了我的工作,旨在探索时间信息,以更好地理解视频。具体来说,我研究了两个问题:动作识别和语义分割。对于动作识别,我提出了一个称为隐藏双流网络的框架,以学习不需要计算光流的最优运动表示。我的框架减轻了视频分类面临的几个挑战,如学习运动表示、实时推理、多帧处理、不可见动作的可归纳性等。对于语义分割,我引入了一个通用框架,使用视频预测模型合成新的训练样本。通过扩展训练数据集,我的训练模型比以前的模型更准确和健壮,即使没有修改网络架构或目标函数。我相信视频有更大的开发潜力,时间信息是机器更好地感知视觉世界的最重要线索之一。

URL

https://arxiv.org/abs/1905.10654

PDF

https://arxiv.org/pdf/1905.10654.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot