A Memory Network Approach for Story-based Temporal Summarization of 360° Videos

2018-06-18 15:05:21
Sangho Lee, Jinyoung Sung, Youngjae Yu, Gunhee Kim

Abstract

We address the problem of story-based temporal summarization of long 360° videos. We propose a novel memory network model named Past-Future Memory Network (PFMN), in which we first compute the scores of 81 normal field of view (NFOV) region proposals cropped from the input 360° video, and then recover a latent, collective summary using the network with two external memories that store the embeddings of previously selected subshots and future candidate subshots. Our major contributions are two-fold. First, our work is the first to address story-based temporal summarization of 360° videos. Second, our model is the first attempt to leverage memory networks for video summarization tasks. For evaluation, we perform three sets of experiments. First, we investigate the view selection capability of our model on the Pano2Vid dataset. Second, we evaluate the temporal summarization with a newly collected 360° video dataset. Finally, we evaluate our model's performance in another domain, using the image-based storytelling VIST dataset. We verify that our model achieves state-of-the-art performance on all the tasks.
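
The two-stage idea in the abstract (score NFOV regions, then select subshots while reading from a past memory and a future memory) can be illustrated with a toy NumPy sketch. This is not the authors' PFMN: the function names (pick_views, read_memory, summarize), the random features, and the scoring rule (similarity to the remaining content minus similarity to what was already selected) are all assumptions made for illustration. Only the count of 81 region proposals and the past/future memory split come from the abstract.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def pick_views(region_feats, region_scores):
    """For each subshot, keep the feature of its highest-scoring NFOV region.

    region_feats:  (num_subshots, 81, dim) hypothetical region embeddings
    region_scores: (num_subshots, 81) hypothetical region scores
    """
    best = region_scores.argmax(axis=1)
    return region_feats[np.arange(len(best)), best]

def read_memory(memory, query):
    """Soft-attention read: attention-weighted sum of memory rows."""
    if memory.shape[0] == 0:
        return np.zeros_like(query)
    weights = softmax(memory @ query)
    return weights @ memory

def summarize(subshots, k):
    """Greedily pick up to k subshot indices in temporal order (toy rule)."""
    selected = []
    start = 0
    for _ in range(k):
        if start >= len(subshots):
            break
        past = subshots[np.array(selected, dtype=int)]  # past memory
        future = subshots[start:]                       # future memory
        query = future.mean(axis=0)            # summary of what remains
        redundancy = read_memory(past, query)  # what is already covered
        scores = future @ (query - redundancy)
        pick = start + int(scores.argmax())
        selected.append(pick)
        start = pick + 1  # only later subshots stay as candidates
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 81, 16))   # 12 subshots, 81 regions, 16-d
scores = rng.normal(size=(12, 81))
subshots = pick_views(feats, scores)
summary = summarize(subshots, k=3)
print(summary)
```

In this sketch the past memory grows by one embedding per step and the future memory shrinks to the subshots after the latest pick, which mirrors the abstract's description only loosely. The actual model learns its embeddings and scoring end to end.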

URL

https://arxiv.org/abs/1805.02838

PDF

https://arxiv.org/pdf/1805.02838.pdf
