Paper Reading AI Learner

Video Object Segmentation using Space-Time Memory Networks

2019-04-01 07:27:24
Seoung Wug Oh, Joon-Young Lee, Ning Xu, Seon Joo Kim

Abstract

We propose a novel solution for semi-supervised video object segmentation. By the nature of the problem, the available cues (e.g., video frames with object masks) become richer as intermediate predictions accumulate. However, existing methods are unable to fully exploit this rich source of information. We resolve the issue by leveraging memory networks to learn to read relevant information from all available sources. In our framework, past frames with object masks form an external memory, and the current frame, as the query, is segmented using the mask information in the memory. Specifically, the query and the memory are densely matched in the feature space, covering all space-time pixel locations in a feed-forward fashion. In contrast to previous approaches, this abundant use of guidance information allows us to better handle challenges such as appearance changes and occlusions. We validate our method on the latest benchmark sets and achieve state-of-the-art performance (an overall score of 79.4 on the YouTube-VOS val set; J of 88.7 and 79.2 on the DAVIS 2016/2017 val sets, respectively) while maintaining a fast runtime (0.16 seconds/frame on the DAVIS 2016 val set).
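The core operation the abstract describes is a dense space-time memory read: every pixel of the query frame is matched against every pixel of every memorized past frame, and the memory's mask-bearing features are aggregated with softmax weights. The following is a minimal NumPy sketch of that read step under simplifying assumptions: dot-product similarity between flattened key maps, with key/value channel sizes chosen arbitrarily. In the actual network the keys and values are learned CNN embeddings; the function name and shapes here are illustrative, not the paper's API.

```python
import numpy as np

def space_time_memory_read(q_key, q_val, m_key, m_val):
    """Sketch of a dense space-time memory read (hypothetical shapes).

    q_key: (C, H*W)    key map of the current (query) frame
    q_val: (D, H*W)    value map of the query frame
    m_key: (C, T*H*W)  key maps of T memorized past frames, flattened
    m_val: (D, T*H*W)  value maps (carrying mask information) of those frames

    Returns a (2D, H*W) feature: memory read concatenated with the query value.
    """
    # Similarity between every query location and every space-time memory location.
    sim = q_key.T @ m_key                        # (H*W, T*H*W)
    # Softmax over the space-time memory axis (stabilized by subtracting the max).
    sim = sim - sim.max(axis=1, keepdims=True)
    w = np.exp(sim)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted sum of memory values: each query pixel reads from all of memory.
    read = m_val @ w.T                           # (D, H*W)
    # Combine the retrieved memory with the query's own value features.
    return np.concatenate([read, q_val], axis=0)
```

Because the read is a single feed-forward matrix product followed by a softmax, the number of memorized frames T can grow over the video without changing the network architecture, which is how richer intermediate predictions are exploited.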


URL

https://arxiv.org/abs/1904.00607

PDF

https://arxiv.org/pdf/1904.00607.pdf

