Paper Reading AI Learner

Video Semantic Salient Instance Segmentation: Benchmark Dataset and Baseline

2018-07-04 05:30:52
Trung-Nghia Le, Akihiro Sugimoto

Abstract

This paper pushes the envelope on salient regions in a video by decomposing them into semantically meaningful components, i.e., semantic salient instances. To address this video semantic salient instance segmentation task, we construct a new dataset, the Semantic Salient Instance Video (SESIV) dataset. Our SESIV dataset consists of 84 high-quality video sequences with pixel-wise per-frame ground-truth labels annotated for different segmentation tasks. We also provide a baseline for this problem, called the Fork-Join Strategy (FJS). FJS is a two-stream network leveraging the advantages of two different segmentation tasks, i.e., semantic instance segmentation and salient object segmentation. In FJS, we introduce a sequential fusion that combines the outputs of the two streams to obtain non-overlapping instances one by one. We also introduce a recurrent instance propagation to refine the shapes and semantic meanings of instances, and an identity tracking to maintain both the identity and the semantic meaning of an instance over the entire video. Experimental results demonstrate the effectiveness of our proposed FJS.
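The sequential fusion described above can be sketched in a few lines: instance masks from the semantic-instance stream are visited in confidence order, kept only if they overlap the salient-object mask sufficiently, and clipped against already-selected pixels so the result is non-overlapping. This is a minimal illustrative sketch, not the authors' implementation; the function name, the overlap threshold, and the confidence-ordering heuristic are all assumptions.

```python
import numpy as np

def sequential_fusion(saliency, instance_masks, scores, overlap_thresh=0.5):
    """Hypothetical sketch of sequential fusion: pick instances one by one,
    highest confidence first, keeping only those that are mostly salient,
    and removing already-claimed pixels so instances never overlap.
    `overlap_thresh` is an illustrative assumption, not from the paper."""
    order = np.argsort(scores)[::-1]           # highest-confidence first
    claimed = np.zeros_like(saliency, dtype=bool)  # pixels already assigned
    selected = []
    for i in order:
        mask = instance_masks[i] & ~claimed    # enforce non-overlap
        inter = (mask & saliency).sum()
        # keep the instance only if most of its remaining area is salient
        if mask.sum() > 0 and inter / mask.sum() >= overlap_thresh:
            selected.append(mask)
            claimed |= mask
    return selected
```

In this reading, the saliency stream acts as a gate on the instance stream, which matches the abstract's description of combining the two streams' outputs "one by one" into non-overlapping semantic salient instances.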

URL

https://arxiv.org/abs/1807.01452

PDF

https://arxiv.org/pdf/1807.01452.pdf

