Paper Reading AI Learner

Maximal Cliques on Multi-Frame Proposal Graph for Unsupervised Video Object Segmentation

2023-01-29 04:12:44
Jialin Yuan, Jay Patravali, Hung Nguyen, Chanho Kim, Li Fuxin

Abstract

Unsupervised Video Object Segmentation (UVOS) aims at discovering objects and tracking them through videos. For accurate UVOS, we observe that if one can locate precise segment proposals on key frames, subsequent processes become much simpler. Hence, we propose to reason about key frame proposals using a graph built with the object probability masks initially generated from multiple frames around the key frame and then propagated to the key frame. On this graph, we compute maximal cliques, with each clique representing one candidate object. By having the multiple proposals in a clique vote for the key frame proposal, we obtain refined key frame proposals that can be better than any single-frame proposal. A semi-supervised VOS algorithm subsequently tracks these key frame proposals through the entire video. Our algorithm is modular and hence can be used with any instance segmentation and semi-supervised VOS algorithm. We achieve state-of-the-art performance on the DAVIS-2017 validation and test-dev datasets. On the related problem of video instance segmentation, our method shows competitive performance with the previous best algorithm, which requires joint training with the VOS algorithm.
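The pipeline the abstract outlines — build a graph over propagated proposal masks, enumerate maximal cliques, and let each clique vote for a refined key frame proposal — can be sketched as follows. This is a minimal illustration, not the authors' implementation: masks are simplified to pixel-index sets, edges use a plain IoU threshold (the value 0.5 is an assumed parameter), cliques are found with the standard Bron-Kerbosch algorithm, and voting is a pixel-wise majority; the paper's actual affinity measure and voting scheme may differ.

```python
from itertools import combinations

def iou(a, b):
    """Intersection-over-union of two binary masks given as pixel-index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def build_graph(masks, thresh=0.5):
    """Connect proposals whose masks (propagated to the key frame) overlap strongly.
    `thresh` is an assumed hyperparameter, not a value from the paper."""
    adj = {i: set() for i in range(len(masks))}
    for i, j in combinations(range(len(masks)), 2):
        if iou(masks[i], masks[j]) >= thresh:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting: yields every maximal clique as a vertex set.
    Each clique is one candidate object hypothesis."""
    def bk(r, p, x):
        if not p and not x:
            yield set(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v]))
        for v in list(p - adj[pivot]):
            yield from bk(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    yield from bk(set(), set(adj), set())

def vote(masks, clique, quorum=0.5):
    """Pixel-wise majority vote over a clique's masks -> refined key frame proposal."""
    counts = {}
    for i in clique:
        for px in masks[i]:
            counts[px] = counts.get(px, 0) + 1
    need = quorum * len(clique)
    return {px for px, c in counts.items() if c >= need}

# Toy example: three overlapping proposals of one object, one unrelated proposal.
masks = [{1, 2, 3}, {2, 3, 4}, {2, 3}, {10, 11}]
cliques = list(maximal_cliques(build_graph(masks)))   # {0,1,2} and {3}
refined = vote(masks, {0, 1, 2})                      # pixels kept by a majority
```

On the toy input, the three mutually overlapping proposals form one clique whose vote keeps the pixels shared by a majority of its members, illustrating how a clique-level consensus can outperform any single proposal.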

URL

https://arxiv.org/abs/2301.12352

PDF

https://arxiv.org/pdf/2301.12352.pdf

