InstMove: Instance Motion for Object-centric Video Segmentation

2023-03-14 17:58:44
Qihao Liu, Junfeng Wu, Yi Jiang, Xiang Bai, Alan Yuille, Song Bai

Abstract

Despite significant efforts, cutting-edge video segmentation methods remain sensitive to occlusion and rapid movement, because they rely on object appearance in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but optical flow essentially considers only pixel-level motion, which still relies on appearance similarity and hence is often inaccurate under occlusion and fast movement. In this work, we study instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings and has a physical interpretation, making it more accurate and more robust to occlusion and fast-moving objects. To better fit video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns a dynamic model through a memory network to predict the object's position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve on prior art by 1.5 AP on the OVIS dataset, which features heavy occlusions, and by 4.9 AP on the YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serves as a powerful solution in complex scenarios for object-centric video segmentation.
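
To make the idea of appearance-free, instance-level motion concrete, here is a minimal sketch in plain Python. It is not the InstMove model: the abstract describes a learned dynamic model built on a memory network, whereas this toy substitutes a constant-velocity assumption on the mask centroid. The names `predict_next_mask`, `match_by_motion`, `iou`, `centroid`, and `box` are hypothetical helpers introduced only for illustration, and masks are assumed to be sets of (row, col) pixel coordinates.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]   # (row, col)
Mask = Set[Pixel]         # binary instance mask as a set of foreground pixels


def centroid(mask: Mask) -> Tuple[float, float]:
    """Mean (row, col) of the foreground pixels."""
    if not mask:
        raise ValueError("empty mask has no centroid")
    n = len(mask)
    return (sum(r for r, _ in mask) / n, sum(c for _, c in mask) / n)


def predict_next_mask(prev: Mask, curr: Mask) -> Mask:
    """Toy stand-in for a learned motion model (assumption, not InstMove):
    shift the current mask by the centroid displacement from prev to curr."""
    (r0, c0), (r1, c1) = centroid(prev), centroid(curr)
    dr, dc = round(r1 - r0), round(c1 - c0)
    return {(r + dr, c + dc) for r, c in curr}


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two masks; 0.0 if both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_by_motion(predicted: Mask, candidates: List[Mask]) -> int:
    """Index of the candidate mask that best overlaps the motion prediction.
    No image features are used, only mask geometry."""
    return max(range(len(candidates)), key=lambda i: iou(predicted, candidates[i]))


def box(top: int, left: int, height: int, width: int) -> Mask:
    """Rectangular mask, used here only to build a small example."""
    return {(r, c) for r in range(top, top + height)
            for c in range(left, left + width)}


if __name__ == "__main__":
    prev, curr = box(0, 0, 2, 2), box(0, 3, 2, 2)    # object moved 3 columns right
    pred = predict_next_mask(prev, curr)             # expect another 3-column shift
    candidates = [box(0, 0, 2, 2), box(0, 6, 2, 2)]  # next-frame mask proposals
    print(match_by_motion(pred, candidates))         # prints 1
```

In this sketch the association between frames depends only on mask geometry, which is the property the abstract attributes to instance-level motion. A learned model would replace `predict_next_mask` and could also predict shape changes, which a pure translation cannot.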

URL

https://arxiv.org/abs/2303.08132

PDF

https://arxiv.org/pdf/2303.08132.pdf

