LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation

2024-04-30 07:50:29
Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, Jinglun Li, Zhaoyu Chen, Shuyong Gao, Wei Zhang, Wenqiang Zhang

Abstract

Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, existing VOS benchmarks mainly focus on short-term videos lasting about 5 seconds, where objects remain visible most of the time. These benchmarks poorly represent practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. We therefore propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average, approximately 5 times longer than videos in existing datasets. Each video includes various attributes, especially challenges that arise in the wild, such as long-term reappearance and cross-temporally similar objects. Compared to previous benchmarks, LVOS better reflects the performance of VOS models in real scenarios. Based on LVOS, we evaluate 20 existing VOS models under 4 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that the key factor in the accuracy decline is the increased video length, emphasizing LVOS's crucial role. We hope LVOS can advance the development of VOS in real scenes. Data and code are available at this https URL.

URL

https://arxiv.org/abs/2404.19326

PDF

https://arxiv.org/pdf/2404.19326.pdf

