YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark

Abstract

Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods that capture temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions. End-to-end sequential learning of spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets: even the largest existing video segmentation dataset contains only 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 4,453 YouTube video clips and 94 object categories. To our knowledge, this is by far the largest video object segmentation dataset, and it has been released at this http URL. We further evaluate several existing state-of-the-art video object segmentation algorithms on this dataset, aiming to establish baselines for the development of new algorithms in the future.

URL

https://arxiv.org/abs/1809.03327

PDF

https://arxiv.org/pdf/1809.03327.pdf