Abstract
The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets such as Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated, high-visual-quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation, and it provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline that includes shot boundary detection, OCR, border detection, motion filtering, and fine-grained bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: this https URL
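The abstract names the stages of the curation pipeline (shot boundary detection, OCR, border detection, motion filtering) without implementation detail. The sketch below illustrates what such stages could look like; the library choices (PySceneDetect, EasyOCR, OpenCV) and all thresholds are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch of a video-curation pipeline of the kind the abstract describes.
# Libraries and thresholds are hypothetical choices, not Tiger200K's implementation.
import cv2
import numpy as np
import easyocr
from scenedetect import detect, ContentDetector

ocr_reader = easyocr.Reader(["en", "ch_sim"])  # bilingual OCR (assumed choice)

def split_into_shots(video_path):
    """Shot boundary detection: return (start, end) timecode pairs per shot."""
    return detect(video_path, ContentDetector())

def has_burned_in_text(frame_bgr, max_boxes=2):
    """OCR filter: reject frames with heavy overlaid text (subtitles, watermarks)."""
    results = ocr_reader.readtext(frame_bgr)
    return len(results) > max_boxes

def has_letterbox_border(frame_bgr, dark_thresh=10.0, border_frac=0.05):
    """Border detection: flag frames whose outer rows/columns are nearly black."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    bh, bw = max(1, int(h * border_frac)), max(1, int(w * border_frac))
    edges = [gray[:bh], gray[-bh:], gray[:, :bw], gray[:, -bw:]]
    return any(e.mean() < dark_thresh for e in edges)

def mean_motion(video_path, start_frame, end_frame, stride=5):
    """Motion filter: average dense optical-flow magnitude over a shot."""
    cap = cv2.VideoCapture(video_path)
    prev, mags = None, []
    for idx in range(start_frame, end_frame, stride):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            flow = cv2.calcOpticalFlowFarneback(
                prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            mags.append(np.linalg.norm(flow, axis=2).mean())
        prev = gray
    cap.release()
    return float(np.mean(mags)) if mags else 0.0
```

In a pipeline like this, shots that pass the text, border, and motion checks would then be handed to a captioning model for the fine-grained bilingual captions the abstract mentions; that step is omitted here.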
URL
https://arxiv.org/abs/2504.15182