Abstract
Video generation techniques have made remarkable progress, promising to serve as the foundation of interactive world exploration. However, existing video generation datasets are not well suited for world-exploration training, as they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning ``world'' in Japanese), a high-quality first-person-view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We use a subset to train an interactive video world exploration model, named YUME (meaning ``dream'' in Japanese). We believe Sekai will benefit the areas of video generation and world exploration, and motivate valuable applications.
URL
https://arxiv.org/abs/2506.15675