Abstract
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at this https URL
Abstract (translated)
我们通过实证研究了从视频中进行自回归预训练的方法。为了开展这项研究,我们构建了一系列名为Toto的自回归视频模型。我们将视频视为视觉令牌序列,并训练变压器模型以自回归方式预测未来的令牌。我们的模型在包含超过1万亿个视觉令牌的多样化数据集(包括视频和图像)上进行了预训练。我们在架构、训练和推理设计选择方面做了各种探索。我们评估了所学习到的视觉表示形式在一系列下游任务上的表现,包括图像识别、视频分类、对象跟踪和机器人技术。我们的研究结果表明,尽管具有最少的归纳偏置,自回归预训练仍然能在一个广泛的标准上取得竞争性的性能。 最后,我们发现随着视频模型规模的增长,其扩展曲线与语言模型相似,不过增长率有所不同。更多详情请参阅此链接(在实际回答中应提供具体网址,此处以“https URL”表示)。
URL
https://arxiv.org/abs/2501.05453