Abstract
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at this https URL.
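The abstract pairs an autoregressive language head with a flow-matching flow head for image/video generation. As a rough illustration of the flow-matching side, the sketch below computes the standard linear-interpolant (rectified-flow-style) training target: a point on the straight path from noise to data and the constant velocity a model would be trained to predict there. This is a generic formulation assumed for illustration; Show-o2's exact parameterization and heads may differ.

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear-interpolant flow matching (illustrative, not Show-o2's exact form):
    x0 is a noise sample, x1 a data latent, t in [0, 1] a timestep.
    Returns the interpolated point x_t and the target velocity v_t."""
    x_t = (1.0 - t) * x0 + t * x1  # point on the straight noise-to-data path
    v_t = x1 - x0                  # constant velocity the flow head should predict
    return x_t, v_t

def flow_matching_loss(v_pred, v_target):
    """Mean-squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

# Toy usage on random latents:
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # noise latents
x1 = rng.standard_normal((4, 8))   # "data" latents
x_t, v_t = flow_matching_target(x0, x1, t=0.5)
```

A perfect velocity prediction drives the loss to zero, which is what the flow head is optimized toward during training.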
URL
https://arxiv.org/abs/2506.15564