Abstract
SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture in which one branch synthesizes video and the other generates temporally aligned audio, while both share a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multimodal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multimodal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and that naturally extends to vision-referenced inpainting and editing via multimodal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame-interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multimodal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.
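The channel-concatenation formulation mentioned above can be illustrated with a minimal sketch. The abstract does not specify the tensor layout or mask convention, so everything here is an assumption for illustration: the names `frame_mask` and `concat_channels` are hypothetical, the mask is frame-level only (1 = generate, 0 = given as context), and each frame is a flat list of channel values rather than a latent tensor. Video editing would typically need a spatial mask inside each frame, which this toy version does not model.

```python
from typing import List


def frame_mask(num_frames: int, task: str, known: int = 1) -> List[int]:
    """Per-frame mask (assumed convention): 1 = generate, 0 = given as context."""
    if task == "text_to_video":
        # Nothing is given; every frame is generated.
        return [1] * num_frames
    if task == "image_to_video":
        # Only the first frame is given.
        return [0] + [1] * (num_frames - 1)
    if task == "video_extension":
        # The first `known` frames are given; the rest are generated.
        return [0] * known + [1] * (num_frames - known)
    raise ValueError(f"unsupported task: {task}")


def concat_channels(
    noisy: List[List[float]],
    context: List[List[float]],
    mask: List[int],
) -> List[List[float]]:
    """Build one model input per frame by concatenating along the channel axis:
    noisy channels, context channels (zeroed where the frame is generated),
    and a single mask channel."""
    frames = []
    for n, c, m in zip(noisy, context, mask):
        kept = [x * (1 - m) for x in c]  # hide context for frames to generate
        frames.append(list(n) + kept + [float(m)])
    return frames


# Two frames, two channels each; image-to-video keeps only frame 0 as context.
mask = frame_mask(2, "image_to_video")
inputs = concat_channels([[0.5, 0.5], [0.5, 0.5]], [[1.0, 2.0], [1.0, 2.0]], mask)
```

Under these assumptions the tasks differ only in the mask, so a single network input shape (here, 2 noisy + 2 context + 1 mask channels per frame) serves all of them.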
URL
https://arxiv.org/abs/2602.21818