Abstract
This report introduces Aquarius, a family of industry-level video generation models for marketing scenarios, designed for clusters of thousands of xPUs and models with hundreds of billions of parameters. Leveraging efficient engineering architecture and algorithmic innovation, Aquarius demonstrates exceptional performance in high-fidelity, multi-aspect-ratio, and long-duration video synthesis. By disclosing the framework's design details, we aim to demystify industrial-scale video generation systems and catalyze advancements in the generative video community.

The Aquarius framework consists of five components:

1. **Distributed graph and video data processing pipeline**: manages tens of thousands of CPUs and thousands of xPUs via automated task distribution, enabling efficient video data processing. We also plan to open-source the entire data processing framework as "Aquarius-Datapipe".
2. **Model architectures for different scales**: a Single-DiT architecture for the 2B model and a Multimodal-DiT architecture for the 13.4B model, supporting multi-aspect-ratio, multi-resolution, and multi-duration video generation.
3. **High-performance infrastructure for video generation model training**: hybrid parallelism and fine-grained memory optimization strategies achieve 36% MFU (Model FLOPs Utilization) at large scale.
4. **Multi-xPU parallel inference acceleration**: diffusion caching and attention optimization achieve a 2.35x inference speedup.
5. **Marketing-scenario applications**: image-to-video, text-to-video (avatar), video inpainting, and video personalization, among others.

More downstream applications and multi-dimensional evaluation metrics will be added in upcoming version updates.
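To make the 36% MFU figure concrete, the sketch below computes Model FLOPs Utilization the standard way: achieved training FLOPs per second divided by the accelerator's peak FLOPs. The throughput and peak-FLOPs numbers are illustrative assumptions, not figures reported for Aquarius.

```python
def transformer_flops_per_token(n_params: float) -> float:
    """Approximate training FLOPs per token for a dense transformer:
    ~6 * N (forward + backward), ignoring the attention quadratic term."""
    return 6.0 * n_params

def mfu(n_params: float, tokens_per_sec_per_xpu: float, peak_flops_per_xpu: float) -> float:
    # Achieved FLOP/s per accelerator divided by its theoretical peak.
    achieved = transformer_flops_per_token(n_params) * tokens_per_sec_per_xpu
    return achieved / peak_flops_per_xpu

# Hypothetical inputs: a 13.4B-parameter model, 4,500 tokens/s per
# accelerator, and a 989 TFLOP/s BF16 peak (H100-class device).
print(f"MFU = {mfu(13.4e9, 4500, 989e12):.1%}")
```

With these assumed numbers the result lands near the paper's 36%; in practice the throughput term is measured from the training loop, and the per-token FLOP count would include attention and any activation recomputation.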
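The diffusion cache behind the inference speedup can be illustrated with a toy denoising loop: on most steps the expensive deep trunk of the network is skipped and its output from the last full step is reused, while cheap shallow layers still run every step. The function names and the scalar stand-in "network" are hypothetical, not Aquarius APIs.

```python
calls = {"deep": 0}  # count how often the expensive trunk actually runs

def cheap_shallow_blocks(x: float, step: int) -> float:
    return 0.99 * x  # stand-in for inexpensive early layers

def costly_deep_blocks(h: float, step: int) -> float:
    calls["deep"] += 1
    return 0.5 * h   # stand-in for the expensive deep trunk

def denoise(x: float, num_steps: int = 50, cache_interval: int = 2) -> float:
    cached_deep = None
    for step in range(num_steps):
        shallow = cheap_shallow_blocks(x, step)
        # Recompute the deep trunk only every `cache_interval` steps;
        # otherwise reuse the cached output from the last full step.
        if cached_deep is None or step % cache_interval == 0:
            cached_deep = costly_deep_blocks(shallow, step)
        x = shallow + cached_deep  # combine shallow path with (possibly cached) deep path
    return x

denoise(1.0)
print(calls["deep"])  # 25: the deep trunk ran on only half of the 50 steps
```

With `cache_interval=2`, half the deep-trunk evaluations are elided; real schemes (e.g. DeepCache-style feature caching) exploit the observation that deep features change slowly across adjacent denoising steps, and combine this with attention optimizations to reach speedups like the 2.35x cited here.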
URL
https://arxiv.org/abs/2505.10584