Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios

2025-05-14 13:39:53
Huafeng Shi, Jianzhong Liang, Rongchang Xie, Xian Wu, Cheng Chen, Chang Liu

Abstract

This report introduces Aquarius, a family of industry-level video generation models for marketing scenarios, designed for clusters of thousands of xPUs and models with hundreds of billions of parameters. Leveraging efficient engineering architecture and algorithmic innovation, Aquarius demonstrates exceptional performance in high-fidelity, multi-aspect-ratio, and long-duration video synthesis. By disclosing the framework's design details, we aim to demystify industrial-scale video generation systems and catalyze advancements in the generative video community. The Aquarius framework consists of five components:

1. Distributed Graph and Video Data Processing Pipeline: Manages tens of thousands of CPUs and thousands of xPUs via automated task distribution, enabling efficient video data processing. We also plan to open-source the entire data processing framework, named "Aquarius-Datapipe".

2. Model Architectures for Different Scales: Includes a Single-DiT architecture for 2B models and a Multimodal-DiT architecture for 13.4B models, supporting multi-aspect-ratio, multi-resolution, and multi-duration video generation.

3. High-Performance Infrastructure for Video Generation Model Training: Incorporates hybrid parallelism and fine-grained memory optimization strategies, achieving 36% MFU at large scale.

4. Multi-xPU Parallel Inference Acceleration: Uses diffusion cache and attention optimization to achieve a 2.35x inference speedup.

5. Multiple Marketing-Scenario Applications: Includes image-to-video, text-to-video (avatar), video inpainting, and video personalization, among others.

More downstream applications and multi-dimensional evaluation metrics will be added in upcoming version updates.
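
For readers unfamiliar with the metric, MFU (Model FLOPs Utilization) is the ratio of the FLOP/s a training job actually spends on model computation to the theoretical peak FLOP/s of the hardware. The sketch below shows one common way to estimate it. It is not taken from the report: the function names are hypothetical, and the "6 FLOPs per parameter per token" rule is a widely used approximation for dense transformers that ignores attention-score FLOPs, which can be significant for long video sequences.

```python
def dense_transformer_train_flops(num_params: float, tokens_per_step: float) -> float:
    """Approximate training FLOPs for one optimizer step.

    Assumption: ~6 FLOPs per parameter per token (forward + backward),
    ignoring attention-score FLOPs. This is a rule of thumb, not the
    accounting used in the Aquarius report.
    """
    return 6 * num_params * tokens_per_step


def model_flops_utilization(
    flops_per_step: float,
    step_time_s: float,
    num_devices: int,
    peak_flops_per_device: float,
) -> float:
    """Fraction of the cluster's theoretical peak FLOP/s used for model math."""
    achieved_flops_per_s = flops_per_step / step_time_s
    peak_flops_per_s = num_devices * peak_flops_per_device
    return achieved_flops_per_s / peak_flops_per_s


if __name__ == "__main__":
    # 13.4B parameters comes from the abstract; every other number here is
    # a made-up placeholder chosen only to exercise the formula.
    flops = dense_transformer_train_flops(num_params=13.4e9, tokens_per_step=1.0e6)
    mfu = model_flops_utilization(
        flops_per_step=flops,
        step_time_s=1.0,
        num_devices=1024,
        peak_flops_per_device=300e12,
    )
    print(f"Estimated MFU: {mfu:.1%}")
```

Under this definition, an MFU of 36% would mean that roughly a third of the cluster's peak arithmetic throughput goes to useful model computation, with the remainder lost to communication, memory stalls, and other overheads.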

URL

https://arxiv.org/abs/2505.10584

PDF

https://arxiv.org/pdf/2505.10584.pdf

