Abstract
This study presents a novel approach to improving the cost-to-quality trade-off of image generation with diffusion models. We hypothesize that the differences between distilled (e.g., FLUX.1-schnell) and baseline (e.g., FLUX.1-dev) models are consistent and therefore learnable within a specialized domain such as portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using the two sets of low- and high-quality synthetic images, our model is trained to refine the output of the distilled generator (FLUX.1-schnell) to a quality comparable to that of the more computationally intensive baseline model (FLUX.1-dev). Our results show that the pipeline, which combines the distilled version of a large generative model with our enhancement layer, delivers photorealistic portraits comparable to those of the baseline model at up to an 82% reduction in computational cost relative to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.
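The paired-dataset step described above can be sketched as follows. This is a minimal, hypothetical illustration, assuming the public FLUX.1 checkpoints on Hugging Face and the diffusers FluxPipeline; the prompts, step counts, and output paths are placeholders, and the abstract does not state how the pairs are aligned beyond using the distilled and baseline models.

import os
import torch
from diffusers import FluxPipeline

# Distilled (fast) generator and baseline (slower, higher-quality) generator.
schnell = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")
dev = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

os.makedirs("pairs", exist_ok=True)
prompts = ["studio portrait of an elderly man, soft window light"]  # illustrative portrait-domain prompts

for i, prompt in enumerate(prompts):
    seed = 1234 + i  # same prompt and seed for both models, so both start from the same initial noise
    low = schnell(
        prompt,
        num_inference_steps=4,    # few-step distilled sampling
        guidance_scale=0.0,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    high = dev(
        prompt,
        num_inference_steps=50,   # full sampling schedule
        guidance_scale=3.5,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    low.save(f"pairs/{i:05d}_low.png")    # input to the refinement head
    high.save(f"pairs/{i:05d}_high.png")  # training target

An image-to-image translation head (whose architecture is not specified in the abstract) would then be trained on these low/high pairs, for example with a pixel-wise or perceptual reconstruction loss, and applied on top of FLUX.1-schnell outputs at inference time.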
URL
https://arxiv.org/abs/2505.02255