Abstract
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on top of the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains such as text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, a multimodal model can directly reason over generated latents, opening new possibilities for unified models.
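The abstract notes that dimension-dependent noise scheduling remains critical when diffusing in high-dimensional latent spaces. A common way to implement this (e.g., the resolution/dimension-dependent timestep shift popularized by flow-matching models) is to push the sampled timesteps toward higher noise levels as latent dimensionality grows. The sketch below is a minimal, hypothetical illustration of such a shift; the function name, the `base_dim` reference value, and the square-root scaling rule are assumptions for illustration, not the paper's exact formulation.

```python
import math

def shift_timesteps(t: float, dim: int, base_dim: int = 64) -> float:
    """Shift a timestep t in [0, 1] toward higher noise for larger latent dims.

    Hypothetical rule: the shift factor grows with the square root of the
    ratio between the target latent dimension and a reference dimension.
    """
    alpha = math.sqrt(dim / base_dim)
    # Rational shift: preserves the endpoints t=0 and t=1,
    # but moves intermediate timesteps toward the high-noise end.
    return alpha * t / (1 + (alpha - 1) * t)

# For a high-dimensional semantic latent (e.g., dim=768), a mid-schedule
# timestep is shifted noticeably toward 1 (more noise) relative to dim=64.
```

Under this rule, a larger latent dimension means more of the training signal is spent at high noise levels, which is the intuition behind why schedules tuned for low-dimensional VAE latents transfer poorly to semantic RAE latents.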
URL
https://arxiv.org/abs/2601.16208