Paper Reading AI Learner

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

2026-01-22 18:58:16
Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie

Abstract

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
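The abstract singles out dimension-dependent noise scheduling as the one design choice that remains critical at scale. The intuition is that a high-dimensional latent (e.g. SigLIP-2 features) retains more recoverable signal at a given noise level than a low-dimensional VAE latent, so training should be biased toward higher-noise timesteps as dimensionality grows. A minimal sketch of this idea, using the monotone timestep warp common in flow-matching models; the `sqrt(d_latent / d_base)` shift factor and the `d_base` value here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def shift_timesteps(t, d_latent, d_base=16, alpha=None):
    """Dimension-dependent timestep shift (illustrative sketch).

    Warps t in [0, 1] toward 1 (higher noise) by a factor alpha that
    grows with latent dimensionality. alpha = sqrt(d_latent / d_base)
    and d_base = 16 are assumptions made for this example.
    """
    t = np.asarray(t, dtype=np.float64)
    if alpha is None:
        alpha = np.sqrt(d_latent / d_base)
    # Monotone warp on [0, 1] with fixed endpoints:
    #   t' = alpha * t / (1 + (alpha - 1) * t)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

t = np.linspace(0.0, 1.0, 5)
print(shift_timesteps(t, d_latent=768))  # high-dim latent: t pushed toward 1
print(shift_timesteps(t, d_latent=16))   # d_latent == d_base: identity warp
```

With `d_latent = d_base` the warp reduces to the identity, so a low-dimensional VAE latent keeps the standard schedule, while a 768-channel representation latent samples proportionally more high-noise timesteps.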


URL

https://arxiv.org/abs/2601.16208

PDF

https://arxiv.org/pdf/2601.16208.pdf

