Paper Reading AI Learner

Towards Scalable Pre-training of Visual Tokenizers for Generation

2025-12-15 18:59:54
Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang

Abstract

The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not yield higher-quality generation. As a result, pouring extensive compute into visual tokenizer pre-training translates poorly into improved generative performance. We identify this as the "pre-training scaling problem" and argue for a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework that pioneers the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) the joint objective exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPs in pre-training VTP achieves a 65.8% FID improvement in downstream generation, whereas a conventional autoencoder stagnates very early, at one-tenth of the FLOPs. Our pre-trained models are available at this https URL.
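The abstract describes VTP's objective as a joint optimization of three losses: image-text contrastive, self-supervised, and reconstruction. The paper's actual architecture and loss weighting are not given here, so the following is only a minimal PyTorch sketch of what such a combined objective could look like; all function and parameter names (`vtp_joint_loss`, the `w_*` weights, the EMA-teacher-style self-supervised term) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vtp_joint_loss(recon, target, img_emb, txt_emb, ssl_pred, ssl_target,
                   w_rec=1.0, w_con=1.0, w_ssl=1.0, temperature=0.07):
    """Hypothetical combination of the three losses named in the abstract.

    recon/target:       decoded pixels vs. input image (reconstruction term)
    img_emb/txt_emb:    pooled image/text embeddings (contrastive term)
    ssl_pred/ssl_target: student vs. teacher features (self-supervised term,
                        assumed here to be a feature-alignment loss)
    """
    # Reconstruction: standard pixel-level fidelity term.
    rec = F.mse_loss(recon, target)

    # Image-text contrastive: CLIP-style symmetric InfoNCE over the batch,
    # where matching image/text pairs sit on the diagonal.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    con = 0.5 * (F.cross_entropy(logits, labels) +
                 F.cross_entropy(logits.t(), labels))

    # Self-supervised: align normalized student features to a (detached)
    # teacher view -- one common choice; the paper may use another.
    ssl = F.mse_loss(F.normalize(ssl_pred, dim=-1),
                     F.normalize(ssl_target.detach(), dim=-1))

    return w_rec * rec + w_con * con + w_ssl * ssl
```

The point of jointly weighting the three terms is that the contrastive and self-supervised losses push the latent space toward high-level semantics, while the reconstruction loss preserves enough low-level detail for decoding.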

URL

https://arxiv.org/abs/2512.13687

PDF

https://arxiv.org/pdf/2512.13687.pdf

