Paper Reading AI Learner

RecTok: Reconstruction Distillation along Rectified Flow

2025-12-15 15:14:20
Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang, Yunhai Tong, Xiangtai Li, Xuelong Li

Abstract

Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of the latent features. However, there is a fundamental trade-off between dimensionality and generation quality, which constrains existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction-alignment distillation. Our key insight is to make the forward flow in flow matching, which serves as the training space of diffusion transformers, semantically rich, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss. RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art gFID-50K results both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, we observe consistent improvements as the latent dimensionality increases. Code and model are available at this https URL.
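The idea of distilling semantics into the forward flow, rather than into the clean latent alone, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a linear path x_t = (1 - t) * latent + t * noise, a cosine-similarity distillation loss, and a hypothetical `semantic_head` that maps a noised latent to the teacher's feature space. The teacher features here are placeholder numbers standing in for frozen VFM outputs.

```python
import math
import random


def forward_flow(latent, noise, t):
    """Point on the linear path x_t = (1 - t) * latent + t * noise (assumed convention)."""
    return [(1.0 - t) * z + t * e for z, e in zip(latent, noise)]


def cosine_distill_loss(student_feat, teacher_feat, eps=1e-8):
    """1 - cosine similarity; an assumed stand-in for the distillation objective."""
    dot = sum(s * q for s, q in zip(student_feat, teacher_feat))
    norm_s = math.sqrt(sum(s * s for s in student_feat))
    norm_t = math.sqrt(sum(q * q for q in teacher_feat))
    return 1.0 - dot / (norm_s * norm_t + eps)


def flow_distill_step(latent, teacher_feat, semantic_head, rng):
    """Sample t and noise, move the latent along the flow, and score the head's output."""
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    x_t = forward_flow(latent, noise, t)
    return cosine_distill_loss(semantic_head(x_t), teacher_feat)


rng = random.Random(0)
latent = [0.5, -1.0, 2.0, 0.25]        # toy tokenizer latent
teacher_feat = [1.0, 0.0, 1.0, 0.0]    # placeholder for frozen VFM features
identity_head = lambda x: x            # hypothetical; a real head would be learned
loss = flow_distill_step(latent, teacher_feat, identity_head, rng)
print(f"distillation loss at a random t: {loss:.4f}")
```

In this sketch the loss is evaluated at a randomly sampled t, so the semantic signal is applied to noised points along the trajectory and not only to the clean latent at t = 0. The masked feature reconstruction loss mentioned in the abstract is not modeled here.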

URL

https://arxiv.org/abs/2512.13421

PDF

https://arxiv.org/pdf/2512.13421.pdf

