Abstract
Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of the latent features. However, there is an inherent trade-off between dimensionality and generation quality, which constrains existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction-alignment distillation. Our key insight is to make the forward flow in flow matching, which serves as the training space of diffusion transformers, semantically rich, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss. RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art gFID-50K results both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, we observe consistent improvements as the latent dimensionality increases. Code and model are available at this https URL.
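To make the idea of distilling semantics into the forward flow more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the common linear flow-matching convention x_t = (1 - t) * x_0 + t * noise (t = 0 is data, t = 1 is noise) and a cosine-similarity alignment loss. The abstract specifies neither, and every name here (`forward_flow`, `cosine_distill_loss`, the linear head `W`, the random stand-in for VFM features) is hypothetical.

```python
import numpy as np

def forward_flow(x0, noise, t):
    """Linear forward flow x_t = (1 - t) * x0 + t * noise (assumed convention)."""
    # Reshape t so it broadcasts over every non-batch axis of x0.
    t = np.asarray(t, dtype=x0.dtype).reshape(-1, *([1] * (x0.ndim - 1)))
    return (1.0 - t) * x0 + t * noise

def cosine_distill_loss(student_feats, teacher_feats, eps=1e-8):
    """Mean (1 - cosine similarity) over tokens; an assumed alignment loss."""
    s = student_feats / (np.linalg.norm(student_feats, axis=-1, keepdims=True) + eps)
    v = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * v, axis=-1)))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 16, 8))        # hypothetical latents: (batch, tokens, latent_dim)
noise = rng.standard_normal(x0.shape)       # Gaussian noise endpoint of the flow
t = rng.uniform(size=2)                     # one timestep per sample
xt = forward_flow(x0, noise, t)             # points on the forward flow trajectory

W = rng.standard_normal((8, 32)) * 0.1      # hypothetical linear head mapping latents to feature space
teacher = rng.standard_normal((2, 16, 32))  # random stand-in for frozen VFM features
loss = cosine_distill_loss(xt @ W, teacher) # alignment is applied to x_t, not only to x0
```

The point of the sketch is only where the alignment is applied: to the noised points x_t along the trajectory rather than to the clean latent x_0 alone.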
URL
https://arxiv.org/abs/2512.13421