FiT: Flexible Vision Transformer for Diffusion Model

2024-02-19 18:59:07
Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai


Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at this https URL.
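To make the idea of treating an image as a sequence of dynamically-sized tokens more concrete, here is a minimal sketch of how images of different shapes could map to token sequences of a shared padded length. It is an illustration under stated assumptions, not the authors' implementation: the patch size, the `max_tokens` budget, and the function names `token_grid` and `tokenize_shape` are all hypothetical.

```python
from typing import List, Tuple


def token_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid for an input of the given size."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch, width // patch


def tokenize_shape(
    height: int,
    width: int,
    patch: int = 2,         # assumed patch size, chosen for illustration
    max_tokens: int = 256,  # assumed fixed sequence budget
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Map an input shape to padded 2D token positions and a validity mask."""
    rows, cols = token_grid(height, width, patch)
    # Row-major 2D position of every real token.
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    num_real = len(positions)
    if num_real > max_tokens:
        raise ValueError("input yields more tokens than max_tokens")
    # Mask is 1 for real tokens and 0 for padding.
    mask = [1] * num_real + [0] * (max_tokens - num_real)
    # Padding positions are placeholders; the mask marks them as invalid.
    positions += [(0, 0)] * (max_tokens - num_real)
    return positions, mask


if __name__ == "__main__":
    for h, w in [(32, 32), (16, 64), (24, 40)]:
        _, mask = tokenize_shape(h, w)
        print(f"{h}x{w}: {sum(mask)} real tokens, {len(mask) - sum(mask)} padding")
```

In this sketch a square 32x32 input and a wide 16x64 input both fill the 256-token budget exactly, while a 24x40 input yields 240 real tokens and 16 padding tokens. Because only the sequence length is fixed, different aspect ratios can share a batch without cropping.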
