Abstract
CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
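The abstract describes the recipe only at a high level. As a rough, illustrative sketch (not the authors' implementation), the snippet below optimizes a small SIREN-style coordinate MLP so that the image it renders matches a text prompt in CLIP's embedding space, with CLIP's weights frozen throughout. The network shape, prompt, resolution, learning rate, and step count are assumptions for illustration, and the paper's frequency stratification, adversarially robust initialization, Orthogonal Procrustes alignment, and blending loss are omitted.

```python
# Minimal sketch of decoder-free, CLIP-guided image synthesis with an
# implicit neural representation. Assumes `pip install torch` and OpenAI's
# CLIP package (github.com/openai/CLIP). Hyperparameters are illustrative.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float().eval()
for p in model.parameters():          # CLIP stays frozen; only the INR is optimized
    p.requires_grad_(False)

class Siren(nn.Module):
    """Simplified SIREN-style coordinate MLP mapping (x, y) -> RGB."""
    def __init__(self, hidden=256, layers=4, w0=30.0):
        super().__init__()
        dims = [2] + [hidden] * layers + [3]
        self.w0 = w0
        self.linears = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, coords):
        h = coords
        for i, lin in enumerate(self.linears[:-1]):
            h = torch.sin(self.w0 * lin(h)) if i == 0 else torch.sin(lin(h))
        return torch.sigmoid(self.linears[-1](h))  # RGB in [0, 1]

res = 224  # input resolution of CLIP ViT-B/32
ys, xs = torch.meshgrid(torch.linspace(-1, 1, res),
                        torch.linspace(-1, 1, res), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).to(device)

inr = Siren().to(device)
opt = torch.optim.Adam(inr.parameters(), lr=1e-4)

# Standard CLIP preprocessing statistics.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

with torch.no_grad():
    text = clip.tokenize(["a watercolor painting of a lighthouse at dusk"]).to(device)
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

for step in range(500):
    rgb = inr(coords).reshape(res, res, 3).permute(2, 0, 1).unsqueeze(0)  # (1, 3, H, W)
    img_emb = model.encode_image((rgb - mean) / std)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_emb * text_emb).sum()  # maximize cosine similarity to the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```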
Abstract (translated)
CLIP is a discriminative model trained to align images and text in a shared embedding space. Because of its multimodal structure, it serves as the foundation of many generative pipelines, in which a decoder is trained to map from the shared space back to images. In this work, however, we show that image synthesis is possible using CLIP alone, without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequency bands across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection that aligns local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities including text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
URL
https://arxiv.org/abs/2505.23161