Abstract
In this paper we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. The learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to tune their embeddings directly in the latent space without any loss in quality. This accelerates training and potentially enables the use of alternative encoders that lack a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on the CLIP embeddings of backlit and well-lit images from the training data, we compute a residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach reduces training time even further, stabilizes training, and produces high-quality enhanced images without artifacts, in both supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in the training data and thereby enabling potential bias correction.
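The residual-vector guidance described in the abstract reduces to a mean-difference computation in the CLIP embedding space. The following is a minimal illustrative sketch, not the authors' implementation: embeddings are represented as plain Python lists of floats, whereas in practice they would be L2-normalized CLIP image embeddings produced by a vision encoder. All function names here are hypothetical.

```python
def mean_embedding(embeddings):
    """Element-wise mean over a list of equal-length embedding vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def residual_vector(well_lit_embs, backlit_embs):
    """Residual = mean(well-lit embeddings) - mean(backlit embeddings).

    During training this vector points from the backlit region of the
    embedding space toward the well-lit region, so adding it to a backlit
    image's embedding gives a target for the enhancement network.
    """
    mu_well = mean_embedding(well_lit_embs)
    mu_back = mean_embedding(backlit_embs)
    return [w - b for w, b in zip(mu_well, mu_back)]

# Toy 3-D example: two "well-lit" and two "backlit" embeddings.
well = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
back = [[0.1, 0.9, 0.0], [0.3, 0.7, 0.0]]
r = residual_vector(well, back)
```

Because the residual is a single fixed vector computed once from the training data, no per-iteration prompt optimization is needed, which is consistent with the reported reduction in training time.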
URL
https://arxiv.org/abs/2404.01889