
RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

2024-04-02 12:28:40
Tatiana Gaintseva, Martin Benning, Gregory Slabaugh

Abstract

In this paper, we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. The learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to tune their embeddings directly in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that lack a text counterpart. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from the training data, we compute a residual vector in the embedding space as the simple difference between the mean embeddings of the well-lit and backlit images. During training, this vector guides the enhancement network, pushing a backlit image towards the space of well-lit images. This approach further reduces training time dramatically, stabilizes training, and produces high-quality enhanced images without artifacts, in both supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in the training data and thereby enabling potential bias correction.
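
To make the residual-vector idea concrete, below is a minimal sketch of how such guidance could be computed with the open-source CLIP package. This is an illustration under assumptions, not the authors' released code: the placeholder batch tensors, the ViT-B/32 backbone choice, the (1 - cosine similarity) loss form, and the normalization steps are all guesses at a plausible setup; the paper's actual training objective may differ.

    # Sketch of residual-vector CLIP guidance (illustrative, not the authors' code).
    # Requires: pip install torch git+https://github.com/openai/CLIP.git
    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _preprocess = clip.load("ViT-B/32", device=device)

    def embed(images):
        # images: CLIP-preprocessed (N, 3, 224, 224) tensors. Gradients are
        # kept here so the loss can back-propagate into an enhancement network.
        feats = model.encode_image(images.to(device).type(model.dtype))
        return feats / feats.norm(dim=-1, keepdim=True)

    # Placeholder batches standing in for real preprocessed training images.
    backlit_batch = torch.randn(8, 3, 224, 224)
    welllit_batch = torch.randn(8, 3, 224, 224)

    # Residual vector: mean well-lit embedding minus mean backlit embedding,
    # computed once over the training data before enhancement training starts.
    with torch.no_grad():
        v = embed(welllit_batch).mean(dim=0) - embed(backlit_batch).mean(dim=0)
        v = v / v.norm()

    def guidance_loss(enhanced, backlit):
        # Target = backlit embedding shifted along the residual direction;
        # minimizing (1 - cosine similarity) pulls the enhanced output toward
        # the well-lit region of the CLIP embedding space.
        with torch.no_grad():
            target = embed(backlit) + v
            target = target / target.norm(dim=-1, keepdim=True)
        return (1.0 - (embed(enhanced) * target).sum(dim=-1)).mean()

Because the residual vector is computed once from mean embeddings, there is no prompt-tuning loop at all, which is consistent with the reduced training time the abstract reports.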


URL

https://arxiv.org/abs/2404.01889

PDF

https://arxiv.org/pdf/2404.01889.pdf

