Paper Reading AI Learner

CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment

2024-04-01 13:57:46
Hyeongmin Lee, Kyoungkook Kang, Jungseul Ok, Sunghyun Cho

Abstract

Recent image tone adjustment (or enhancement) approaches have predominantly adopted supervised learning to learn human-centric perceptual assessment. However, these approaches are constrained by the intrinsic challenges of supervised learning. First, the need for expertly curated or retouched images raises data acquisition costs. Moreover, their coverage of target styles is confined to the stylistic variants inferred from the training data. To surmount these challenges, we propose CLIPtone, an unsupervised learning-based approach to text-based image tone adjustment that extends an existing image enhancement method to accommodate natural language descriptions. Specifically, we design a hyper-network that adaptively modulates the pretrained parameters of the backbone model based on a text description. To assess whether an adjusted image aligns with its text description without a ground-truth image, we utilize CLIP, which is trained on a vast set of language-image pairs and thus encompasses knowledge of human perception. Our approach offers three major advantages: (i) minimal data collection expenses, (ii) support for a wide range of adjustments, and (iii) the ability to handle novel text descriptions unseen during training. Its efficacy is demonstrated through comprehensive experiments, including a user study.
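The abstract describes two components: a hyper-network that modulates the frozen, pretrained parameters of a backbone enhancement model from a CLIP text embedding, and a CLIP-based objective that replaces a ground-truth image. Below is a minimal PyTorch sketch of how such a design could look; the names (ToneHyperNetwork, clip_alignment_loss) and the specific multiplicative-modulation and directional-loss forms are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) a hyper-network mapping a CLIP text embedding to modulation factors
#     for the frozen, pretrained parameters of a backbone model;
# (2) a CLIP-space alignment loss that substitutes for a ground-truth image.
# All names and the exact modulation/loss forms are assumptions.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToneHyperNetwork(nn.Module):
    """Predicts one multiplicative scale tensor per backbone parameter,
    conditioned on a CLIP text embedding."""

    def __init__(self, text_dim, param_shapes):
        super().__init__()
        self.param_shapes = list(param_shapes)
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(text_dim, 256),
                nn.ReLU(),
                nn.Linear(256, math.prod(shape)),
            )
            for shape in self.param_shapes
        )

    def forward(self, text_emb):
        # Predict a residual around 1.0 so that, before training, the
        # modulated backbone behaves like the pretrained one.
        return [
            (1.0 + head(text_emb)).view(shape)
            for head, shape in zip(self.heads, self.param_shapes)
        ]


def clip_alignment_loss(src_img_emb, adj_img_emb, src_txt_emb, tgt_txt_emb):
    """Directional CLIP loss: the change in image embedding should point the
    same way as the source-to-target change in text embedding. This is a
    common unsupervised CLIP-guidance objective; whether CLIPtone uses
    exactly this form is an assumption."""
    d_img = F.normalize(adj_img_emb - src_img_emb, dim=-1)
    d_txt = F.normalize(tgt_txt_emb - src_txt_emb, dim=-1)
    return (1.0 - (d_img * d_txt).sum(dim=-1)).mean()


if __name__ == "__main__":
    # Toy wiring with random tensors standing in for CLIP embeddings
    # (in practice these come from CLIP's image/text encoders).
    backbone_params = [torch.randn(64, 3, 3, 3), torch.randn(64)]
    hyper = ToneHyperNetwork(512, [p.shape for p in backbone_params])
    scales = hyper(torch.randn(512))
    modulated = [p * s for p, s in zip(backbone_params, scales)]
    loss = clip_alignment_loss(
        torch.randn(4, 512), torch.randn(4, 512),
        torch.randn(4, 512), torch.randn(4, 512),
    )
    print(loss.item())
```

Under this kind of design, only the hyper-network would be optimized while the backbone and CLIP stay frozen, which is what lets training proceed without expertly retouched ground-truth images.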

URL

https://arxiv.org/abs/2404.01123

PDF

https://arxiv.org/pdf/2404.01123.pdf

