Abstract
Recent image tone adjustment (or enhancement) approaches have predominantly adopted supervised learning to learn human-centric perceptual assessment. However, these approaches are constrained by the intrinsic challenges of supervised learning. Primarily, the requirement for expertly curated or retouched images escalates data acquisition expenses. Moreover, their coverage of target styles is confined to the stylistic variants inferred from the training data. To surmount these challenges, we propose CLIPtone, an unsupervised learning-based approach for text-based image tone adjustment that extends an existing image enhancement method to accommodate natural language descriptions. Specifically, we design a hyper-network that adaptively modulates the pretrained parameters of the backbone model based on a text description. To assess whether the adjusted image aligns with the text description without a ground-truth image, we utilize CLIP, which is trained on a vast set of language-image pairs and thus encompasses knowledge of human perception. The major advantages of our approach are threefold: (i) minimal data collection expenses, (ii) support for a range of adjustments, and (iii) the ability to handle novel text descriptions unseen in training. Our approach's efficacy is demonstrated through comprehensive experiments, including a user study.
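The core mechanism described above, a hyper-network that conditions a frozen backbone's pretrained parameters on a text embedding, can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the embedding size, parameter count, the single linear hyper-network, and the residual-scale formulation are all assumptions chosen so that an uninformative (zero) text embedding leaves the backbone unchanged.

```python
# Hypothetical sketch: a hyper-network maps a text embedding (e.g. from
# CLIP's text encoder) to per-parameter modulation factors that scale the
# frozen backbone's pretrained weights. Names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 512   # CLIP ViT-B/32 text-embedding size
N_PARAMS = 64     # toy number of backbone parameters being modulated

# Frozen "pretrained" backbone parameters (toy stand-in).
backbone_params = rng.standard_normal(N_PARAMS)

# Hyper-network: a single linear layer, embedding -> modulation residual.
W = rng.standard_normal((N_PARAMS, EMBED_DIM)) * 0.01
b = np.zeros(N_PARAMS)

def modulate(text_embedding: np.ndarray) -> np.ndarray:
    """Return text-conditioned backbone parameters.

    The hyper-network predicts a residual scale around 1, so a zero
    embedding reproduces the pretrained parameters exactly.
    """
    scale = 1.0 + W @ text_embedding + b
    return backbone_params * scale

# Sanity check: an all-zero embedding leaves the backbone untouched,
# while a real embedding perturbs it toward the described tone.
assert np.allclose(modulate(np.zeros(EMBED_DIM)), backbone_params)
```

In CLIPtone's training setup as the abstract describes it, such a hyper-network would be optimized so that the adjusted image's CLIP image embedding moves in the direction of the text description, removing the need for ground-truth retouched images.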
URL
https://arxiv.org/abs/2404.01123