Abstract
In this paper, we compare different fine-tuning methods for Stable Diffusion XL, including the choice of inference steps and caption customization for each image, with the goal of generating images in the style of a commercial 2D icon training set. We also show how important it is to define precisely what "high-quality" means, especially in a commercial-use environment. As generative AI models gain widespread acceptance and usage, many different ways to optimize and evaluate them for various applications are emerging. Text-to-image models in particular, such as Stable Diffusion XL and DALL-E 3, require distinct evaluation practices to generate high-quality icons in a specific style. Although images generated in a given style may achieve a lower (better) FID score, we show that a lower FID is not conclusive on its own, even for rasterized icons. FID scores reflect the similarity of generated images to the overall training set, whereas CLIP scores measure the alignment between generated images and their textual descriptions. We show that FID scores miss significant aspects, such as the small minority of pixel differences that matter most in an icon, while CLIP scores misjudge the quality of icons. The CLIP model's notion of "similarity" is shaped by its own training data, which does not account for feature variation in our style of choice. Our findings highlight the need for specialized evaluation metrics and fine-tuning approaches when generating high-quality commercial icons, potentially leading to more effective and tailored applications of text-to-image models in professional design contexts.
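As a rough illustration of the two metrics discussed in the abstract, the sketch below computes a Fréchet distance between two sets of image features and a CLIP-style score from one image embedding and one text embedding. It is not the authors' evaluation code. It assumes the features have already been extracted elsewhere (FID conventionally uses Inception-v3 features, and a CLIP score uses CLIP image and text embeddings). The function names `frechet_distance` and `clip_style_score`, and the default scale of 100, are illustrative assumptions.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_images, feature_dim). Lower means the two
    feature distributions are closer in mean and covariance.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def clip_style_score(image_emb: np.ndarray, text_emb: np.ndarray,
                     scale: float = 100.0) -> float:
    """Scaled cosine similarity between one image and one text embedding.

    Negative similarities are clipped to zero. The scale is an assumption;
    implementations differ in the constant they use.
    """
    cos = float(image_emb @ text_emb
                / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return scale * max(cos, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in features; real use would pass extracted network features.
    train_feats = rng.normal(size=(200, 4))
    generated_feats = rng.normal(loc=0.5, size=(200, 4))
    print("Frechet distance:", frechet_distance(train_feats, generated_feats))
    print("CLIP-style score:", clip_style_score(rng.normal(size=8), rng.normal(size=8)))
```

Both quantities are summary statistics: the Fréchet distance compares only the mean and covariance of feature distributions, and the CLIP-style score depends entirely on the embedding model. This is consistent with the abstract's point that neither directly captures the small pixel-level details that matter in an icon.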
URL
https://arxiv.org/abs/2407.08513