Abstract
Text-to-Image (T2I) models have demonstrated strong performance in generating images from textual prompts. However, these models are vulnerable to unsafe inputs that elicit unsafe content, such as sexual, harassment, and illegal-activity images. Existing defenses based on image checkers, model fine-tuning, and embedding blocking are often impractical in real-world applications. Hence, \textit{we propose the first universal prompt optimizer for safe T2I generation in the black-box scenario}. We first construct a dataset of toxic-clean prompt pairs using GPT-3.5 Turbo. To guide the optimizer to convert toxic prompts into clean ones while preserving semantic information, we design a novel reward function that measures the toxicity and text alignment of generated images, and we train the optimizer with Proximal Policy Optimization. Experiments show that our approach effectively reduces the likelihood that various T2I models generate inappropriate images, with no significant impact on text alignment. It can also be flexibly combined with existing methods to achieve better performance.
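The reward design described above can be illustrated with a minimal sketch: the optimizer is rewarded when the generated image stays aligned with the original prompt and penalized when the image is toxic. The scoring functions below are hypothetical stand-ins (in the real system these would be a trained toxicity classifier and, e.g., a CLIP-style similarity model), and the weighting parameter `lam` is an assumption, not the paper's actual formulation.

```python
def toxicity_score(image: dict) -> float:
    """Stub: probability in [0, 1] that the generated image is unsafe.
    A real system would use a trained image-toxicity classifier."""
    return image["toxicity"]

def alignment_score(image: dict, prompt: str) -> float:
    """Stub: similarity in [0, 1] between the image and the original
    prompt (e.g., a CLIP cosine similarity in a real system)."""
    return image["alignment"]

def reward(image: dict, original_prompt: str, lam: float = 1.0) -> float:
    """Composite reward for updating the prompt optimizer (e.g., via PPO):
    high text alignment is rewarded, high toxicity is penalized."""
    return alignment_score(image, original_prompt) - lam * toxicity_score(image)

# A safe, on-topic image should earn a higher reward than a toxic one.
safe = {"toxicity": 0.05, "alignment": 0.9}
toxic = {"toxicity": 0.8, "alignment": 0.9}
print(reward(safe, "a cat"), reward(toxic, "a cat"))
```

With this shape of reward, a policy-gradient method such as PPO pushes the optimizer toward rewrites that keep semantics while lowering the chance of unsafe generations.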
URL
https://arxiv.org/abs/2402.10882