UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images

Abstract

Image safety classifiers play an important role in identifying and mitigating the spread of unsafe images online (e.g., images including violence, hateful rhetoric, etc.). At the same time, with the advent of text-to-image models and increasing concerns about the safety of AI models, developers are increasingly relying on image safety classifiers to safeguard their models. Yet, the performance of current image safety classifiers remains unknown for real-world and AI-generated images. To bridge this research gap, in this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe based on a set of 11 unsafe categories of images (sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers that are powered by general-purpose visual language models. Our assessment indicates that existing image safety classifiers are not comprehensive and effective enough in mitigating the multifaceted problem of unsafe images. Also, we find that classifiers trained only on real-world images tend to have degraded performance when applied to AI-generated images. Motivated by these findings, we design and implement a comprehensive image moderation tool called PerspectiveVision, which effectively identifies 11 categories of real-world and AI-generated unsafe images. The best PerspectiveVision model achieves an overall F1-Score of 0.810 on six evaluation datasets, which is comparable with closed-source and expensive state-of-the-art models like GPT-4V. UnsafeBench and PerspectiveVision can aid the research community in better understanding the landscape of image safety classification in the era of generative AI.
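The abstract reports an F1-Score and contrasts classifier performance on real-world and AI-generated images. As a minimal sketch of how such a comparison could be computed, the Python below scores binary safe/unsafe predictions separately per image source. The function names, the `(source, label, prediction)` record layout, and the toy data are all illustrative assumptions; they are not taken from UnsafeBench or PerspectiveVision, and the numbers are not results from the paper.

```python
from collections import defaultdict


def f1_score(labels, preds):
    """F1 for the positive class, where 1 = unsafe and 0 = safe."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_source(records):
    """Group (source, label, pred) records by source and score each group."""
    groups = defaultdict(lambda: ([], []))
    for source, label, pred in records:
        groups[source][0].append(label)
        groups[source][1].append(pred)
    return {src: f1_score(ys, ps) for src, (ys, ps) in groups.items()}


# Hypothetical toy predictions, for illustration only.
records = [
    ("real", 1, 1), ("real", 1, 1), ("real", 0, 0), ("real", 0, 1),
    ("ai", 1, 1), ("ai", 1, 0), ("ai", 0, 0), ("ai", 0, 0),
]
scores = f1_by_source(records)
print(scores)
```

Splitting the score by source is what makes a gap between real-world and AI-generated images visible; a single pooled F1 would hide it.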

URL

https://arxiv.org/abs/2405.03486

PDF

https://arxiv.org/pdf/2405.03486.pdf
