
Towards Equitable Representation in Text-to-Image Synthesis Models with the Cross-Cultural Understanding Benchmark Dataset

2023-01-28 03:10:33
Zhixuan Liu, Youeun Shin, Beverley-Claire Okogwu, Youngsik Yun, Lia Coleman, Peter Schaldenbrand, Jihie Kim, Jean Oh

Abstract

It has been shown that accurate representation in media improves the well-being of the people who consume it. By contrast, inaccurate representations can negatively affect viewers and lead to harmful perceptions of other cultures. To achieve inclusive representation in generated images, we propose a culturally-aware priming approach for text-to-image synthesis that uses a small but culturally curated dataset we collected, known here as the Cross-Cultural Understanding Benchmark (CCUB) dataset, to counter the bias prevalent in giant datasets. Our proposed approach comprises two fine-tuning techniques: (1) adding visual context by fine-tuning a pre-trained text-to-image synthesis model, Stable Diffusion, on the CCUB text-image pairs, and (2) adding semantic context via automated prompt engineering with a large language model, GPT-3, fine-tuned on our CCUB culturally-aware text data. The CCUB dataset is curated, and our approach is evaluated, by people who have a personal relationship with the culture in question. Our experiments indicate that priming with both text and images is effective at improving the cultural relevance and decreasing the offensiveness of generated images while maintaining image quality.
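
The two fine-tuning techniques compose naturally into a generation pipeline: the fine-tuned GPT-3 first rewrites a plain prompt with culturally specific detail (semantic priming), and the CCUB-fine-tuned Stable Diffusion checkpoint then renders the rewritten prompt (visual priming). Below is a minimal Python sketch of that flow, not the authors' released code: the model identifiers "ccub-finetuned-sd" and "davinci:ft-ccub" and the rewrite template are hypothetical placeholders, and the sketch assumes Hugging Face diffusers plus the legacy openai-python (pre-1.0) Completion API used for GPT-3 fine-tunes at the time.

import os

import openai
import torch
from diffusers import StableDiffusionPipeline

openai.api_key = os.environ["OPENAI_API_KEY"]

def add_semantic_context(prompt: str, culture: str) -> str:
    """Semantic priming: ask the CCUB-fine-tuned GPT-3 model to rewrite
    a plain prompt with culturally specific detail."""
    response = openai.Completion.create(
        model="davinci:ft-ccub",  # hypothetical fine-tuned model id
        prompt=f"Add {culture} cultural context to this image caption: {prompt}\n",
        max_tokens=64,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()

# Visual priming: load a Stable Diffusion checkpoint fine-tuned on the
# CCUB text-image pairs (the checkpoint path is a placeholder).
pipe = StableDiffusionPipeline.from_pretrained(
    "ccub-finetuned-sd", torch_dtype=torch.float16
).to("cuda")

primed_prompt = add_semantic_context("a family sharing a meal", "Korean")
image = pipe(primed_prompt).images[0]
image.save("primed_output.png")

The point of the design, per the abstract, is that both priming signals come from the same small curated CCUB data, so the prompt rewriting and the image model steer generation in the same culturally grounded direction.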

URL

https://arxiv.org/abs/2301.12073

PDF

https://arxiv.org/pdf/2301.12073.pdf

