Paper Reading AI Learner

Are Models Biased on Text without Gender-related Language?

2024-05-01 15:51:15
Catarina G Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh

Abstract

Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations present in the training data. In this paper, we focus on bias where the effect of training data is unclear, and instead address the question: do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine whether a sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we use USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (WinoBias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not stem solely from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at this https URL.
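The abstract does not spell out how the sentence-level score is computed. A minimal sketch of the idea, assuming a PMI-style log-ratio of each word's co-occurrence with male vs. female terms in the pretraining corpus (all counts, function names, and the threshold `tau` below are illustrative, not the paper's actual definitions):

```python
import math

# Toy co-occurrence counts: word -> (count near male terms, count near female terms).
# In practice these would be gathered from pretraining data statistics.
cooc = {
    "doctor": (900, 300),
    "walked": (500, 480),
    "home": (400, 420),
}

def gender_association(word):
    """Log-ratio of male vs. female co-occurrence, with add-one smoothing."""
    m, f = cooc[word]
    return math.log((m + 1) / (f + 1))

def sentence_score(words):
    """Sentence-level score: the strongest word-gender association in the sentence."""
    return max(abs(gender_association(w)) for w in words)

def is_stereotype_free(words, tau=0.1):
    """A sentence qualifies as stereotype-free if no word is strongly gender-associated."""
    return sentence_score(words) <= tau
```

Under this sketch, a sentence built only from near-neutral words ("walked", "home") passes the filter, while one containing "doctor" (heavily male-skewed in the toy counts) does not; filtering benchmark sentences this way is what isolates bias that cannot be explained by gendered vocabulary.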

URL

https://arxiv.org/abs/2405.00588

PDF

https://arxiv.org/pdf/2405.00588.pdf

