Abstract
Natural language generation tools are powerful and effective for generating content. However, language models are known to exhibit bias and fairness issues, making them impractical to deploy for many use cases. Here we focus on how fairness issues affect automatically generated test content, which is subject to stringent requirements to ensure the test measures only what it was intended to measure. Specifically, we identify test content that centers on domains and experiences specific to a particular demographic, or that is potentially emotionally upsetting; either could inadvertently affect a test-taker's score. This kind of content does not reflect typical biases out of context, making it challenging even for modern models that contain safeguards. We build a dataset of 621 generated texts annotated for fairness and explore a variety of classification methods: fine-tuning, topic-based classification, and prompting, including few-shot and self-correcting prompts. We find that combining prompt self-correction and few-shot learning performs best, yielding an F1 score of 0.791 on our held-out test set, while much smaller BERT- and topic-based models achieve competitive performance on out-of-domain data.
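The best-performing approach combines few-shot prompting with a self-correction step. A minimal sketch of that scheme is below; the function names, the example items, the prompt wording, and the injected `model` callable are all illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of few-shot classification with a self-correction pass, as described
# in the abstract. `model` is any callable mapping a prompt string to a
# completion string (e.g., a wrapper around an LLM API); it is injected so the
# logic stays testable. All prompts and examples here are hypothetical.

# Hypothetical few-shot examples: (test item, fairness label).
FEW_SHOT_EXAMPLES = [
    ("Describe your family's most recent ski vacation.", "unfair"),
    ("Explain how photosynthesis converts light into energy.", "fair"),
]

def build_prompt(item, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt asking for a fair/unfair label."""
    lines = ["Label each test item 'fair' or 'unfair' for use on a standardized test."]
    for text, label in examples:
        lines.append(f"Item: {text}\nLabel: {label}")
    lines.append(f"Item: {item}\nLabel:")
    return "\n\n".join(lines)

def classify_with_self_correction(item, model):
    """Get an initial label, then prompt the model to reconsider it.

    The second call appends the first answer and asks the model to check
    whether the item assumes demographic-specific experiences or could be
    emotionally upsetting, returning the (possibly revised) final label.
    """
    first = model(build_prompt(item)).strip().lower()
    review_prompt = (
        build_prompt(item) + f" {first}\n\n"
        "Are you sure? Reconsider whether the item assumes experiences "
        "specific to one demographic or could be emotionally upsetting. "
        "Answer 'fair' or 'unfair' only.\nLabel:"
    )
    return model(review_prompt).strip().lower()
```

In this sketch the self-correction step simply re-prompts the same model with its own first answer; the paper's exact prompt phrasing and correction criteria are not specified in the abstract.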
URL
https://arxiv.org/abs/2404.15104