Abstract
Large language models (LLMs) have demonstrated significant capability in code generation, drawing increasing attention to the evaluation of the quality and safety of their outputs. However, research on bias in code generation remains limited. Existing studies typically assess bias by applying malicious prompts or reapply tasks and dataset for discriminative models. Given that LLMs are often aligned with human values and that prior datasets are not fully optimized for code-related tasks, there is a pressing need for benchmarks specifically designed for evaluating code models. In this study, we introduce FairCode, a novel benchmark for evaluating bias in code generation. FairCode comprises two tasks: function implementation and test case generation, each evaluating social bias through diverse scenarios. Additionally, we propose a new metric, FairScore, to assess model performance on this benchmark. We conduct experiments on widely used LLMs and provide a comprehensive analysis of the results. The findings reveal that all tested LLMs exhibit bias. The code is available at this https URL.
Abstract (translated)
大型语言模型(LLM)在代码生成方面表现出显著的能力,这引起了对其输出质量和安全性的评估的广泛关注。然而,关于代码生成中的偏见的研究仍然有限。现有的研究通常通过应用恶意提示或重新使用用于区分模型的任务和数据集来评估偏见。鉴于LLM通常与人类价值观相一致,并且先前的数据集并未完全优化用于代码相关任务,因此迫切需要专门设计以评估代码模型的基准。在这项研究中,我们介绍了FairCode,这是一个新的基准测试工具,旨在评估代码生成中的偏见。FairCode包括两个任务:函数实现和测试用例生成,每个任务通过多种场景来评估社会偏见。此外,我们提出了一种新指标——FairScore,用于评估模型在该基准上的性能。我们在广泛使用的LLM上进行了实验,并提供了全面的结果分析。研究发现表明,所有测试的LLM都表现出某种程度的偏见。代码可在[此链接](https://example.com)获取。(请将示例链接替换为实际提供的链接)
URL
https://arxiv.org/abs/2501.05396