Abstract
Large Language Models (LLMs) embed complex biases and stereotypes that can lead to detrimental user experiences and societal consequences, often without the models themselves being aware of them. This paper emphasizes the importance of equipping LLMs with mechanisms for better self-reflection and bias recognition. Our experiments demonstrate that informing LLMs that their generated content does not represent their own views, and then questioning them about bias, improves their ability to identify and address biases. We attribute this improvement to the internal attention mechanisms and potential internal sensitivity policies of LLMs. Building on these findings, we propose a novel method to reduce bias in LLM outputs. The method engages LLMs in multi-role scenarios in which they act as different roles tasked with exposing bias, with an impartial referee concluding each round of debate. A ranking-based scoring mechanism quantifies bias levels, enabling more refined reflection and higher-quality output. Comparative experimental results confirm that our method outperforms existing approaches in reducing bias, making it a valuable contribution to efforts toward more ethical AI systems.
URL
https://arxiv.org/abs/2404.10160