Abstract
We study attribute control in language models through the lens of the Causal Average Treatment Effect (Causal ATE). Existing methods for the attribute control task in Language Models (LMs) check for the co-occurrence of words in a sentence with the attribute of interest and control for those words. However, spurious correlations between words and the attribute in the training dataset can cause models to hallucinate the presence of the attribute when presented with the spurious correlate at inference time. We show that the simple perturbation-based method of Causal ATE removes this unintended effect. Additionally, we offer a theoretical foundation for investigating Causal ATE in the classification task and prove that it reduces the number of false positives, thereby mitigating the issue of unintended bias. Specifically, we ground it in the problem of toxicity mitigation, where a significant challenge is the inadvertent bias against protected groups that often emerges after detoxification. We show that this unintended bias can be solved by using the Causal ATE metric.
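As a rough illustration of the perturbation-based idea described in the abstract (not the paper's exact estimator), the sketch below scores a word by how much perturbing it changes an attribute classifier's output, averaged over the sentences in which the word appears. The `attribute_prob` callable (e.g., a toxicity classifier's probability) and the choice of masking as the perturbation are assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def ate_score(
    word: str,
    sentences: Iterable[List[str]],
    attribute_prob: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> float:
    """Perturbation-based treatment-effect score of `word` on an attribute.

    For every sentence containing `word`, compare the classifier's attribute
    probability with the word present vs. with the word masked out, and
    average the differences across those sentences.
    """
    effects = []
    for tokens in sentences:
        if word not in tokens:
            continue
        original = attribute_prob(tokens)                       # attribute score with the word
        masked = [mask_token if t == word else t for t in tokens]
        perturbed = attribute_prob(masked)                      # attribute score without the word
        effects.append(original - perturbed)
    if not effects:
        raise ValueError(f"'{word}' does not occur in the provided sentences")
    return sum(effects) / len(effects)
```

Under this kind of scoring, a word that merely co-occurs with the attribute (for example, a mention of a protected group in toxic training sentences) should receive a score near zero, since masking it barely changes the classifier's output, whereas a word that actually drives the attribute receives a large positive score. This is the intuition behind the abstract's claim that Causal ATE reduces false positives.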
URL
https://arxiv.org/abs/2311.11229