Abstract
As NLP models become more complex, understanding their decisions becomes more crucial. Counterfactuals (CFs), in which minimal changes to inputs flip a model's prediction, offer a way to explain these models. While Large Language Models (LLMs) have shown remarkable performance in NLP tasks, their efficacy in generating high-quality CFs remains uncertain. This work fills this gap by investigating how well LLMs generate CFs for two NLU tasks. We conduct a comprehensive comparison of several common LLMs and evaluate their CFs, assessing both intrinsic metrics and the impact of these CFs on data augmentation. Moreover, we analyze differences between human- and LLM-generated CFs, providing insights for future research directions. Our results show that LLMs generate fluent CFs but struggle to keep the induced changes minimal. Generating CFs for Sentiment Analysis (SA) is less challenging than for NLI, where LLMs show weaknesses in generating CFs that flip the original label. This is also reflected in data augmentation performance, where we observe a large gap between augmenting with human-written and LLM-generated CFs. Furthermore, we evaluate LLMs' ability to assess CFs in a mislabelled-data setting and show that they have a strong bias towards agreeing with the provided labels. GPT-4 is more robust against this bias, and its scores correlate well with automatic metrics. Our findings reveal several limitations and point to potential future work directions.
URL
https://arxiv.org/abs/2405.00722