Abstract
Regret in Large Language Models (LLMs) refers to a model's explicit expression of regret when it is presented with evidence that contradicts misinformation it previously generated. Studying the regret mechanism is crucial for improving model reliability and helps reveal how cognition is encoded in neural networks. To understand this mechanism, we must first identify regret expressions in model outputs and then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this approach faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics for finding the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. To address these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experiments identify the optimal regret representation layer using the S-CDI metric, which significantly improves performance in probe classification experiments. Additionally, we discover an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Using the RDS metric, we categorize neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.
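The abstract mentions probe classification over hidden states but does not define S-CDI, RDS, or GIC, so none of them is reproduced here. As a rough illustration of layer-wise probing only, the sketch below trains a simple logistic-regression probe on each layer and picks the layer with the highest held-out accuracy. Everything in it is an assumption made for illustration: the hidden states are synthetic, the per-layer signal strengths are invented, and the function names (`make_synthetic_hidden_states`, `train_probe`, `layerwise_probe`) are hypothetical.

```python
import numpy as np


def make_synthetic_hidden_states(strengths, n_samples=1000, dim=32, seed=0):
    """Build fake per-layer hidden states (NOT real model activations).

    Each layer is Gaussian noise plus a label-dependent shift along one
    fixed direction; `strengths` sets how large that shift is per layer.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)  # 1 = regret, 0 = non-regret
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    signs = 2 * labels - 1  # map {0, 1} to {-1, +1}
    layers = [
        rng.normal(size=(n_samples, dim)) + s * np.outer(signs, direction)
        for s in strengths
    ]
    return np.stack(layers), labels  # shape: (n_layers, n_samples, dim)


def train_probe(x, y, lr=0.1, steps=300):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        err = p - y
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def layerwise_probe(states, labels, train_frac=0.75):
    """Return held-out probe accuracy for every layer."""
    split = int(len(labels) * train_frac)
    accuracies = []
    for layer in states:
        w, b = train_probe(layer[:split], labels[:split])
        pred = (layer[split:] @ w + b) > 0
        accuracies.append(float((pred == labels[split:]).mean()))
    return accuracies


# Assumed signal profile: strongest at layer index 2, purely for illustration.
strengths = [0.0, 0.3, 2.0, 0.8, 0.4, 0.1]
states, labels = make_synthetic_hidden_states(strengths)
accs = layerwise_probe(states, labels)
best_layer = int(np.argmax(accs))
print("probe accuracy per layer:", [round(a, 3) for a in accs])
print("best layer by probe accuracy:", best_layer)
```

With real models the `states` array would come from the model's per-layer hidden states on regret and non-regret samples. Selecting a layer by probe accuracy is a common baseline, not the S-CDI criterion the authors propose.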
Abstract (translated)
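In large language models, the regret mechanism refers to the explicit expression of regret or correction that a model produces when misinformation it generated is refuted by evidence. Studying this regret mechanism is crucial for improving model reliability and helps reveal how cognition is encoded in neural networks. To understand the mechanism, we first need to identify regret expressions in model outputs and then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing takes place at the neuron level. However, this faces three major challenges: 1. the lack of datasets that specifically capture regret expressions; 2. the lack of metrics for finding the optimal regret representation layer; 3. the lack of standard metrics for identifying and analyzing regret neurons. To address these limitations, we propose the following: 1. A workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios. 2. The Supervised Compression-Decoupling Index (S-CDI) metric for locating the optimal regret representation layer. 3. The Regret Dominance Score (RDS) metric for identifying regret neurons, together with the Group Impact Coefficient (GIC) for analyzing activation patterns. Our experiments successfully identified the optimal regret representation layer using the S-CDI metric, which significantly improved performance in probe classification experiments. In addition, we found an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Using the RDS metric, we divided neurons into three functional groups: regret neurons, non-regret neurons, and dual neurons. This approach not only deepens our understanding of the inner workings of large language models but also offers a new route to improving the performance of these systems.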
URL
https://arxiv.org/abs/2506.15617