Abstract
Despite the success of large language models (LLMs) in natural language generation, much evidence shows that LLMs may produce incorrect or nonsensical text. This limitation highlights the importance of discerning when to trust LLMs, especially in safety-critical domains. Existing methods, which elicit verbalized confidence as a reliability signal by inducing top-k responses or by sampling and aggregating multiple responses, often fail because the confidence lacks objective guidance. To address this, we propose a CONfidence-Quality-ORDer-preserving alignment approach (CONQORD) that leverages reinforcement learning with a tailored dual-component reward function, comprising a quality reward and an order-preserving alignment reward. Specifically, the order-preserving reward incentivizes the model to verbalize greater confidence for responses of higher quality, aligning the ordering of confidence with the ordering of quality. Experiments demonstrate that CONQORD significantly improves the alignment between confidence levels and response accuracy, without causing the model to become over-cautious. Furthermore, the aligned confidence provided by CONQORD indicates when to trust LLMs and can act as a trigger for retrieving external knowledge. Aligning confidence with response quality yields more transparent and reliable responses, providing better trustworthiness.
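To make the dual-component reward concrete, below is a minimal Python sketch of one way a quality term and a pairwise order-preserving term could be combined. The sign-agreement formulation, the function names, and the weight alpha are illustrative assumptions for exposition, not the paper's exact definitions.

```python
import itertools

def order_preserving_reward(confidences, qualities):
    """Pairwise order-preserving term (hypothetical formulation):
    rewards pairs whose verbalized-confidence ordering matches their
    quality ordering, and penalizes inversions."""
    pairs = list(itertools.combinations(range(len(confidences)), 2))
    if not pairs:
        return 0.0
    score = 0.0
    for i, j in pairs:
        # Positive when confidence and quality rank the pair the same way.
        agreement = (confidences[i] - confidences[j]) * (qualities[i] - qualities[j])
        score += 1.0 if agreement > 0 else (-1.0 if agreement < 0 else 0.0)
    return score / len(pairs)

def total_reward(confidences, qualities, alpha=0.5):
    """Dual-component reward: mean quality plus a weighted
    order-preserving alignment term (alpha is an assumed weight)."""
    quality_term = sum(qualities) / len(qualities)
    return quality_term + alpha * order_preserving_reward(confidences, qualities)

# Example: the confidence ranking matches the quality ranking,
# so the alignment term is maximal and the total reward is high.
print(total_reward(confidences=[0.9, 0.6, 0.2], qualities=[1.0, 0.7, 0.1]))
```

Under this sketch, a policy trained with reinforcement learning would be pushed toward verbalizing higher confidence exactly on its higher-quality responses, which is the order-preserving behavior the abstract describes.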
URL
https://arxiv.org/abs/2404.17287