Paper Reading AI Learner

Harmonic LLMs are Trustworthy

2024-04-30 17:00:32
Nicholas S. Kersting, Mohammad Rahman, Suchismitha Vedala, Yang Wang

Abstract

We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real time, based upon the local deviation from harmonicity, denoted $\gamma$. To the best of our knowledge, this is the first completely model-agnostic and unsupervised method of measuring the robustness of any given response from an LLM, based upon the model itself conforming to a purely mathematical standard. We conduct human-annotation experiments to show the positive correlation of $\gamma$ with false or misleading answers, and demonstrate that following the gradient of $\gamma$ in stochastic gradient ascent efficiently exposes adversarial prompts. Measuring $\gamma$ across thousands of queries in popular LLMs (GPT-4, ChatGPT, Claude-2.1, Mixtral-8x7B, Smaug-72B, Llama2-7B, and MPT-7B) allows us to estimate the likelihood of wrong or hallucinatory answers automatically and to quantitatively rank the reliability of these models in various objective domains (Web QA, TruthfulQA, and Programming QA). Across all models and domains tested, human ratings confirm that $\gamma \to 0$ indicates trustworthiness, and the low-$\gamma$ leaders among these models are GPT-4, ChatGPT, and Smaug-72B.
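The $\gamma$ described in the abstract builds on the mean-value property of harmonic functions: a harmonic function's value at a point equals its average over any small sphere centred there, so the gap between the two measures local deviation from harmonicity. Below is a minimal numeric sketch of that idea on toy scalar functions; the radius, sample count, and test functions are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np

def gamma(f, x, r=0.1, n=256, seed=0):
    """Local deviation from harmonicity at x: |f(x) - sphere average of f|.
    By the mean-value property this is ~0 when f is harmonic near x."""
    rng = np.random.default_rng(seed)
    # sample directions uniformly on the unit sphere, in antithetic pairs
    # (u, -u) so the linear term of f's Taylor expansion cancels exactly
    half = rng.normal(size=(n // 2, len(x)))
    half /= np.linalg.norm(half, axis=1, keepdims=True)
    dirs = np.concatenate([half, -half])
    sphere_avg = np.mean([f(x + r * u) for u in dirs])
    return abs(f(x) - sphere_avg)

harmonic = lambda p: p[0] ** 2 - p[1] ** 2       # Laplacian = 0
non_harmonic = lambda p: p[0] ** 2 + p[1] ** 2   # Laplacian = 4

x0 = np.array([0.5, 0.3])
g_harm = gamma(harmonic, x0)      # near zero (sampling noise only)
g_non = gamma(non_harmonic, x0)   # about r**2 = 0.01
```

For an LLM one would replace the numeric point with a prompt and $f$ with an embedding of the model's response, perturbing the prompt instead of coordinates; following the gradient of $\gamma$ upward, as the abstract describes, then steers perturbations toward unstable (adversarial) prompts.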

URL

https://arxiv.org/abs/2404.19708

PDF

https://arxiv.org/pdf/2404.19708.pdf
