Paper Reading AI Learner

Harmonic LLMs are Trustworthy

2024-04-30 17:00:32
Nicholas S. Kersting, Mohammad Rahman, Suchismitha Vedala, Yang Wang

Abstract

We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real time, based upon the local deviation from harmonicity, denoted $\gamma$. To the best of our knowledge, this is the first completely model-agnostic and unsupervised method of measuring the robustness of any given response from an LLM, based upon the model itself conforming to a purely mathematical standard. We conduct human-annotation experiments to show the positive correlation of $\gamma$ with false or misleading answers, and demonstrate that following the gradient of $\gamma$ in stochastic gradient ascent efficiently exposes adversarial prompts. Measuring $\gamma$ across thousands of queries in popular LLMs (GPT-4, ChatGPT, Claude-2.1, Mixtral-8x7B, Smaug-72B, Llama2-7B, and MPT-7B) allows us to estimate the likelihood of wrong or hallucinatory answers automatically and to quantitatively rank the reliability of these models in various objective domains (Web QA, TruthfulQA, and Programming QA). Across all models and domains tested, human ratings confirm that $\gamma \to 0$ indicates trustworthiness, and the low-$\gamma$ leaders among these models are GPT-4, ChatGPT, and Smaug-72B.
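The $\gamma$ described in the abstract builds on the mean-value property of harmonic functions: a harmonic function's value at a point equals its average over any small sphere centred there, so the gap between the two measures local deviation from harmonicity. Below is a minimal numeric sketch of that idea on toy scalar functions; the radius, sample count, and test functions are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np

def gamma(f, x, r=0.1, n=256, seed=0):
    """Local deviation from harmonicity at x: |f(x) - sphere average of f|.
    By the mean-value property this is ~0 when f is harmonic near x."""
    rng = np.random.default_rng(seed)
    # sample directions uniformly on the unit sphere, in antithetic pairs
    # (u, -u) so the linear term of f's Taylor expansion cancels exactly
    half = rng.normal(size=(n // 2, len(x)))
    half /= np.linalg.norm(half, axis=1, keepdims=True)
    dirs = np.concatenate([half, -half])
    sphere_avg = np.mean([f(x + r * u) for u in dirs])
    return abs(f(x) - sphere_avg)

harmonic = lambda p: p[0] ** 2 - p[1] ** 2       # Laplacian = 0
non_harmonic = lambda p: p[0] ** 2 + p[1] ** 2   # Laplacian = 4

x0 = np.array([0.5, 0.3])
g_harm = gamma(harmonic, x0)      # near zero (sampling noise only)
g_non = gamma(non_harmonic, x0)   # about r**2 = 0.01
```

For an LLM one would replace the numeric point with a prompt and $f$ with an embedding of the model's response, perturbing the prompt instead of coordinates; following the gradient of $\gamma$ upward, as the abstract describes, then steers perturbations toward unstable (adversarial) prompts.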

URL

https://arxiv.org/abs/2404.19708

PDF

https://arxiv.org/pdf/2404.19708.pdf
