Paper Reading AI Learner

Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees

2024-05-16 17:55:24
Yu Gui, Ying Jin, Zhimei Ren

Abstract

Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
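The selection step described above can be sketched in code. The sketch below is a simplified, hypothetical rendering (not the paper's exact algorithm): it assumes the alignment predictor reduces each unit to a scalar score, forms conformal p-values against the misaligned reference units, and uses the Benjamini-Hochberg step-up rule as the data-dependent threshold. Function and variable names are illustrative.

```python
import numpy as np

def conformal_alignment_select(cal_scores, cal_aligned, test_scores, alpha=0.1):
    """Simplified sketch of a Conformal Alignment-style selection rule.

    cal_scores  : predicted alignment scores for reference units
    cal_aligned : ground-truth alignment status (1 = aligned) for reference units
    test_scores : predicted alignment scores for new units
    alpha       : target error level for the selected set

    Returns indices of test units certified as trustworthy.
    """
    # Calibration scores from reference units that FAIL the alignment criterion
    null_scores = cal_scores[cal_aligned == 0]
    n0 = len(null_scores)

    # Conformal p-value: fraction of misaligned reference units scoring at
    # least as high as the test unit (small p-value = strong evidence of
    # alignment, since misaligned units rarely score this high)
    pvals = np.array(
        [(1 + np.sum(null_scores >= s)) / (n0 + 1) for s in test_scores]
    )

    # Benjamini-Hochberg step-up procedure supplies the data-dependent cutoff:
    # select the largest k with p_(k) <= alpha * k / m
    m = len(pvals)
    order = np.argsort(pvals)
    below = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    if len(below) == 0:
        return np.array([], dtype=int)  # nothing can be certified
    k = below[-1]
    return np.sort(order[: k + 1])
```

In this sketch the cutoff adapts to the data: the more test units show scores clearly above the misaligned reference distribution, the more units clear the Benjamini-Hochberg threshold and get certified.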

URL

https://arxiv.org/abs/2405.10301

PDF

https://arxiv.org/pdf/2405.10301.pdf
