Abstract
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
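To make the selection step concrete, below is a minimal sketch of the kind of procedure the abstract describes: fit an alignment predictor on reference data with ground-truth alignment labels, then select new units whose predicted alignment scores clear a data-dependent threshold. The specifics here are assumptions for illustration, not the paper's exact algorithm: a scikit-learn `LogisticRegression` stands in for the alignment predictor, and conformal p-values combined with a Benjamini-Hochberg step stand in for the data-dependent threshold; all variable and function names are hypothetical.

```python
# Hedged sketch of a Conformal Alignment-style selection step.
# Assumes: logistic regression as the alignment predictor, and
# conformal p-values + Benjamini-Hochberg as one standard way to
# obtain a data-dependent threshold with an FDR-type guarantee.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Reference data: per-unit features of the model-generated output
# (e.g. confidence or self-evaluation scores) and a ground-truth
# alignment label (1 = output aligned, 0 = not aligned).
X_ref = rng.normal(size=(500, 5))
y_ref = (X_ref[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Split the reference data: one fold trains the predictor,
# the other calibrates the selection threshold.
X_tr, y_tr = X_ref[:250], y_ref[:250]
X_cal, y_cal = X_ref[250:], y_ref[250:]

predictor = LogisticRegression().fit(X_tr, y_tr)  # alignment predictor

def conformal_pvalues(scores_test, scores_cal, y_cal):
    """Conformal p-value for 'this unit is NOT aligned', computed against
    calibration units whose outputs are known to be unaligned."""
    null_scores = scores_cal[y_cal == 0]
    n0 = len(null_scores)
    return np.array(
        [(1 + np.sum(null_scores >= s)) / (n0 + 1) for s in scores_test]
    )

# New units with model-generated outputs but no alignment labels.
X_test = rng.normal(size=(100, 5))
s_cal = predictor.predict_proba(X_cal)[:, 1]
s_test = predictor.predict_proba(X_test)[:, 1]
pvals = conformal_pvalues(s_test, s_cal, y_cal)

# Benjamini-Hochberg step: keep units whose p-values fall below the
# data-dependent cutoff, targeting FDR <= alpha among selected units.
alpha = 0.1
order = np.argsort(pvals)
m = len(pvals)
passed = pvals[order] <= alpha * (np.arange(1, m + 1) / m)
k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
selected = order[:k]  # indices of units certified as trustworthy
print(f"Selected {len(selected)} of {m} new units.")
```

In this sketch, a larger predicted alignment score yields a smaller conformal p-value, so units the predictor is most confident about are the ones selected; the guarantee the paper describes concerns the fraction of selected units that truly meet the alignment criterion, on average.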
URL
https://arxiv.org/abs/2405.10301