Abstract
As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with upstream model capabilities, potentially enabling "safetywashing" -- where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
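To make the described analysis concrete, below is a minimal sketch of the kind of capabilities-safety correlation measurement the abstract refers to. The benchmark data here are synthetic placeholders, and the specific choices of collapsing capability benchmarks into a single score via PCA and using Spearman correlation are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Minimal sketch of a capabilities-safety correlation analysis.
# Assumptions (not taken from the paper): capability benchmarks are aggregated
# into one "capabilities score" via the first principal component, and Spearman
# correlation is used; the paper's actual procedure may differ in both choices.
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

# Hypothetical data: rows = models, columns = benchmarks (scores in [0, 1]).
rng = np.random.default_rng(0)
n_models = 30
capability_scores = rng.uniform(0.3, 0.9, size=(n_models, 5))  # e.g. knowledge/reasoning benchmarks
safety_scores = rng.uniform(0.2, 0.8, size=n_models)           # one safety benchmark

# Collapse the capability benchmarks into a single upstream-capabilities axis.
capabilities_axis = PCA(n_components=1).fit_transform(capability_scores).ravel()

# Correlate the safety benchmark with the capabilities axis across models.
rho, p_value = spearmanr(safety_scores, capabilities_axis)
print(f"capabilities-safety correlation: rho={rho:.2f}, p={p_value:.3f}")
# A high correlation would suggest the "safety" benchmark mostly tracks general
# capabilities, i.e. it is susceptible to safetywashing.
```

In this framing, a safety benchmark whose scores rise almost entirely in lockstep with the capabilities axis provides little evidence of safety progress distinct from generic capability gains.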
URL
https://arxiv.org/abs/2407.21792