Abstract
This paper introduces v0.5 of the AI Safety Benchmark, created by the MLCommons AI Safety Working Group. The benchmark is designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English) and a limited set of personas (typical, malicious, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which seven have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024; the v1.0 benchmark will provide meaningful insights into the safety of AI systems. The v0.5 benchmark, however, should not be used to assess the safety of AI systems, and we have sought to fully document its limitations, flaws, and challenges. The v0.5 release includes: (1) a principled approach to specifying and constructing the benchmark, comprising use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items (i.e., prompts), with 43,090 test items in total, created with templates; (4) a grading system for scoring AI systems against the benchmark; (5) an openly available platform and downloadable tool, called ModelBench, that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report benchmarking the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
URL
https://arxiv.org/abs/2404.12241