Abstract
All of the frontier AI companies have published safety frameworks in which they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended; however, frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes, then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson & Thomas, 2018). We evaluate STPA's ability to broaden the scope, improve traceability, and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in 'A Sketch of an AI Control Safety Case' (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA is able to identify causal factors that may be missed by unstructured hazard analysis methodologies, thereby improving robustness. We suggest that STPA could increase the safety assurance of frontier AI when used to complement or check the coverage of existing AI governance techniques, including capability thresholds, model evaluations, and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that could be conducted by LLMs, reducing the burden on human domain experts.
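To make the STPA workflow referenced in the abstract concrete, the sketch below illustrates (in Python) how the Unsafe Control Action enumeration step is typically structured in STPA: a control structure of controllers issuing control actions over controlled processes, crossed with the four standard STPA guide phrases (Leveson & Thomas, 2018). This is not an artifact of the paper; all class names, functions, and the example actions are hypothetical and merely echo the AI-control deployment setting loosely.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative sketch only (not from the paper): STPA models a control
# structure of controllers acting on controlled processes, then derives
# candidate Unsafe Control Actions (UCAs) by combining each control action
# with STPA's four guide phrases.

@dataclass
class ControlAction:
    controller: str          # entity issuing the action, e.g. a trusted monitor
    controlled_process: str  # entity receiving the action, e.g. an untrusted agent
    action: str              # the command or intervention itself

GUIDE_PHRASES = [
    "not provided when needed",
    "provided when unsafe",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]

def enumerate_candidate_ucas(actions: list[ControlAction]) -> list[str]:
    """Cross each control action with the guide phrases; each candidate is
    then reviewed against the system-level hazards to decide whether it is
    an actual UCA worth tracing to Loss Scenarios."""
    return [
        f"{a.controller} -> {a.controlled_process}: '{a.action}' {phrase}"
        for a, phrase in product(actions, GUIDE_PHRASES)
    ]

# Hypothetical control actions loosely echoing an AI-control safety case
actions = [
    ControlAction("Trusted monitor", "Untrusted agent",
                  "flag suspicious output for audit"),
    ControlAction("Human overseer", "Deployment pipeline",
                  "trigger emergency shutdown"),
]
for uca in enumerate_candidate_ucas(actions):
    print(uca)
```

Each printed candidate is only a starting point; in STPA proper, analysts keep the candidates that could lead to a defined hazard and then work backwards to the Loss Scenarios (causal factors in the control loop) that could produce them.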
URL
https://arxiv.org/abs/2506.01782