Abstract
Corrigibility of autonomous agents is an underexplored part of system design, with previous work focusing on single-agent systems. It has been suggested that uncertainty over human preferences acts to keep agents corrigible, even in the face of human irrationality. We present a general framework for modelling corrigibility in a multi-agent setting as a two-player game in which the agents always have a move available to ask the human for supervision. This is formulated as a Bayesian game in order to introduce uncertainty over the human's beliefs. We further analyse two specific cases. First, a two-player corrigibility game, in which we want corrigibility to be displayed by both agents, for both common-payoff (monotone) games and harmonic games. Second, an adversarial setting, in which one agent is considered a `defending' agent and the other an `adversary'. We provide a general result characterising the beliefs over the game and over human rationality that the defending agent must hold in order to induce corrigibility.
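To make the setup concrete, below is a minimal illustrative sketch, not the paper's formal model: two agents each choose between acting autonomously and asking the human for supervision, the human's rationality is represented by a Boltzmann temperature drawn from a prior, and an agent is treated as corrigible here if, under its beliefs, deferring has at least the expected payoff of acting alone. The payoff matrix, the Boltzmann parameterisation, and all function names are hypothetical choices for illustration; the paper's actual Bayesian-game formulation may differ.

```python
# Hypothetical sketch of a corrigibility check for one agent in a two-player
# common-payoff game with an always-available "ask the human" move.
import numpy as np

# Illustrative 2x2 common-payoff game: rows are agent 1's autonomous actions,
# columns are agent 2's autonomous actions.
PAYOFF = np.array([[1.0, -2.0],
                   [-2.0, 3.0]])

def boltzmann_choice_probs(utilities, beta):
    """Probability the human picks each joint action (softmax with rationality beta)."""
    z = beta * (utilities - utilities.max())
    p = np.exp(z)
    return p / p.sum()

def expected_payoff_if_asked(payoff, beta):
    """Expected payoff when the human is asked and selects a joint action."""
    flat = payoff.flatten()
    return float(boltzmann_choice_probs(flat, beta) @ flat)

def expected_payoff_if_autonomous(payoff, belief_over_other):
    """Agent 1 acts on its own, best-responding to a belief over agent 2's action."""
    row_values = payoff @ belief_over_other   # value of each of agent 1's actions
    return float(row_values.max())

def prefers_to_ask(payoff, beta_prior, belief_over_other):
    """Does deferring beat acting, in expectation over the prior on the human's rationality?"""
    betas, probs = beta_prior
    ask_value = sum(p * expected_payoff_if_asked(payoff, b) for b, p in zip(betas, probs))
    act_value = expected_payoff_if_autonomous(payoff, belief_over_other)
    return ask_value >= act_value, ask_value, act_value

if __name__ == "__main__":
    # Prior over the human's rationality: mostly rational, sometimes noisy.
    beta_prior = ([5.0, 0.1], [0.7, 0.3])
    # Agent 1's (hypothetical) belief that agent 2 plays each action with probability 0.5.
    belief_over_other = np.array([0.5, 0.5])
    corrigible, v_ask, v_act = prefers_to_ask(PAYOFF, beta_prior, belief_over_other)
    print(f"ask: {v_ask:.3f}  act: {v_act:.3f}  corrigible: {corrigible}")
```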
Abstract (translated)
The corrigibility of autonomous agents is an underexplored part of system design, with previous research concentrating on single-agent systems. It has been suggested that, even in the face of human irrationality, uncertainty over human preferences keeps agents corrigible. We present a general framework for corrigibility in a multi-agent setting, modelled as a two-player game in which the agents always have a move allowing them to ask the human for supervision. To introduce uncertainty over beliefs about the human, we formulate the problem as a Bayesian game. We then analyse two specific cases:

1. **Two-player corrigibility game**: here we want both agents to display corrigibility, in both common-payoff (monotone) games and harmonic games; that is, the agents adjust their behaviour according to human guidance in both cooperative and competitive settings.

2. **Adversarial setting**: one agent is treated as the `defender' and the other as the `adversary'. We provide a general result on the beliefs the defending agent must hold in order to induce corrigibility; specifically, this result relates the type of game and the human's rationality, and identifies the beliefs required to guarantee that the agent responds to and complies with human supervision.

Through this analysis, we gain a better understanding of how to design self-correcting AI in multi-agent systems, so that it serves humanity's best interests even in complex social and technical environments.
URL
https://arxiv.org/abs/2501.05360