On Corrigibility and Alignment in Multi Agent Games

2025-01-09 16:44:38
Edmund Dable-Heath, Boyko Vodenicharski, James Bishop

Abstract

Corrigibility of autonomous agents is an underexplored part of system design, with previous work focusing on single-agent systems. It has been suggested that uncertainty over human preferences acts to keep agents corrigible, even in the face of human irrationality. We present a general framework for modelling corrigibility in a multi-agent setting as a two-player game in which the agents always have a move that lets them ask the human for supervision. This is formulated as a Bayesian game in order to introduce uncertainty over the human's beliefs. We further analyse two specific cases. First, a two-player corrigibility game, in which we want both agents to display corrigibility in both common-payoff (monotone) games and harmonic games. Second, an adversarial setting, in which one agent is considered a 'defending' agent and the other an 'adversary'. We provide a general result on the beliefs over the games and over human rationality that the defending agent must hold in order to induce corrigibility.
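The claim that uncertainty over human preferences can keep an agent corrigible, and that the required belief depends on human rationality, can be made concrete with a toy calculation. The sketch below is not the paper's model: it is a single-agent simplification, and the names `move_values`, `best_move`, `belief` and `p_rational` are hypothetical. It assumes the agent holds a discrete belief over the utility of its action, that waiting is worth 0, and that a human who is asked approves good actions and blocks bad ones with probability `p_rational`, and does the opposite otherwise.

```python
def move_values(belief, p_rational):
    """Expected value of each move under the agent's belief.

    belief: list of (probability, utility) pairs for the agent's action
            (hypothetical toy belief, probabilities should sum to 1).
    p_rational: probability that the human decides correctly when asked.
    """
    # Acting without supervision: plain expected utility of the action.
    act = sum(p * u for p, u in belief)
    # Doing nothing is assumed to be worth 0.
    wait = 0.0
    # Asking: a correct human lets through only good actions (u > 0),
    # a mistaken human lets through only bad ones (u < 0).
    ask = sum(
        p * (p_rational * max(u, 0.0) + (1.0 - p_rational) * min(u, 0.0))
        for p, u in belief
    )
    return {"act": act, "wait": wait, "ask": ask}


def best_move(belief, p_rational):
    """Return the move with the highest expected value."""
    values = move_values(belief, p_rational)
    return max(values, key=values.get)


if __name__ == "__main__":
    uncertain = [(0.5, 1.0), (0.5, -1.0)]   # agent is unsure the action is good
    confident = [(0.9, 1.0), (0.1, -1.0)]   # agent is fairly sure it is good
    print(best_move(uncertain, 1.0), move_values(uncertain, 1.0))
    print(best_move(confident, 0.6), move_values(confident, 0.6))
```

Under these assumptions the uncertain agent facing a fully rational human prefers to ask (expected value 0.5 against 0 for acting), while the confident agent facing a noisy human prefers to act (0.8 against 0.5 for asking). This is only meant to illustrate why a condition on beliefs and human rationality is needed; the paper's two-player Bayesian game is richer than this sketch.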

Abstract (translated)

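The corrigibility of autonomous agents is an underexplored part of system design, with previous research focusing mainly on single-agent systems. It has been suggested that uncertainty over human preferences can keep agents corrigible even in the face of human irrationality. We present a general framework for corrigibility in a multi-agent setting, modelled as a two-player game in which the agents always have a move that lets them ask the human for supervision. To introduce uncertainty over the human's beliefs, we formulate the problem as a Bayesian game. We further analyse two specific cases: 1. **Two-player corrigibility game**: here we want both agents to display corrigibility in both common-payoff (monotone) games and harmonic games. This means that, in cooperative and competitive environments alike, the agents can adjust their behaviour according to human guidance. 2. **Adversarial setting**: here one agent is regarded as the 'defender' and the other as the 'adversary'. We provide a general result on which beliefs the defending agent must hold in order to induce corrigibility. Specifically, this result concerns the relationship between the type of game and human rationality, and determines the beliefs that must be held to ensure that the agent responds to and obeys human supervision. Through these analyses, we can better understand how to design self-correcting AI in multi-agent systems, so that it correctly serves humanity's best interests even in complex social and technical environments.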

URL

https://arxiv.org/abs/2501.05360

PDF

https://arxiv.org/pdf/2501.05360.pdf

