Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation

2024-05-07 13:55:11
Atharvan Dogra, Ameet Deshpande, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, Balaraman Ravindran

Abstract

Recent developments in large language models (LLMs) offer a powerful foundation for building natural language agents, but they also raise safety concerns about the models themselves and the autonomous agents built upon them. One capability of particular concern is deception, which we define as an act or statement that misleads, hides the truth, or promotes a belief that is not true in whole or in part. We move away from the conventional understanding of deception as outright lying, objectively selfish decision-making, or the provision of false information, as studied in previous AI safety research, and instead target a specific category of deception achieved through obfuscation and equivocation. We illustrate the two types of deception with an analogy to the rabbit-out-of-the-hat magic trick: either (i) the rabbit comes out of a hidden trap door, or (ii) (our focus) the audience is so thoroughly misdirected that it fails to notice the magician producing the rabbit in plain sight through sleight of hand. Our novel testbed framework exposes the intrinsic deception capabilities of LLM agents in a goal-driven environment: a two-agent adversarial dialogue system built on the legislative task of "lobbying" for a bill, in which one agent is directed to be deceptive in its natural language generations. Within this goal-driven environment, we show that deceptive capacity can be developed through a reinforcement learning setup grounded in theories from the philosophy of language and cognitive psychology. We find that the lobbyist agent increases its deceptive capability by roughly 40% (relative) over successive reinforcement trials of adversarial interaction, while our deception detection mechanism achieves a detection rate of up to 92%. These results highlight potential risks in agent-human interaction, with agents potentially manipulating humans toward their programmed end-goals.
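To make the setup concrete, below is a minimal, hypothetical sketch of the adversarial loop the abstract describes: a lobbyist agent pitches a bill, a detector agent judges whether the pitch is deceptive, and detections feed back as a reinforcement signal over subsequent trials. None of the names (`LobbyistAgent`, `detect_deception`, `run_trials`) come from the paper, and the LLM calls are replaced by stand-in stubs; this is not the authors' implementation.

```python
# Hypothetical sketch of the two-agent adversarial loop described in the
# abstract. Real LLM calls are stubbed out with placeholders.
import random
from dataclasses import dataclass, field

@dataclass
class Trial:
    pitch: str
    detected: bool

@dataclass
class LobbyistAgent:
    # Past trials serve as the reinforcement signal for re-prompting.
    history: list = field(default_factory=list)

    def generate_pitch(self, bill_summary: str) -> str:
        # Placeholder for an LLM call conditioned on the bill and on past
        # trials in which the deception was caught.
        caught = sum(t.detected for t in self.history)
        return f"Pitch for '{bill_summary}' (revised after {caught} detections)"

    def reinforce(self, trial: Trial) -> None:
        # In the paper this is a reinforcement step over adversarial
        # interactions; here we simply log the outcome.
        self.history.append(trial)

def detect_deception(pitch: str) -> bool:
    # Placeholder for the detector agent; a real detector would be an LLM
    # judging obfuscation/equivocation in the pitch. Randomized stand-in.
    return random.random() < 0.5

def run_trials(bill_summary: str, n_trials: int = 5) -> None:
    lobbyist = LobbyistAgent()
    for i in range(n_trials):
        pitch = lobbyist.generate_pitch(bill_summary)
        detected = detect_deception(pitch)
        lobbyist.reinforce(Trial(pitch, detected))
        print(f"trial {i}: detected={detected}")

if __name__ == "__main__":
    run_trials("Clean Energy Subsidy Act")  # illustrative bill name
```

In the paper's framing, the detector's verdicts play the role that the reward signal plays here: each adversarial interaction informs the next reinforcement trial, which is how the reported ~40% relative increase in deceptive capability accrues.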

URL

https://arxiv.org/abs/2405.04325

PDF

https://arxiv.org/pdf/2405.04325.pdf

