Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation

2024-05-07 13:55:11
Atharvan Dogra, Ameet Deshpande, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, Balaraman Ravindran

Abstract

Recent developments in large language models (LLMs) offer a powerful foundation for building natural language agents, but they also raise safety concerns about the models themselves and the autonomous agents built upon them. One capability of particular concern is deception, which we define as an act or statement that misleads, hides the truth, or promotes a belief that is not true in whole or in part. We move away from the conventional understanding of deception as outright lying, objectively selfish decision-making, or the provision of false information, as studied in previous AI safety research, and instead target a specific category of deception achieved through obfuscation and equivocation. We illustrate the two types of deception with an analogy to the rabbit-out-of-the-hat magic trick: either (i) the rabbit comes out of a hidden trap door, or (ii) (our focus) the audience is so thoroughly misdirected that it fails to notice the magician producing the rabbit in plain sight through sleight of hand. Our novel testbed framework exposes the intrinsic deception capabilities of LLM agents in a goal-driven environment: a two-agent adversarial dialogue system built on the legislative task of "lobbying" for a bill, in which one agent is directed to be deceptive in its natural language generations. Within this goal-driven environment, we show that deceptive capacity can be developed through a reinforcement learning setup grounded in theories from the philosophy of language and cognitive psychology. We find that the lobbyist agent increases its deceptive capability by roughly 40% (relative) over successive reinforcement trials of adversarial interaction, while our deception detection mechanism achieves a detection rate of up to 92%. These results highlight potential risks in agent-human interaction, with agents potentially manipulating humans toward their programmed end-goals.
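To make the setup concrete, below is a minimal, hypothetical sketch of the adversarial loop the abstract describes: a lobbyist agent pitches a bill, a detector agent judges whether the pitch is deceptive, and detections feed back as a reinforcement signal over subsequent trials. None of the names (`LobbyistAgent`, `detect_deception`, `run_trials`) come from the paper, and the LLM calls are replaced by stand-in stubs; this is not the authors' implementation.

```python
# Hypothetical sketch of the two-agent adversarial loop described in the
# abstract. Real LLM calls are stubbed out with placeholders.
import random
from dataclasses import dataclass, field

@dataclass
class Trial:
    pitch: str
    detected: bool

@dataclass
class LobbyistAgent:
    # Past trials serve as the reinforcement signal for re-prompting.
    history: list = field(default_factory=list)

    def generate_pitch(self, bill_summary: str) -> str:
        # Placeholder for an LLM call conditioned on the bill and on past
        # trials in which the deception was caught.
        caught = sum(t.detected for t in self.history)
        return f"Pitch for '{bill_summary}' (revised after {caught} detections)"

    def reinforce(self, trial: Trial) -> None:
        # In the paper this is a reinforcement step over adversarial
        # interactions; here we simply log the outcome.
        self.history.append(trial)

def detect_deception(pitch: str) -> bool:
    # Placeholder for the detector agent; a real detector would be an LLM
    # judging obfuscation/equivocation in the pitch. Randomized stand-in.
    return random.random() < 0.5

def run_trials(bill_summary: str, n_trials: int = 5) -> None:
    lobbyist = LobbyistAgent()
    for i in range(n_trials):
        pitch = lobbyist.generate_pitch(bill_summary)
        detected = detect_deception(pitch)
        lobbyist.reinforce(Trial(pitch, detected))
        print(f"trial {i}: detected={detected}")

if __name__ == "__main__":
    run_trials("Clean Energy Subsidy Act")  # illustrative bill name
```

In the paper's framing, the detector's verdicts play the role that the reward signal plays here: each adversarial interaction informs the next reinforcement trial, which is how the reported ~40% relative increase in deceptive capability accrues.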

URL

https://arxiv.org/abs/2405.04325

PDF

https://arxiv.org/pdf/2405.04325.pdf

