Abstract
We present CAIA, a benchmark exposing a critical blind spot in AI evaluation: the inability of state-of-the-art models to operate in adversarial, high-stakes environments where misinformation is weaponized and errors are irreversible. While existing benchmarks measure task completion in controlled settings, real-world deployment demands resilience against active deception. Using crypto markets, where $30 billion was lost to exploits in 2024, as a testbed, we evaluate 17 models on 178 time-anchored tasks that require agents to distinguish truth from manipulation, navigate fragmented information landscapes, and make irreversible financial decisions under adversarial pressure. Our results reveal a fundamental capability gap: without tools, even frontier models achieve only 28% accuracy on tasks that junior analysts routinely handle. Tool augmentation improves performance but plateaus at 67.4%, versus an 80% human baseline, despite unlimited access to professional resources. Most critically, we uncover a systematic tool selection catastrophe: models preferentially choose unreliable web search over authoritative data sources, falling for SEO-optimized misinformation and social media manipulation. This behavior persists even when correct answers are directly accessible through specialized tools, suggesting foundational limitations rather than knowledge gaps. We also find that Pass@k metrics mask trial-and-error behavior that is dangerous for autonomous deployment. The implications extend beyond crypto to any domain with active adversaries, such as cybersecurity and content moderation. We release CAIA with contamination controls and continuous updates, establishing adversarial robustness as a necessary condition for trustworthy AI autonomy. The benchmark shows that current models, despite impressive reasoning scores, remain fundamentally unprepared for environments where intelligence must survive active opposition.
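The point about Pass@k can be made concrete with the standard unbiased Pass@k estimator (Chen et al., 2021) contrasted against a stricter "all k attempts must succeed" variant. The sketch below is illustrative only; the specific numbers and the `pass_all_k` variant are our assumptions, not metrics or results reported in the paper.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    attempts sampled from n total attempts (c of them correct) succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """Stricter variant: probability that ALL k sampled attempts succeed,
    closer to what irreversible, one-shot decisions demand."""
    if c < k:
        return 0.0
    return math.comb(c, k) / math.comb(n, k)

# Hypothetical example: a model correct on 4 of 10 attempts looks strong
# under Pass@3 but weak once every attempt must be right.
print(round(pass_at_k(10, 4, 3), 3))   # 0.833
print(round(pass_all_k(10, 4, 3), 3))  # 0.033
```

The gap between the two numbers is the sense in which Pass@k can mask trial-and-error behavior: a high Pass@k only says some retry eventually works, which is little comfort when each attempt is an irreversible financial action.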
URL
https://arxiv.org/abs/2510.00332