Paper Reading AI Learner

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

2025-09-30 22:39:06
Zeshi Dai, Zimo Peng, Zerui Cheng, Ryan Yihe Li

Abstract

We present CAIA, a benchmark exposing a critical blind spot in AI evaluation: the inability of state-of-the-art models to operate in adversarial, high-stakes environments where misinformation is weaponized and errors are irreversible. While existing benchmarks measure task completion in controlled settings, real-world deployment demands resilience against active deception. Using crypto markets as a testbed, where $30 billion was lost to exploits in 2024, we evaluate 17 models on 178 time-anchored tasks requiring agents to distinguish truth from manipulation, navigate fragmented information landscapes, and make irreversible financial decisions under adversarial pressure. Our results reveal a fundamental capability gap: without tools, even frontier models achieve only 28% accuracy on tasks junior analysts routinely handle. Tool augmentation improves performance but plateaus at 67.4% versus an 80% human baseline, despite unlimited access to professional resources. Most critically, we uncover a systematic tool selection catastrophe: models preferentially choose unreliable web search over authoritative data, falling for SEO-optimized misinformation and social media manipulation. This behavior persists even when correct answers are directly accessible through specialized tools, suggesting foundational limitations rather than knowledge gaps. We also find that Pass@k metrics mask trial-and-error behavior that is dangerous for autonomous deployment. The implications extend beyond crypto to any domain with active adversaries, such as cybersecurity and content moderation. We release CAIA with contamination controls and continuous updates, establishing adversarial robustness as a necessary condition for trustworthy AI autonomy. The benchmark reveals that current models, despite impressive reasoning scores, remain fundamentally unprepared for environments where intelligence must survive active opposition.
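The abstract's point about Pass@k can be made concrete. The page does not give the paper's exact formula, but the standard unbiased Pass@k estimator from code-generation benchmarks (Chen et al., 2021) illustrates the masking effect: an agent that succeeds only occasionally still scores highly at k > 1, even though its per-attempt reliability, which is what matters when each decision is irreversible, is low. The numbers below are illustrative, not from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k
    samples drawn (without replacement) from n attempts,
    c of which were correct, succeeds."""
    if n - c < k:
        return 1.0  # fewer than k failures, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical agent: 3 correct answers out of 10 attempts.
print(round(pass_at_k(10, 3, 5), 3))  # → 0.917 (looks reliable at k=5)
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3  (actual per-attempt reliability)
```

Pass@5 suggests a near-reliable agent, while Pass@1 reveals that seven of ten irreversible decisions would have failed, which is the gap the abstract warns about.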

URL

https://arxiv.org/abs/2510.00332

PDF

https://arxiv.org/pdf/2510.00332.pdf

