Abstract
Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically exhibiting lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the global accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI, existing benchmarks focus exclusively on English, leaving multilingual settings unexplored. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks: GAIA (real-world tasks), SWE-bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into ten diverse languages, resulting in 805 unique tasks and 8,855 language-specific instances in total. Our benchmark suite enables a systematic analysis of how multilingual contexts affect agent performance and robustness. Empirically, we observe consistent degradation in both performance and security when moving from English to other languages, with the severity varying by task and correlating with the amount of translated input. Building on these findings, we provide actionable recommendations to guide the development and assessment of agentic AI systems in multilingual settings. This work establishes a standardized evaluation framework and encourages future research towards equitable, reliable, and globally accessible agentic AI. The MAPS benchmark suite is publicly available at this https URL.
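For context on the reported counts, assuming the total includes the English originals alongside the ten translated versions of each task: 805 unique tasks × (1 English version + 10 translations) = 8,855 language-specific instances.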
URL
https://arxiv.org/abs/2505.15935