Abstract
As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK, conducted a joint multilingual evaluation exercise. Led by Singapore AISI, the exercise tested two open-weight models across ten languages spanning high- and low-resource groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages, including differences in safeguard robustness across languages and harm types, and variation in evaluator reliability (LLM-as-judge versus human review). It also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.
URL
https://arxiv.org/abs/2601.15706