Abstract
Large Language Models (LLMs) have revolutionized natural language processing, but their robustness against adversarial attacks remains a critical concern. We present a novel white-box attack approach that exposes vulnerabilities in leading open-source LLMs, including Llama, OPT, and T5. We assess the impact of model size, architecture, and fine-tuning strategies on resistance to adversarial perturbations. Our comprehensive evaluation across five diverse text classification tasks establishes a new benchmark for LLM robustness. The findings of this study have far-reaching implications for the reliable deployment of LLMs in real-world applications and contribute to the advancement of trustworthy AI systems.
URL
https://arxiv.org/abs/2405.02764