Abstract
Large Language Models (LLMs) have revolutionized natural language processing, but their robustness against adversarial attacks remains a critical concern. We present a novel white-box attack approach that exposes vulnerabilities in leading open-source LLMs, including Llama, OPT, and T5. We assess the impact of model size, architecture, and fine-tuning strategies on resistance to adversarial perturbations. Our comprehensive evaluation across five diverse text classification tasks establishes a new benchmark for LLM robustness. The findings of this study have far-reaching implications for the reliable deployment of LLMs in real-world applications and contribute to the advancement of trustworthy AI systems.
URL
https://arxiv.org/abs/2405.02764