Abstract
Large Language Models (LLMs) can be \emph{misused} to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in model-generated outputs, enabling their detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against \emph{non-adaptive} attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and propose preference-based optimization to tune \emph{adaptive} attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks substantially outperform non-adaptive baselines, (ii) even in a non-adaptive setting, adaptive attacks optimized against a few known watermarks remain highly effective when tested against other unseen watermarks, and (iii) optimization-based attacks are practical and require less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attackers.
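As a rough illustration (not the paper's implementation) of what "formulating robustness as an objective function" and "preference-based optimization" could look like in practice, the sketch below scores paraphrase candidates by trading off watermark-detection evasion against text quality, and turns the rankings into preference pairs that a DPO-style trainer could consume. All names here (`attack_objective`, `detect_watermark`, `quality_score`, `build_preference_pairs`) are hypothetical placeholders introduced for this example.

```python
# Hypothetical sketch of an adaptive-attack objective and preference-pair
# construction; the detector and quality scorer are assumed to be given
# as callables and are NOT taken from the paper's code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # paraphrase preferred under the attack objective
    rejected: str    # paraphrase dispreferred under the attack objective


def attack_objective(
    text: str,
    detect_watermark: Callable[[str], float],   # detection score in [0, 1]
    quality_score: Callable[[str], float],      # text-quality score in [0, 1]
    quality_weight: float = 1.0,
) -> float:
    """Higher is better for the attacker: high quality, low detectability."""
    return quality_weight * quality_score(text) - detect_watermark(text)


def build_preference_pairs(
    prompts: List[str],
    candidates: List[List[str]],                # paraphrase candidates per prompt
    detect_watermark: Callable[[str], float],
    quality_score: Callable[[str], float],
) -> List[PreferencePair]:
    """Rank each prompt's candidates by the attack objective and keep a
    best/worst pair; such pairs could then feed a DPO-style optimizer."""
    pairs: List[PreferencePair] = []
    for prompt, cands in zip(prompts, candidates):
        scored = sorted(
            cands,
            key=lambda t: attack_objective(t, detect_watermark, quality_score),
        )
        if len(scored) >= 2:
            pairs.append(
                PreferencePair(prompt=prompt, chosen=scored[-1], rejected=scored[0])
            )
    return pairs
```

Under this framing, the attacker only needs access to a (surrogate) detector and a quality metric; the resulting preference pairs would then be used to tune the attack against the targeted watermarking scheme.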
URL
https://arxiv.org/abs/2410.02440