Paper Reading AI Learner

Optimizing Adaptive Attacks against Content Watermarks for Language Models

2024-10-03 12:37:39
Abdulrahman Diaa, Toluwani Aremu, Nils Lukas

Abstract

Large Language Models (LLMs) can be misused to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in model-generated outputs, enabling their detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers, who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and propose preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks substantially outperform non-adaptive baselines; (ii) even in a non-adaptive setting, adaptive attacks optimized against a few known watermarks remain highly effective when tested against other, unseen watermarks; and (iii) optimization-based attacks are practical, requiring less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attackers.
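The robustness objective the abstract describes can be sketched in a few lines. Below is a minimal, hypothetical illustration assuming a green-list-style watermark: a secret key pseudo-randomly marks tokens as "green", watermarked text over-samples green tokens, and a one-sided z-test detects the bias. The attacker's objective then trades off text quality against the detector's score. All names, the detector form, and the `lam` trade-off parameter are illustrative stand-ins, not the paper's actual method.

```python
import hashlib
import math

def green_fraction(tokens, key):
    """Fraction of tokens in the secret 'green list'.

    The key pseudo-randomly splits the vocabulary in half; watermarked
    generation would bias sampling toward green tokens (toy model).
    """
    def is_green(tok):
        digest = hashlib.sha256(f"{key}:{tok}".encode()).digest()
        return digest[0] % 2 == 0
    return sum(is_green(t) for t in tokens) / len(tokens)

def z_score(tokens, key, p=0.5):
    """One-sided z-statistic: how far the green fraction exceeds chance p."""
    n = len(tokens)
    return (green_fraction(tokens, key) - p) * math.sqrt(n) / math.sqrt(p * (1 - p))

def attack_objective(tokens, key, quality, lam=1.0):
    """Schematic adaptive-attack objective: reward evading detection
    (low detector score) while preserving a quality score for the
    paraphrased text. An adaptive attacker tunes a paraphraser to
    maximize this against the specific watermark."""
    return quality - lam * max(z_score(tokens, key), 0.0)
```

In this toy framing, a non-adaptive attacker paraphrases blindly (optimizing `quality` alone), while the adaptive attacker optimizes the combined objective against the known detector, which is what makes the attack stronger.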


URL

https://arxiv.org/abs/2410.02440

PDF

https://arxiv.org/pdf/2410.02440.pdf

