
LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model

2024-03-18 10:53:00
Yuxin Cao, Jinghao Li, Xi Xiao, Derui Wang, Minhui Xue, Hao Ge, Wei Liu, Guangwu Hu

Abstract

Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can compromise such models with a low query budget when the perturbations are semantic-invariant, as in StyleFool. Despite this query efficiency, the naturalness of fine-detail areas still needs improvement, since StyleFool applies style transfer to all pixels in each frame. To close the gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalable usability of the Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain temporal consistency. Then, we add style-transfer-based perturbations to several regions, selected by a combined criterion of transfer-based gradient information and regional area. Fine adjustment of the perturbation then follows to make the stylized videos adversarial. We demonstrate through a human-assessed survey that LocalStyleFool can improve both intra-frame and inter-frame naturalness while maintaining a competitive fooling rate and query efficiency. Successful experiments on a high-resolution dataset also show that the fine-grained segmentation of SAM helps improve the scalability of adversarial attacks on high-resolution data.
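
The region-selection step described in the abstract can be illustrated with a small sketch. The abstract does not state the actual criterion, so everything here is an assumption made for illustration: the weighted-sum score, the weight `alpha`, the region count `k`, and the function names `region_scores` and `select_regions` are all hypothetical. The masks stand in for SAM output and the gradient map stands in for gradients from a surrogate model; both are given as plain nested lists.

```python
from typing import Dict, List, Sequence

Mask = Sequence[Sequence[bool]]   # H x W boolean mask for one segmented region
Grid = Sequence[Sequence[float]]  # H x W map of gradient magnitudes


def region_scores(masks: Dict[str, Mask], grad: Grid, alpha: float = 0.5) -> Dict[str, float]:
    """Score each region by its share of gradient mass and its share of frame area.

    Hypothetical criterion (not taken from the paper):
        score = alpha * gradient_share + (1 - alpha) * area_share
    """
    height, width = len(grad), len(grad[0])
    total_grad = sum(sum(row) for row in grad) or 1.0  # avoid division by zero
    total_area = height * width
    scores: Dict[str, float] = {}
    for name, mask in masks.items():
        area = 0
        mass = 0.0
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    area += 1
                    mass += grad[y][x]
        scores[name] = alpha * (mass / total_grad) + (1 - alpha) * (area / total_area)
    return scores


def select_regions(masks: Dict[str, Mask], grad: Grid, k: int = 2, alpha: float = 0.5) -> List[str]:
    """Return the names of the k highest-scoring regions, best first."""
    scores = region_scores(masks, grad, alpha)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy 2 x 4 frame: gradients concentrate on the left half.
grad = [
    [4.0, 4.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
masks = {
    "person": [[True, True, False, False], [False, False, False, False]],
    "floor": [[False, False, False, False], [True, True, False, False]],
    "sky": [[False, False, True, True], [False, False, True, True]],
}

print(select_regions(masks, grad, k=2, alpha=0.5))
print(select_regions(masks, grad, k=2, alpha=1.0))
```

With equal weighting, the large but gradient-free "sky" region outranks the small "floor" region. With `alpha=1.0` only gradient mass counts, so "floor" is chosen instead. In a full pipeline the selected masks would then be tracked across frames and stylized; that part is outside this sketch.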


URL

https://arxiv.org/abs/2403.11656

PDF

https://arxiv.org/pdf/2403.11656.pdf

