Abstract
Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can compromise such models with a low query budget when the perturbations are semantic-invariant, as in StyleFool. Despite this query efficiency, the naturalness of fine-detail areas still needs improvement, because StyleFool applies style transfer to every pixel in each frame. To close the gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalable usability of the Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain temporal consistency. Then, we add style-transfer-based perturbations to several regions selected by a combined criterion of transfer-based gradient information and regional area. Fine adjustment of the perturbation follows to make the stylized videos adversarial. We demonstrate through a human-assessed survey that LocalStyleFool can improve both intra-frame and inter-frame naturalness while maintaining a competitive fooling rate and query efficiency. Successful experiments on a high-resolution dataset also show that SAM's meticulous segmentation helps improve the scalability of adversarial attacks on high-resolution data.
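The region-selection step mentioned above (choosing regions by transfer-based gradient information together with regional area) can be illustrated with a minimal sketch. The abstract does not give the actual criterion, so everything here is an assumption made for illustration: the function name `select_regions`, the weighted-sum score, the `area_weight` parameter, and the use of a per-pixel gradient-magnitude map from a surrogate model are all hypothetical and are not the authors' method.

```python
import numpy as np

def select_regions(masks, grad, k=2, area_weight=0.5):
    """Rank segmented regions by a HYPOTHETICAL score (not the paper's formula).

    masks: bool array of shape (R, H, W), one mask per region (e.g. from SAM).
    grad:  float array of shape (H, W), gradient magnitude from a surrogate model.
    Returns the indices of the top-k regions and the score of every region.
    """
    masks = np.asarray(masks, dtype=bool)
    grad = np.abs(np.asarray(grad, dtype=float))
    total_grad = grad.sum() + 1e-12  # avoid division by zero

    # Share of the total gradient magnitude that falls inside each region.
    grad_share = np.array([grad[m].sum() / total_grad for m in masks])
    # Share of the frame area covered by each region.
    area_share = masks.reshape(len(masks), -1).sum(axis=1) / grad.size

    # Assumed combination: a simple weighted sum of the two shares.
    score = (1.0 - area_weight) * grad_share + area_weight * area_share
    order = np.argsort(-score)  # highest score first
    return order[:k].tolist(), score

# Toy 4x4 frame split into three regions.
demo_masks = np.zeros((3, 4, 4), dtype=bool)
demo_masks[0, :2, :2] = True   # top-left, small but high gradient
demo_masks[1, :2, 2:] = True   # top-right, small and low gradient
demo_masks[2, 2:, :] = True    # bottom half, large and low gradient
demo_grad = np.ones((4, 4))
demo_grad[:2, :2] = 10.0

selected, scores = select_regions(demo_masks, demo_grad, k=2)
print(selected)  # -> [0, 2]
```

In this toy frame the high-gradient region and the largest region are selected, which shows how the two signals can trade off; the real attack would then stylize only the selected regions and track them across frames.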
URL
https://arxiv.org/abs/2403.11656