Paper Reading AI Learner

Vid-SME: Membership Inference Attacks against Large Video Understanding Models

2025-05-29 13:17:25
Qi Li, Runpeng Yu, Xinchao Wang

Abstract

Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Determining whether a video was improperly used during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods scale poorly as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly because they fail to capture the inherent temporal variations of video frames and to account for how model behavior changes as the number of frames varies. To address these challenges, we introduce Vid-SME, the first membership inference method tailored to video data used in video understanding LLMs (VULLMs). Vid-SME leverages the confidence of the model output and integrates adaptive parameterization to compute the Sharma-Mittal entropy (SME) of video inputs. Using the SME difference between natural and temporally reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model's training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.
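To make the abstract's mechanism concrete, below is a minimal Python sketch of the quantity involved: the standard two-parameter Sharma-Mittal entropy of a next-token probability distribution, and a toy membership score formed from the entropy gap between the natural and temporally reversed frame orders. The function names, the fixed (q, r) values, and the sign convention of the gap are illustrative assumptions; the paper's adaptive per-video parameterization is not reproduced here.

```python
import numpy as np

def sharma_mittal_entropy(probs, q, r):
    """Standard Sharma-Mittal entropy H_{q,r}(P) for q != 1, r != 1.

    H_{q,r}(P) = [ (sum_i p_i^q)^((1-r)/(1-q)) - 1 ] / (1 - r).
    It recovers Renyi entropy as r -> 1, Tsallis as r -> q,
    and Shannon as both q, r -> 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    s = np.sum(probs ** q)
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

def vid_sme_membership_score(nat_dists, rev_dists, q=0.5, r=2.0):
    """Toy membership score (illustrative, not the paper's exact rule).

    nat_dists / rev_dists: per-token next-token probability distributions
    produced by the model when the video frames are fed in natural order
    vs. temporally reversed order. The gap between the average per-token
    SME of the two orderings serves as the membership signal; the fixed
    (q, r) stand in for the paper's adaptive parameterization.
    """
    h_nat = np.mean([sharma_mittal_entropy(p, q, r) for p in nat_dists])
    h_rev = np.mean([sharma_mittal_entropy(p, q, r) for p in rev_dists])
    return h_rev - h_nat

# Toy usage with random distributions standing in for model outputs.
rng = np.random.default_rng(0)
nat = [rng.dirichlet(np.ones(32)) for _ in range(8)]
rev = [rng.dirichlet(np.ones(32)) for _ in range(8)]
print(vid_sme_membership_score(nat, rev))
```

In an actual attack, the distributions would presumably come from the VULLM's output probabilities over the response tokens, obtained once with the sampled frames in natural order and once in reversed order, and the resulting score would be thresholded to decide membership.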

URL

https://arxiv.org/abs/2506.03179

PDF

https://arxiv.org/pdf/2506.03179.pdf

