Abstract
Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Determining whether a video was improperly used during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods scale poorly as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly because they fail to capture the inherent temporal variations of video frames and to account for differences in model behavior as the number of frames varies. To address these challenges, we introduce Vid-SME, the first membership inference method tailored for video data used in video understanding LLMs (VULLMs). Vid-SME exploits the model's output confidence and integrates adaptive parameterization to compute Sharma-Mittal entropy (SME) for video inputs. By leveraging the SME difference between natural and temporally-reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model's training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.
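The abstract does not spell out the scoring rule, but the core ingredients it names can be sketched. Below is a minimal illustration, assuming the standard two-parameter Sharma-Mittal entropy over per-token probability distributions and a simple mean-difference score between the natural and temporally reversed frame orders; the function names, default `q`/`r` values, and the averaging scheme are hypothetical stand-ins for the paper's adaptive parameterization, not its exact method.

```python
import numpy as np

def sharma_mittal_entropy(p, q=1.5, r=0.5):
    """Two-parameter Sharma-Mittal entropy of a probability vector p.

    S_{q,r}(p) = [ (sum_i p_i^q)^((1-r)/(1-q)) - 1 ] / (1 - r),
    which generalizes Renyi (r -> 1) and Tsallis (r -> q) entropies.
    The fixed q, r defaults here are placeholders; the paper tunes
    them adaptively per video input.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability entries to avoid 0**q issues
    return (np.sum(p ** q) ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

def vid_sme_score(probs_natural, probs_reversed, q=1.5, r=0.5):
    """Illustrative membership score: difference of average SME between
    the temporally reversed and natural frame orders. Each argument is a
    list of next-token probability distributions produced by the model
    for the corresponding frame ordering (a sketch of the idea only;
    the paper's actual scoring rule may differ)."""
    h_nat = np.mean([sharma_mittal_entropy(p, q, r) for p in probs_natural])
    h_rev = np.mean([sharma_mittal_entropy(p, q, r) for p in probs_reversed])
    return h_rev - h_nat
```

Intuitively, a model that memorized a video should be more confident (lower entropy) on the natural frame order than on the reversed one, so member videos yield a larger gap; thresholding the score then gives the membership decision.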
URL
https://arxiv.org/abs/2506.03179