Extending Information Bottleneck Attribution to Video Sequences

2025-01-28 12:19:44
Veronika Solopova, Lucas Schmidt, Dorothea Kolossa

Abstract

We introduce VIBA, a novel approach for explainable video classification by adapting Information Bottlenecks for Attribution (IBA) to video sequences. While most traditional explainability methods are designed for image models, our IBA framework addresses the need for explainability in temporal models used for video analysis. To demonstrate its effectiveness, we apply VIBA to video deepfake detection, testing it on two architectures: the Xception model for spatial features and a VGG11-based model for capturing motion dynamics through optical flow. Using a custom dataset that reflects recent deepfake generation techniques, we adapt IBA to create relevance and optical flow maps, visually highlighting manipulated regions and motion inconsistencies. Our results show that VIBA generates temporally and spatially consistent explanations, which align closely with human annotations, thus providing interpretability for video classification and particularly for deepfake detection.
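The core mechanism of IBA, which VIBA extends to video, is a noise bottleneck inserted at an intermediate layer: features R are partially replaced by Gaussian noise, and a learned mask λ controls how much information passes through. The sketch below is illustrative only (function names are hypothetical, and it assumes the per-element Gaussian approximation of the mutual-information term used in the original IBA formulation); it shows the two pieces that get optimized, the noisy substitution and the information penalty.

```python
import numpy as np

def bottleneck(features, lam, mu, sigma, rng):
    """Noisy substitution Z = lam * R + (1 - lam) * eps, with eps ~ N(mu, sigma^2).

    lam in [0, 1]: 1 keeps the feature unchanged, 0 replaces it with noise.
    mu and sigma are the (estimated) feature mean and standard deviation.
    """
    eps = rng.normal(mu, sigma, size=features.shape)
    return lam * features + (1.0 - lam) * eps

def information_loss(lam, features, mu, sigma):
    """Per-element KL( N(lam*R + (1-lam)*mu, (1-lam)^2 sigma^2) || N(mu, sigma^2) ),
    a variational upper bound on I(R; Z) used as the bottleneck penalty."""
    m = lam * (features - mu) / sigma     # standardized mean shift of Z given R
    v = (1.0 - lam) ** 2                  # variance ratio of Z given R to the prior
    kl = 0.5 * (v + m ** 2 - np.log(v + 1e-8) - 1.0)
    return kl.mean()
```

During attribution, λ is optimized per feature element (and, for video, per frame) to minimize the classification loss plus β times this information term; the optimized λ map is then read off as the relevance map. At λ = 0 the penalty vanishes (Z carries no information about R), and it grows as λ approaches 1.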

URL

https://arxiv.org/abs/2501.16889

PDF

https://arxiv.org/pdf/2501.16889.pdf
