The rise of short-form videos on platforms like TikTok has brought new challenges in safeguarding young viewers from inappropriate content. Traditional moderation methods often fall short in handling the vast and rapidly changing landscape of user-generated videos, increasing the risk of children encountering harmful material. This paper introduces TikGuard, a transformer-based deep learning approach aimed at detecting and flagging content unsuitable for children on TikTok. By using a specially curated dataset, TikHarm, and leveraging advanced video classification techniques, TikGuard achieves an accuracy of 86.7%, showing a notable improvement over existing methods in similar contexts. While direct comparisons are limited by the uniqueness of the TikHarm dataset, TikGuard's performance highlights its potential in enhancing content moderation, contributing to a safer online experience for minors. This study underscores the effectiveness of transformer models in video classification and sets a foundation for future research in this area.
https://arxiv.org/abs/2410.00403
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads on these platforms can vary significantly, especially when it comes to parallel processing, which is a critical consideration for edge deployments. To address this, we conduct a comprehensive study comparing the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions. We find that the Neural Processing Unit (NPU) excels at matrix-vector multiplication (58.6% faster) and some neural network tasks (3.2$\times$ faster for video classification and large language models). The GPU outperforms in matrix multiplication (22.6% faster) and LSTM networks (2.7$\times$ faster), while the CPU excels at less parallel operations such as the dot product. NPU-based inference offers a balance of latency and throughput at lower power consumption. GPU-based inference, though more energy-intensive, performs best with large dimensions and batch sizes. We highlight the potential of heterogeneous computing solutions for edge AI, where diverse compute units can be strategically leveraged to deliver accurate, real-time inference.
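To make the kind of comparison above concrete, here is a minimal benchmarking sketch in PyTorch that times a dot product, a matrix-vector product, and a matrix-matrix product on the CPU and, if present, a GPU. The matrix size, iteration counts, and the use of PyTorch rather than the paper's SoC-specific toolchains are assumptions for illustration only.

```python
# Hedged sketch: rough latency comparison of dot / matvec / matmul per device.
import time
import torch

def time_op(fn, warmup=5, iters=50):
    """Return mean wall-clock latency in milliseconds for fn()."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

def benchmark(device, n=2048):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    v = torch.randn(n, device=device)
    return {
        "dot":    time_op(lambda: torch.dot(v, v)),   # low parallelism, CPU-friendly
        "matvec": time_op(lambda: torch.mv(a, v)),    # NPU-favoured in the study
        "matmul": time_op(lambda: torch.mm(a, b)),    # GPU-favoured in the study
    }

if __name__ == "__main__":
    devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
    for dev in devices:
        print(dev, {k: f"{t:.3f} ms" for k, t in benchmark(dev).items()})
```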
https://arxiv.org/abs/2409.14803
Event cameras offer low-power visual sensing capabilities ideal for edge-device applications. However, their high event rate, a consequence of their fine temporal detail, can be restrictive in terms of bandwidth and computational resources. In edge AI applications, determining the minimum number of events required for a specific task allows the event rate to be reduced, improving bandwidth, memory, and processing efficiency. In this paper, we study the effect of event subsampling on the accuracy of event data classification using convolutional neural network (CNN) models. Surprisingly, across various datasets, the number of events per video can be reduced by an order of magnitude with little drop in accuracy, revealing the extent to which we can push the boundaries of the accuracy vs. event rate trade-off. We also find that the lower classification accuracy at high subsampling rates is not solely attributable to information loss from subsampling the events: training CNNs can itself be challenging in highly subsampled scenarios, where the sensitivity to hyperparameters increases. We quantify training instability across multiple event-based classification datasets using a novel metric for evaluating the hyperparameter sensitivity of CNNs in different subsampling settings. Finally, we analyze the weight gradients of the network to gain insight into this instability.
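As a concrete illustration of reducing the event rate, the sketch below randomly subsamples an event stream; the (x, y, timestamp, polarity) array layout and the uniform-random strategy are assumptions, not the paper's exact subsampling pipeline.

```python
# Hedged sketch: uniform random event subsampling for an event-camera stream.
import numpy as np

def subsample_events(events: np.ndarray, keep_ratio: float, seed: int = 0) -> np.ndarray:
    """Keep a random fraction of events, preserving temporal order.

    events: (N, 4) array with rows (x, y, timestamp, polarity) -- assumed layout.
    """
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(events) * keep_ratio))
    idx = rng.choice(len(events), size=n_keep, replace=False)
    return events[np.sort(idx)]  # sort indices so timestamps stay monotonic

# Example: cut the event rate by an order of magnitude (10x fewer events).
events = np.random.rand(100_000, 4)
sparse = subsample_events(events, keep_ratio=0.1)
print(events.shape, "->", sparse.shape)
```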
https://arxiv.org/abs/2409.08953
Given a video with $T$ frames, frame sampling is the task of selecting $N \ll T$ frames so as to maximize the performance of a fixed video classifier. Not only brute-force search but also most existing methods suffer from the vast search space of $\binom{T}{N}$, especially as $N$ grows large. To address this challenge, we introduce a novel perspective that reduces the search space from $O(T^N)$ to $O(T)$. Instead of exploring the entire $O(T^N)$ space, our proposed semi-optimal policy selects the top $N$ frames based on the independently estimated value of each frame, using per-frame confidence, significantly reducing the computational complexity. We verify that our semi-optimal policy can efficiently approximate the optimal policy, particularly under practical settings. Additionally, through extensive experiments on various datasets and model architectures, we demonstrate that learning our semi-optimal policy ensures stable and high performance regardless of the size of $N$ and $T$.
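A minimal sketch of the semi-optimal selection step is shown below: each frame is scored independently (e.g., by a per-frame confidence), and the top $N$ frames are kept in temporal order, turning the $O(T^N)$ search into a single $O(T)$ pass. The per-frame scorer here is a hypothetical stand-in for the learned per-frame value.

```python
# Hedged sketch: top-N frame selection from independent per-frame scores.
import torch

def semi_optimal_sample(frame_scores: torch.Tensor, n: int) -> torch.Tensor:
    """frame_scores: (T,) per-frame confidence; returns sorted indices of the top-N frames."""
    top = torch.topk(frame_scores, k=min(n, frame_scores.numel())).indices
    return torch.sort(top).values  # keep temporal order for the downstream classifier

# Example with a hypothetical per-frame scorer (e.g., max softmax confidence per frame).
T, N = 300, 8
scores = torch.rand(T)
idx = semi_optimal_sample(scores, N)  # indices of the N frames to feed the fixed classifier
print(idx)
```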
https://arxiv.org/abs/2409.05260
Spiking Neural Networks (SNNs) have emerged as a compelling, energy-efficient alternative to traditional Artificial Neural Networks (ANNs) for static image tasks such as image classification and segmentation. However, in the more complex video classification domain, SNN-based methods fall considerably short of ANN-based benchmarks due to the challenges in processing dense frame sequences. To bridge this gap, we propose ReSpike, a hybrid framework that synergizes the strengths of ANNs and SNNs to tackle action recognition tasks with high accuracy and low energy cost. By decomposing film clips into spatial and temporal components, i.e., RGB image Key Frames and event-like Residual Frames, ReSpike leverages ANN for learning spatial information and SNN for learning temporal information. In addition, we propose a multi-scale cross-attention mechanism for effective feature fusion. Compared to state-of-the-art SNN baselines, our ReSpike hybrid architecture demonstrates significant performance improvements (e.g., >30% absolute accuracy improvement on HMDB-51, UCF-101, and Kinetics-400). Furthermore, ReSpike achieves comparable performance with prior ANN approaches while bringing better accuracy-energy tradeoff.
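The decomposition into spatial and temporal components can be sketched as below: key frames are taken at a fixed stride for the spatial (ANN) stream, and event-like residual frames are formed as successive frame differences for the temporal (SNN) stream. The stride and the plain frame difference are simplifying assumptions, not ReSpike's exact preprocessing.

```python
# Hedged sketch: splitting a clip into RGB key frames and residual frames.
import torch

def decompose_clip(clip: torch.Tensor, key_stride: int = 8):
    """clip: (T, C, H, W) float video tensor.

    Returns (key_frames, residual_frames); residuals are successive frame differences.
    """
    key_frames = clip[::key_stride]          # spatial stream (handled by the ANN)
    residual_frames = clip[1:] - clip[:-1]   # temporal, event-like stream (handled by the SNN)
    return key_frames, residual_frames

clip = torch.rand(32, 3, 112, 112)
keys, residuals = decompose_clip(clip)
print(keys.shape, residuals.shape)  # (4, 3, 112, 112) (31, 3, 112, 112)
```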
https://arxiv.org/abs/2409.01564
Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64\% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96\% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80\%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
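A compact audio-visual fusion head in this spirit might look like the PyTorch sketch below, where visual tokens attend over audio tokens through a single cross-attention layer before pooling and classification. The dimensions, the single fusion layer, and the label count are assumptions rather than the exact Attend-Fusion architecture.

```python
# Hedged sketch: attention-based audio-visual fusion followed by a shared classifier.
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_classes=3862):  # label count assumed for a YouTube-8M-style setup
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, visual, audio):
        """visual: (B, Tv, dim), audio: (B, Ta, dim)."""
        fused, _ = self.cross_attn(query=visual, key=audio, value=audio)
        fused = self.norm(visual + fused)           # residual fusion of the two modalities
        return self.classifier(fused.mean(dim=1))   # pool over time, then classify

model = AVFusion()
logits = model(torch.randn(2, 30, 512), torch.randn(2, 30, 512))
print(logits.shape)  # (2, 3862)
```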
https://arxiv.org/abs/2408.14441
In the domain of black-box model extraction, conventional methods reliant on soft labels or surrogate datasets struggle with scaling to high-dimensional input spaces and managing the complexity of an extensive array of interrelated classes. In this work, we present a novel approach that utilizes SHAP (SHapley Additive exPlanations) to enhance synthetic data generation. SHAP quantifies the individual contributions of each input feature towards the victim model's output, facilitating the optimization of an energy-based GAN towards a desirable output. This method significantly boosts performance, achieving a 16.45% increase in the accuracy of image classification models and extending to video classification models with an average improvement of 26.11% and a maximum of 33.36% on challenging datasets such as UCF11, UCF101, Kinetics 400, Kinetics 600, and Something-Something V2. We further demonstrate the effectiveness and practical utility of our method under various scenarios, including the availability of top-k prediction probabilities, top-k prediction labels, and top-1 labels.
https://arxiv.org/abs/2408.02140
Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identifying people, relationships, and reasons). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or using offline models (such as Whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of solely relying on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question-answering). The code will be public at this https URL.
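The interleaved multimodal sequence can be pictured as a simple typed list, as in the sketch below; the schema, field names, and helper function are illustrative assumptions, not MovieSeq's actual data format.

```python
# Hedged sketch: a toy schema for an interleaved multimodal sequence.
from dataclasses import dataclass
from typing import List, Literal, Tuple, Dict

@dataclass
class SeqElement:
    kind: Literal["image", "video", "plot", "subtitle", "character"]
    content: str          # file path for visual elements, raw text otherwise
    speaker: str = ""     # optional character name for subtitles/photos

def build_interleaved_sequence(shots: List[str], subtitles: List[Tuple[str, str]],
                               plot: str, characters: Dict[str, str]) -> List[SeqElement]:
    """Interleave plot text, character photos, shots, and dialogue into one sequence."""
    seq = [SeqElement("plot", plot)]
    for name, photo in characters.items():
        seq.append(SeqElement("character", photo, speaker=name))
    for shot, (speaker, line) in zip(shots, subtitles):
        seq.append(SeqElement("video", shot))
        seq.append(SeqElement("subtitle", line, speaker=speaker))
    return seq

seq = build_interleaved_sequence(
    shots=["shot_001.mp4"],
    subtitles=[("Alice", "We need to leave tonight.")],
    plot="Two friends plan an escape.",
    characters={"Alice": "alice.jpg"})
print([e.kind for e in seq])
```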
https://arxiv.org/abs/2407.21757
Hate speech is a pressing issue in modern society, with significant effects both online and offline. Recent research in hate speech detection has primarily centered on text-based media, largely overlooking multimodal content such as videos. Existing studies on hateful video datasets have predominantly focused on English content within a Western context and have been limited to binary labels (hateful or non-hateful), lacking detailed contextual information. This study presents MultiHateClip, a novel multilingual dataset created through hate lexicons and human annotation. It aims to enhance the detection of hateful videos on platforms such as YouTube and Bilibili, including content in both English and Chinese. Comprising 2,000 videos annotated for hatefulness, offensiveness, and normalcy, this dataset provides a cross-cultural perspective on gender-based hate speech. Through a detailed examination of human annotation results, we discuss the differences between Chinese and English hateful videos and underscore the importance of different modalities in hateful and offensive video analysis. Evaluations of state-of-the-art video classification models, such as VLM, GPT-4V, and Qwen-VL, on MultiHateClip highlight the existing challenges in accurately distinguishing between hateful and offensive content and the urgent need for models that are both multimodally and culturally nuanced. MultiHateClip represents a foundational advance in enhancing hateful video detection by underscoring the necessity of a multimodal and culturally sensitive approach to combating online hate speech.
https://arxiv.org/abs/2408.03468
Previous deepfake detection methods mostly depend on low-level textural features that are vulnerable to perturbations and fall short of detecting unseen forgery methods. In contrast, high-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, and thus generalize more strongly. Motivated by this, we propose a detection method that utilizes high-level semantic features of faces to identify inconsistencies in the temporal domain. We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video classification network, initialized with a meta-functional face encoder for enriched facial representation. In this way, we can take advantage of both the powerful spatio-temporal model and the high-level semantic information of faces. Furthermore, to leverage easily accessible real face data and guide the model toward spatio-temporal features, we design a Dynamic Video Self-Blending (DVSB) method to efficiently generate training samples with diverse spatio-temporal forgery traces from real facial videos. Based on this, we advance our framework with a two-stage training approach: the first stage employs a novel self-supervised contrastive learning scheme, in which we encourage the network to focus on forgery traces by impelling videos generated by the same forgery process to have similar representations. Building on the representation learned in the first stage, the second stage fine-tunes on a face forgery detection dataset to build a deepfake detector. Extensive experiments validate that UniForensics outperforms existing face forgery methods in generalization ability and robustness. In particular, our method achieves 95.3\% and 77.2\% cross-dataset AUC on the challenging Celeb-DFv2 and DFDC, respectively.
https://arxiv.org/abs/2407.19079
Vision Transformers (ViT) have emerged as the de facto choice for numerous industry-grade vision solutions. However, their inference cost can be prohibitive for many settings, as they compute self-attention in each layer, which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, which aims to exploit this information sparsity to reduce ViT inference cost. LookupViT provides a novel general-purpose vision transformer block that operates by compressing information from higher-resolution tokens to a fixed number of tokens. These few compressed tokens undergo meticulous processing, while the higher-resolution tokens are passed through computationally cheaper layers. Information sharing between these two token sets is enabled through a bidirectional cross-attention mechanism. The approach offers multiple advantages - (a) easy to implement on standard ML accelerators (GPUs/TPUs) via standard high-level operators, (b) applicable to standard ViT and its variants, thus generalizing to various tasks, (c) can handle different tokenization and attention approaches. LookupViT also offers flexibility for the compressed tokens, enabling performance-computation trade-offs in a single trained model. We show LookupViT's effectiveness on multiple domains - (a) image classification (ImageNet-1K and ImageNet-21K), (b) video classification (Kinetics400 and Something-Something V2), (c) image captioning (COCO-Captions) with a frozen encoder. LookupViT provides a $2\times$ reduction in FLOPs while upholding or improving accuracy across these domains. In addition, LookupViT also demonstrates out-of-the-box robustness and generalization on image classification (ImageNet-C,R,A,O), improving by up to $4\%$ over ViT.
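A minimal PyTorch sketch of such a block is given below: a small set of learned compressed tokens gathers information from the full token set via cross-attention, receives the heavy processing, and writes the result back with a second cross-attention. Layer sizes, the number of compressed tokens, and the omission of the cheaper per-token MLPs are simplifying assumptions, not LookupViT's exact design.

```python
# Hedged sketch: compressed-token block with bidirectional cross-attention.
import torch
import torch.nn as nn

class LookupBlock(nn.Module):
    def __init__(self, dim=384, n_heads=6, n_compressed=16):
        super().__init__()
        self.compressed = nn.Parameter(torch.randn(1, n_compressed, dim) * 0.02)
        self.lookup = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.process = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.writeback = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):
        """tokens: (B, N, dim) full-resolution tokens."""
        z = self.compressed.expand(tokens.size(0), -1, -1)
        z, _ = self.lookup(query=z, key=tokens, value=tokens)   # compress N tokens into few
        z = self.process(z)                                     # heavy compute on few tokens only
        upd, _ = self.writeback(query=tokens, key=z, value=z)   # share information back
        return tokens + upd, z                                  # cheap residual update of full tokens

block = LookupBlock()
out, compressed = block(torch.randn(2, 196, 384))
print(out.shape, compressed.shape)
```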
https://arxiv.org/abs/2407.12753
Face forgery videos have elicited critical public concern, and various detectors have been proposed. However, fully supervised detectors can easily overfit to specific forgery methods or videos, and existing self-supervised detectors impose strict requirements on auxiliary tasks, such as requiring audio or multiple modalities, leading to limited generalization and robustness. In this paper, we examine whether we can address this issue by leveraging visual-only real face videos. To this end, we propose to learn the Natural Consistency representation (NACO) of real face videos in a self-supervised manner, inspired by the observation that fake videos struggle to maintain natural spatiotemporal consistency even under unknown forgery methods and different perturbations. Our NACO first extracts spatial features of each frame with CNNs and then integrates them into a Transformer to learn a long-range spatiotemporal representation, leveraging the respective advantages of CNNs and Transformers in local spatial receptive fields and long-term memory. Furthermore, a Spatial Predictive Module (SPM) and a Temporal Contrastive Module (TCM) are introduced to enhance natural consistency representation learning. The SPM predicts randomly masked spatial features from the spatiotemporal representation, and the TCM regularizes the latent distance of the spatiotemporal representation by shuffling the natural order to disturb the consistency, both of which force NACO to be more sensitive to natural spatiotemporal consistency. After the representation learning stage, an MLP head is fine-tuned to perform the usual forgery video classification task. Extensive experiments show that our method outperforms other state-of-the-art competitors with impressive generalization and robustness.
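The temporal side of this idea can be sketched with a shuffling-based contrastive term, as below: representations of the naturally ordered clip and a frame-shuffled version of it are pushed apart. The margin-style loss and the toy encoder are illustrative assumptions, not the paper's exact TCM formulation.

```python
# Hedged sketch: penalize similarity between natural-order and shuffled-order clip embeddings.
import torch
import torch.nn.functional as F

def shuffle_frames(clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, C, H, W); return the same frames in a random order."""
    return clip[torch.randperm(clip.size(0))]

def temporal_contrast_loss(encode, clip, margin: float = 0.5):
    """encode: callable mapping (T, C, H, W) -> (D,) spatiotemporal embedding."""
    z_nat = F.normalize(encode(clip), dim=-1)
    z_shuf = F.normalize(encode(shuffle_frames(clip)), dim=-1)
    # Push shuffled (temporally inconsistent) views at least `margin` away in cosine similarity.
    return F.relu(F.cosine_similarity(z_nat, z_shuf, dim=-1) - (1.0 - margin)).mean()

# Example with a hypothetical encoder that averages frames and projects them.
proj = torch.nn.Linear(3 * 64 * 64, 128)
encode = lambda c: proj(c.mean(dim=0).flatten())
print(temporal_contrast_loss(encode, torch.rand(16, 3, 64, 64)))
```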
https://arxiv.org/abs/2407.10550
Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.
https://arxiv.org/abs/2407.09073
Deepfake techniques generate highly realistic data, making it challenging for humans to discern between actual and artificially generated images. Recent advancements in deep learning-based deepfake detection methods, particularly with diffusion models, have shown remarkable progress. However, there is a growing demand for real-world applications to detect unseen individuals, deepfake techniques, and scenarios. To address this limitation, we propose a Prototype-based Unified Framework for Deepfake Detection (PUDD). PUDD offers a detection system based on similarity, comparing input data against known prototypes for video classification and identifying potential deepfakes or previously unseen classes by analyzing drops in similarity. Our extensive experiments reveal three key findings: (1) PUDD achieves an accuracy of 95.1% on Celeb-DF, outperforming state-of-the-art deepfake detection methods; (2) PUDD leverages image classification as the upstream task during training, demonstrating promising performance in both image classification and deepfake detection tasks during inference; (3) PUDD requires only 2.7 seconds for retraining on new data and emits $10^{5}$ times less carbon compared to the state-of-the-art model, making it significantly more environmentally friendly.
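Prototype-based detection of this kind can be sketched in a few lines: class prototypes are mean embeddings of known classes, a query is assigned to the most similar prototype, and a similarity below a threshold flags a potential deepfake or unseen class. The embedding source and the threshold value are assumptions for illustration.

```python
# Hedged sketch: prototype construction and similarity-threshold classification.
import torch
import torch.nn.functional as F

def build_prototypes(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """embeddings: (N, D), labels: (N,) -> (num_classes, D) L2-normalized class prototypes."""
    protos = torch.stack([embeddings[labels == c].mean(0) for c in labels.unique()])
    return F.normalize(protos, dim=-1)

def classify(query: torch.Tensor, prototypes: torch.Tensor, threshold: float = 0.6):
    """query: (D,) embedding. Returns (class index, similarity); -1 means unseen/deepfake suspect."""
    sims = F.normalize(query, dim=-1) @ prototypes.T
    best = sims.argmax().item()
    return (best if sims[best] >= threshold else -1), sims[best].item()

emb = F.normalize(torch.randn(100, 256), dim=-1)   # stand-in for upstream image/video embeddings
labels = torch.randint(0, 5, (100,))
protos = build_prototypes(emb, labels)
print(classify(torch.randn(256), protos))
```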
https://arxiv.org/abs/2406.15921
Recent advancements in Machine Unlearning (MU) have introduced solutions to selectively remove certain training samples, such as those with outdated or sensitive information, from trained models. Despite these advancements, evaluation of MU methods has been inconsistent, employing different trained models, architectures, and sample-removal strategies, which hampers accurate comparison. In addition, prior MU approaches have mainly focused on singular tasks or modalities, which is not comprehensive. To address these limitations, we develop MU-Bench, the first comprehensive benchmark for MU that (i) unifies the sets of deleted samples and trained models, and (ii) provides broad coverage of tasks and data modalities, including previously unexplored domains such as speech and video classification. Our evaluation shows that RandLabel and SalUn are the most effective general MU approaches on MU-Bench, and that BadT and SCRUB are capable of achieving random performance on the deletion set. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases. MU-Bench provides an easy-to-use package that includes dataset splits, models, and implementations, together with a leaderboard to enable unified and scalable MU research.
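As a rough picture of one of the stronger baselines reported above, the sketch below implements a RandLabel-style unlearning loop: the forget set is fine-tuned with randomly reassigned labels while the retained data keeps its true labels. The optimizer, learning rate, and pairing of the two loaders are assumptions for illustration, not MU-Bench's exact protocol.

```python
# Hedged sketch: RandLabel-style unlearning by fine-tuning on scrambled forget-set labels.
import torch
import torch.nn.functional as F

def randlabel_unlearn(model, forget_loader, retain_loader, num_classes, epochs=1, lr=1e-4):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (xf, yf), (xr, yr) in zip(forget_loader, retain_loader):
            rand_y = torch.randint(0, num_classes, yf.shape)   # scramble labels of the forget set
            loss = F.cross_entropy(model(xf), rand_y) \
                 + F.cross_entropy(model(xr), yr)              # keep fitting the retained data
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```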
https://arxiv.org/abs/2406.14796
Human action recognition in dark videos is a challenging task for computer vision. Recent research focuses on applying dark-enhancement methods to improve the visibility of the video. However, such video processing results in the loss of critical information in the original (unenhanced) video. Conversely, traditional two-stream methods are capable of learning information from both the original and processed videos, but this can lead to a significant increase in computational cost during inference for video classification. To address these challenges, we propose a novel teacher-student video classification framework, named Dual-Light KnowleDge Distillation for Action Recognition in the Dark (DL-KDD). This framework enables the model to learn from both the original and enhanced video without introducing additional computational cost during inference. Specifically, DL-KDD utilizes knowledge distillation during training. The teacher model is trained with enhanced video, and the student model is trained with both the original video and the soft targets generated by the teacher model. This teacher-student framework allows the student model to predict actions using only the original input video during inference. In our experiments, the proposed DL-KDD framework outperforms state-of-the-art methods on the ARID, ARID V1.5, and Dark-48 datasets. We achieve the best performance on each dataset, with up to a 4.18% improvement on Dark-48, using only original video inputs, thus avoiding the use of a two-stream framework or enhancement modules at inference. We further validate the effectiveness of the distillation strategy in ablation experiments. The results highlight the advantages of our knowledge distillation framework for dark human action recognition.
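The distillation objective can be sketched as a standard soft-target loss, as below: the student, which sees only the original dark clip, matches both the hard labels and the teacher's softened predictions computed from the enhanced clip. The temperature, loss weighting, and enhancement function are assumptions rather than the paper's exact settings.

```python
# Hedged sketch: cross-entropy on hard labels plus KL to the teacher's softened outputs.
import torch
import torch.nn.functional as F

def dl_kdd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard knowledge-distillation objective with temperature T and mixing weight alpha."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

# Training-step sketch: the teacher runs on enhance(dark_clip), the student on dark_clip only.
# loss = dl_kdd_loss(student(dark_clip), teacher(enhance(dark_clip)).detach(), labels)
```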
https://arxiv.org/abs/2406.02468
Understanding Activities of Daily Living (ADLs) is a crucial step for different applications including assistive robots, smart homes, and healthcare. However, to date, few benchmarks and methods have focused on complex ADLs, especially those involving multi-person interactions in home environments. In this paper, we propose a new dataset and benchmark, InteractADL, for understanding complex ADLs that involve interaction between humans (and objects). Furthermore, complex ADLs occurring in home environments comprise a challenging long-tailed distribution due to the rarity of multi-person interactions, and pose fine-grained visual recognition tasks due to the presence of semantically and visually similar classes. To address these issues, we propose a novel method for fine-grained few-shot video classification called Name Tuning that enables greater semantic separability by learning optimal class name vectors. We show that Name Tuning can be combined with existing prompt tuning strategies to learn the entire input text (rather than only learning the prompt or class names) and demonstrate improved performance for few-shot classification on InteractADL and 4 other fine-grained visual classification benchmarks. For transparency and reproducibility, we release our code at this https URL.
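The core of Name Tuning can be sketched as a set of learnable class-name vectors scored against frozen video features by cosine similarity, as below; replacing the text encoder with a plain learnable matrix and the chosen temperature are simplifying assumptions, not the paper's actual setup.

```python
# Hedged sketch: learnable class-name vectors scored against frozen video embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NameTunedClassifier(nn.Module):
    def __init__(self, num_classes, dim=512, temperature=0.07):
        super().__init__()
        # One learnable "name" vector per class, optimized for semantic separability.
        self.class_names = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.temperature = temperature

    def forward(self, video_emb):
        """video_emb: (B, dim) frozen video features -> (B, num_classes) logits."""
        v = F.normalize(video_emb, dim=-1)
        c = F.normalize(self.class_names, dim=-1)
        return v @ c.T / self.temperature

clf = NameTunedClassifier(num_classes=20)
logits = clf(torch.randn(4, 512))
loss = F.cross_entropy(logits, torch.randint(0, 20, (4,)))
print(loss.item())
```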
https://arxiv.org/abs/2406.01662
In an era of rapidly evolving internet technology, the surge in multimodal content, including videos, has expanded the horizons of online communication. However, the detection of toxic content in this diverse landscape, particularly in low-resource code-mixed languages, remains a critical challenge. While substantial research has addressed toxic content detection in textual data, the realm of video content, especially in non-English languages, has been relatively underexplored. This paper addresses this research gap by introducing a benchmark dataset, the first of its kind, consisting of 931 videos with 4021 code-mixed Hindi-English utterances collected from YouTube. Each utterance within this dataset has been meticulously annotated for toxicity, severity, and sentiment labels. We have developed ToxVidLLM, an advanced multimodal multitask framework for toxicity detection in video content that leverages Large Language Models (LLMs) and addresses this primary objective alongside the additional tasks of sentiment and severity analysis. ToxVidLLM incorporates three key modules, the Encoder module, the Cross-Modal Synchronization module, and the Multitask module, crafting a generic multimodal LLM customized for intricate video classification tasks. Our experiments reveal that incorporating multiple modalities from the videos substantially enhances the performance of toxic content detection, achieving an Accuracy and Weighted F1 score of 94.29% and 94.35%, respectively.
https://arxiv.org/abs/2405.20628
Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of identifying fake AI-generated videos and mitigating the potential harm caused by fake information. However, the lack of large-scale datasets from the most advanced video generators poses a barrier to the development of such detectors. To address this gap, we introduce the first AI-generated video detection dataset, GenVideo. It features the following characteristics: (1) a large volume of videos, including over one million AI-generated and real videos collected; (2) a rich diversity of generated content and methodologies, covering a broad spectrum of video categories and generation techniques. We conducted extensive studies of the dataset and proposed two evaluation methods tailored for real-world-like scenarios to assess the detectors' performance: the cross-generator video classification task assesses the generalizability of trained detectors across generators; the degraded video classification task evaluates the robustness of detectors in handling videos whose quality has degraded during dissemination. Moreover, we introduce a plug-and-play module, named Detail Mamba (DeMamba), designed to enhance detectors by identifying AI-generated videos through the analysis of inconsistencies in temporal and spatial dimensions. Our extensive experiments demonstrate DeMamba's superior generalizability and robustness on GenVideo compared to existing detectors. We believe that the GenVideo dataset and the DeMamba module will significantly advance the field of AI-generated video detection. Our code and dataset will be available at \url{this https URL}.
https://arxiv.org/abs/2405.19707
Semi-supervised learning suffers from the imbalance of labeled and unlabeled training data in the video surveillance scenario. In this paper, we propose a new semi-supervised learning method called SIAVC for industrial accident video classification. Specifically, we design a video augmentation module called the Super Augmentation Block (SAB). SAB adds Gaussian noise and randomly masks video frames according to historical loss on the unlabeled data for model optimization. Then, we propose a Video Cross-set Augmentation Module (VCAM) to generate diverse pseudo-label samples from the high-confidence unlabeled samples, which alleviates the mismatch of sampling experience and provides high-quality training data. Additionally, we construct a new industrial accident surveillance video dataset with frame-level annotation, namely ECA9, to evaluate our proposed method. Compared with the state-of-the-art semi-supervised learning based methods, SIAVC demonstrates outstanding video classification performance, achieving 88.76\% and 89.13\% accuracy on ECA9 and Fire Detection datasets, respectively. The source code and the constructed dataset ECA9 will be released in \url{this https URL}.
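The augmentation side of SAB can be sketched as below: Gaussian noise is added and a fraction of frames is masked, with the mask ratio driven by a running loss on the unlabeled data. The linear mapping from loss to mask ratio and the noise level are illustrative assumptions, not the paper's exact schedule.

```python
# Hedged sketch: loss-driven video augmentation (Gaussian noise + random frame masking).
import torch

def super_augment(clip: torch.Tensor, hist_loss: float, noise_std: float = 0.05,
                  max_mask_ratio: float = 0.5) -> torch.Tensor:
    """clip: (T, C, H, W) in [0, 1]. Higher historical loss -> more frames masked."""
    noisy = clip + noise_std * torch.randn_like(clip)
    mask_ratio = max_mask_ratio * min(1.0, hist_loss)   # clamp the loss-driven ratio
    n_mask = int(clip.size(0) * mask_ratio)
    if n_mask > 0:
        idx = torch.randperm(clip.size(0))[:n_mask]
        noisy[idx] = 0.0                                # zero out the masked frames
    return noisy.clamp(0.0, 1.0)

aug = super_augment(torch.rand(16, 3, 112, 112), hist_loss=0.8)
print(aug.shape)
```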
https://arxiv.org/abs/2405.14506