In the current era of machine learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, and action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time, in an explainable and programmatic way, to connect learning-based state-of-the-art vision and language models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich, and relevant textual descriptions for videos collected from a variety of datasets, using both standard metrics (e.g., BLEU, ROUGE) and the modern LLM-as-a-Jury approach.
https://arxiv.org/abs/2501.08460
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at this https URL
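The core recipe described here (tokenize video frames, then train a causal transformer to predict the next visual token) can be sketched in a few lines. The following is a minimal, hedged illustration with placeholder sizes and random tokens standing in for a real visual tokenizer; it is not the authors' Toto implementation.

    import torch
    import torch.nn as nn

    class TinyCausalVideoLM(nn.Module):
        """Minimal causal transformer over discrete visual tokens (illustrative only)."""
        def __init__(self, vocab_size=8192, dim=256, n_heads=4, n_layers=4, max_len=1024):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, dim)
            self.pos = nn.Embedding(max_len, dim)
            layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(dim, vocab_size)

        def forward(self, tokens):                       # tokens: (B, T) int64
            B, T = tokens.shape
            x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
            mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
            x = self.blocks(x, mask=mask)                # causal self-attention
            return self.head(x)                          # (B, T, vocab)

    # next-token prediction loss on a batch of (already tokenized) video clips
    model = TinyCausalVideoLM()
    tokens = torch.randint(0, 8192, (2, 64))             # stand-in for tokenized frames
    logits = model(tokens[:, :-1])
    loss = nn.functional.cross_entropy(logits.reshape(-1, 8192), tokens[:, 1:].reshape(-1))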
https://arxiv.org/abs/2501.05453
Automated viewpoint classification in echocardiograms can help under-resourced clinics and hospitals provide faster diagnosis and screening when expert technicians may not be available. We propose a novel approach to echocardiographic viewpoint classification. We show that treating viewpoint classification as video classification rather than image classification yields an advantage. We propose a CNN-GRU architecture with a novel temporal feature weaving method, which leverages both spatial and temporal information to yield a 4.33% increase in accuracy over baseline image classification while using only four consecutive frames. The proposed approach incurs minimal computational overhead. Additionally, we publish the Neonatal Echocardiogram Dataset (NED), a professionally annotated dataset providing sixteen viewpoints and associated echocardiography videos to encourage future work and development in this field. Code available at: this https URL
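For readers unfamiliar with the general setup, the sketch below shows a generic CNN-per-frame encoder followed by a GRU over four consecutive frames. The paper's temporal feature weaving method is not reproduced here; the layer sizes, sixteen-class output, and 112x112 grayscale input are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class CnnGruViewClassifier(nn.Module):
        """Generic CNN-per-frame + GRU-over-time classifier; the paper's temporal
        feature weaving step is NOT reproduced here."""
        def __init__(self, n_views=16, feat_dim=128):
            super().__init__()
            self.cnn = nn.Sequential(                    # tiny per-frame encoder
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
            self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
            self.fc = nn.Linear(feat_dim, n_views)

        def forward(self, clip):                         # clip: (B, T=4, 1, H, W)
            B, T = clip.shape[:2]
            f = self.cnn(clip.flatten(0, 1)).view(B, T, -1)   # per-frame features
            _, h = self.gru(f)                           # temporal aggregation
            return self.fc(h[-1])                        # logits over 16 viewpoints

    logits = CnnGruViewClassifier()(torch.randn(2, 4, 1, 112, 112))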
https://arxiv.org/abs/2501.03967
Challenges in cross-learning involve an inhomogeneous or even inadequate amount of training data and a lack of resources for retraining large pretrained models. Inspired by transfer learning techniques in NLP, namely adapters and prefix tuning, this paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T, that adapts transformer-based models to new or missing information. We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient as the backbone and other plugins do not need to be finetuned alongside these additions. Comparative and ablation studies on three datasets (Epic-Kitchens-100, MPIIGroupInteraction, and UDIVA v0.5) show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters relative to the backbone for processing video input, and only 22.3% trainable parameters for two additional modalities, we achieve results comparable to and even better than the state of the art. CM3T has no specific requirements for training or pretraining and is a step towards bridging the gap between a general model and specific practical applications of video classification.
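The adapter idea can be illustrated with two tiny modules: a residual bottleneck adapter for transfer learning and a cross-attention adapter that injects a second modality into frozen backbone features. This is a generic sketch of such plugin blocks, not CM3T's actual implementation; all dimensions and names are placeholders.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Residual bottleneck adapter: only these small layers are trained."""
        def __init__(self, dim=768, r=64):
            super().__init__()
            self.down, self.up = nn.Linear(dim, r), nn.Linear(r, dim)
        def forward(self, x):
            return x + self.up(torch.relu(self.down(x)))

    class CrossAttentionAdapter(nn.Module):
        """Injects an extra modality (e.g. audio features) via cross-attention."""
        def __init__(self, dim=768, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        def forward(self, video_tokens, extra_tokens):
            out, _ = self.attn(video_tokens, extra_tokens, extra_tokens)
            return video_tokens + out                    # residual fusion

    # frozen backbone features (stand-ins); only the adapters carry gradients
    video = torch.randn(2, 196, 768)
    audio = torch.randn(2, 32, 768)
    fused = CrossAttentionAdapter()(BottleneckAdapter()(video), audio)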
https://arxiv.org/abs/2501.03332
This work aims to predict the popularity of short videos using the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. This study employs video classification models with different architectures and training methods as backbone networks to extract video-modality features. Meanwhile, the cleaned video captions are incorporated into a carefully designed prompt framework, along with the video, as input for video-to-text generation models, which generate a detailed text-based understanding of the video content. These texts are then encoded into vectors using a pre-trained BERT model. Based on the six sets of vectors mentioned above, a neural network is trained for each of the four prediction metrics. Moreover, the study conducts data mining and feature engineering on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Multiple machine learning models are trained, and the most stable model, XGBoost, is selected. Finally, the predictions from the neural network and XGBoost models are averaged to obtain the final result.
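A hedged sketch of the tabular part of this pipeline is shown below: hand-engineered features like those listed above feed both a small neural network and an XGBoost model, and the two predictions are averaged. Column names, data, and hyperparameters are hypothetical, not the study's configuration.

    import numpy as np
    import pandas as pd
    from xgboost import XGBRegressor
    from sklearn.neural_network import MLPRegressor

    # hypothetical tabular features like those described in the abstract
    df = pd.DataFrame({
        "hashtag_count": np.random.randint(0, 10, 200),
        "mention_count": np.random.randint(0, 5, 200),
        "duration_s": np.random.uniform(5, 60, 200),
        "frame_count": np.random.randint(100, 2000, 200),
        "hours_online": np.random.uniform(1, 500, 200),
    })
    y = np.random.poisson(100, 200).astype(float)        # stand-in for view count

    X = df.values
    nn_model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, y)
    xgb_model = XGBRegressor(n_estimators=200, max_depth=5).fit(X, y)

    # final prediction: simple average of the two models, as in the abstract
    pred = (nn_model.predict(X) + xgb_model.predict(X)) / 2.0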
https://arxiv.org/abs/2501.01422
Both the few-shot learning and domain adaptation sub-fields of Computer Vision have seen significant recent progress in terms of the availability of state-of-the-art algorithms and datasets. Frameworks have been developed for each sub-field; however, building a common system or framework that combines both has not been explored. As part of our research, we present the first unified framework that combines domain adaptation with the few-shot learning setting across 3 different tasks - image classification, object detection and video classification. Our framework is highly modular, with the capability to support few-shot learning with or without the inclusion of domain adaptation depending on the algorithm. Furthermore, the most important configurable feature of our framework is the on-the-fly setup for incremental $n$-shot tasks, with the optional capability to configure the system to scale to a traditional many-shot task. With increasing focus on Self-Supervised Learning (SSL) in current few-shot learning approaches, our system also supports multiple SSL pre-training configurations. To test our framework's capabilities, we provide benchmarks on a wide range of algorithms and datasets across different task and problem settings. The code is open source and has been made publicly available here: this https URL
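The on-the-fly incremental $n$-shot setup can be illustrated with a generic episode sampler that draws an $n$-way, $k$-shot support set plus queries from a labeled pool; raising $k$ scales the same sampler towards a many-shot regime. This is not the framework's actual API, just a sketch of the mechanism.

    import random
    from collections import defaultdict

    def sample_episode(labels, n_way=5, k_shot=1, q_queries=5, seed=None):
        """Sample indices for an n-way, k-shot episode from a flat list of labels."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for idx, lab in enumerate(labels):
            by_class[lab].append(idx)
        classes = rng.sample(sorted(by_class), n_way)
        support, query = [], []
        for c in classes:
            picks = rng.sample(by_class[c], k_shot + q_queries)
            support += picks[:k_shot]
            query += picks[k_shot:]
        return support, query

    # e.g. scale from few-shot towards many-shot simply by raising k_shot
    labels = [i % 10 for i in range(1000)]
    support, query = sample_episode(labels, n_way=5, k_shot=1)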
https://arxiv.org/abs/2412.16275
Human perception integrates multiple modalities, such as vision, hearing, and language, into a unified understanding of the surrounding reality. While recent multimodal models have achieved significant progress by aligning pairs of modalities via contrastive learning, their solutions are unsuitable when scaling to multiple modalities. These models typically align each modality to a designated anchor without ensuring the alignment of all modalities with each other, leading to suboptimal performance in tasks requiring a joint understanding of multiple modalities. In this paper, we structurally rethink the conventional pairwise approach to multimodal learning and present the novel Gramian Representation Alignment Measure (GRAM), which overcomes the above-mentioned limitations. GRAM learns and then aligns $n$ modalities directly in the higher-dimensional space in which modality embeddings lie by minimizing the Gramian volume of the $k$-dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously. GRAM can replace cosine similarity in any downstream method, holds for 2 to $n$ modalities, and provides more meaningful alignment with respect to previous similarity measures. The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance in downstream tasks such as video-audio-text retrieval and audio-video classification. The project page, the code, and the pretrained models are available at this https URL.
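Reading the Gramian volume as the volume of the parallelotope spanned by the (normalized) modality embeddings, it can be computed as the square root of the determinant of the Gram matrix. The numpy sketch below follows that reading with random stand-in embeddings; it is not the authors' code, and the contrastive loss built on top of this measure is omitted.

    import numpy as np

    def gram_volume(embeddings):
        """Volume of the parallelotope spanned by k modality embeddings.

        embeddings: array of shape (k, d) with one L2-normalized vector per
        modality; volume = sqrt(det(A A^T)). Smaller volume = better alignment.
        """
        A = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        gram = A @ A.T                                   # k x k Gram matrix
        return float(np.sqrt(max(np.linalg.det(gram), 0.0)))

    rng = np.random.default_rng(0)
    v, a, t = rng.normal(size=(3, 512))                  # video / audio / text embeddings
    print(gram_volume(np.stack([v, a, t])))              # near 1: modalities nearly orthogonal
    print(gram_volume(np.stack([v, v + 0.01 * a, v + 0.01 * t])))  # near 0: modalities aligned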
https://arxiv.org/abs/2412.11959
Audio-visual Zero-Shot Learning (ZSL) has attracted significant attention for its ability to identify unseen classes and perform well in video classification tasks. However, modal imbalance in (G)ZSL leads to over-reliance on the optimal modality, reducing discriminative capabilities for unseen classes. Some studies have attempted to address this issue by modifying parameter gradients, but two challenges still remain: (a) quality discrepancies, where modalities offer differing quantities and qualities of information for the same concept; and (b) content discrepancies, where sample contributions within a modality vary significantly. To address these challenges, we propose a Discrepancy-Aware Attention Network (DAAN) for enhanced audio-visual ZSL. Our approach introduces a Quality-Discrepancy Mitigation Attention (QDMA) unit to minimize redundant information in the high-quality modality and a Contrastive Sample-level Gradient Modulation (CSGM) block to adjust gradient magnitudes and balance content discrepancies. We quantify modality contributions by integrating optimization and convergence rate for more precise gradient modulation in CSGM. Experiments demonstrate that DAAN achieves state-of-the-art performance on benchmark datasets, with ablation studies validating the effectiveness of the individual modules.
https://arxiv.org/abs/2412.11715
In recent years, Spiking Neural Networks (SNNs) have gathered significant interest due to their temporal understanding capabilities. This work introduces, to the best of our knowledge, the first cortical-column-like hybrid architecture for the time-series data classification task that leverages SNNs and draws inspiration both from the structure of the brain and from previous hybrid models. We introduce several encoding methods to use with this model. Finally, we develop a procedure for training this network on the training dataset. In an effort to make these models simpler to use, we make all implementations available to the public.
https://arxiv.org/abs/2412.00237
Detecting mixed-critical events through computer vision is challenging due to the need for contextual understanding to assess event criticality accurately. Mixed critical events, such as fires of varying severity or traffic incidents, demand adaptable systems that can interpret context to trigger appropriate responses. This paper addresses these challenges by proposing a versatile detection system for smart city applications, offering a solution tested across traffic and fire detection scenarios. Our contributions include an analysis of detection requirements and the development of a system adaptable to diverse applications, advancing automated surveillance for smart cities.
https://arxiv.org/abs/2411.15773
The advancements in large language models (LLMs) have propelled improvements in video understanding tasks by combining LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent attempts tackle long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
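One plausible reading of the adaptive cross-modality memory reduction is sketched below: cached video tokens are scored by their similarity to the text query and only the most relevant ones are kept. This is an illustration of the general idea under that assumption, not the paper's exact procedure; shapes and the keep budget are placeholders.

    import torch

    def reduce_video_memory(video_tokens, text_query, keep=64):
        """Keep the video tokens most relevant to the text query (illustrative only).

        video_tokens: (T, d) cached visual tokens; text_query: (q, d) text embeddings.
        """
        scores = (text_query @ video_tokens.T).max(dim=0).values   # (T,) relevance per token
        idx = scores.topk(min(keep, video_tokens.shape[0])).indices
        return video_tokens[idx.sort().values]           # keep temporal order

    memory = reduce_video_memory(torch.randn(4096, 256), torch.randn(16, 256), keep=64)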
https://arxiv.org/abs/2411.12593
We present Attend-Fusion, a novel and efficient approach for audio-visual fusion in video classification tasks. Our method addresses the challenge of exploiting both audio and visual modalities while maintaining a compact model architecture. Through extensive experiments on the YouTube-8M dataset, we demonstrate that our Attend-Fusion achieves competitive performance with significantly reduced model complexity compared to larger baseline models.
https://arxiv.org/abs/2411.05603
As violent crimes continue to happen, it becomes necessary to have security cameras that can rapidly identify moments of violence with excellent accuracy. The purpose of this study is to identify how many frames should be analyzed at a time, treated as the temporal-depth parameter of a 3D convolutional network, in order to optimize a violence detection model's accuracy. Previous violence classification models have been created, but their application to live footage may be flawed. In this project, a convolutional neural network was created to analyze optical flow frames of each video. The number of frames analyzed at a time was varied among one, two, three, ten, and twenty frames, and each model was trained for 20 epochs. The greatest validation accuracy was 94.87% and occurred with the model that analyzed three frames at a time. This means that machine learning models to detect violence may function better when analyzing three frames at a time for this dataset. The methodology used to identify the optimal number of frames to analyze at a time could be used in other applications of video classification, especially those involving complex or abstract actions, such as violence.
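The preprocessing step described here, computing dense optical flow between consecutive frames and stacking a small number of them into one sample, can be sketched with OpenCV as follows. This is a generic illustration, not the study's exact pipeline; the frame size and Farneback parameters are typical defaults.

    import cv2
    import numpy as np

    def optical_flow_stack(frames, n=3):
        """Compute dense optical flow for the first n frame pairs and stack them.

        frames: list of grayscale uint8 images of identical size.
        Returns an array of shape (n, H, W, 2) with (dx, dy) per pixel.
        """
        flows = []
        for prev, nxt in zip(frames[:n], frames[1:n + 1]):
            flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            flows.append(flow)
        return np.stack(flows)

    frames = [np.random.randint(0, 256, (120, 160), np.uint8) for _ in range(4)]
    stack = optical_flow_stack(frames, n=3)              # fed to the network as one sample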
https://arxiv.org/abs/2411.01348
As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when used in combination with transformers. However, the application of token merging to long-form video processing is not trivial. We begin with the premise that token merging should not rely solely on the similarity of video tokens; the saliency of tokens should also be considered. To address this, we explore various video token merging strategies for long-form video classification, starting with a simple extension of image token merging, moving to region-concentrated merging, and finally proposing a learnable video token merging (VTM) algorithm that dynamically merges tokens based on their saliency. Extensive experimental results show that we achieve better or comparable performance on the LVU, COIN, and Breakfast datasets. Moreover, our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
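A hedged sketch of the underlying intuition: merge the most similar token pairs first, but penalize merging tokens with high saliency so that important content survives. The learnable VTM algorithm itself is not reproduced; the greedy pairing and the saliency scores below are placeholders.

    import torch

    def merge_tokens(tokens, saliency, n_merge=16):
        """Average the most similar low-saliency token pairs (illustrative only).

        tokens: (N, d) video tokens; saliency: (N,) with higher = more important.
        """
        x = torch.nn.functional.normalize(tokens, dim=1)
        cost = x @ x.T - saliency[None, :] - saliency[:, None]   # similar + unimportant merge first
        merged, used = [], set()
        for flat in cost.flatten().argsort(descending=True):
            if len(merged) >= n_merge:
                break
            i, j = divmod(int(flat), len(tokens))
            if i == j or i in used or j in used:
                continue
            merged.append((tokens[i] + tokens[j]) / 2)
            used |= {i, j}
        kept = [t for k, t in enumerate(tokens) if k not in used]
        return torch.stack(kept + merged)                # N - n_merge tokens remain

    out = merge_tokens(torch.randn(196, 256), torch.rand(196), n_merge=16)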
https://arxiv.org/abs/2410.23782
Dense features, customized for different business scenarios, are essential in short video classification. However, their complexity, specific adaptation requirements, and high computational costs make them resource-intensive and less accessible during online inference. Consequently, these dense features are categorized as `Privileged Dense Features'. Meanwhile, end-to-end multi-modal models have shown promising results in numerous computer vision tasks. In industrial applications, prioritizing end-to-end multi-modal features can enhance efficiency but often leads to the loss of valuable information from historical privileged dense features. To integrate both features while maintaining efficiency and manageable resource costs, we present Confidence-aware Privileged Feature Distillation (CPFD), which empowers the features of an end-to-end multi-modal model by adaptively distilling privileged features during training. Unlike existing privileged feature distillation (PFD) methods, which apply uniform weights to all instances during distillation, potentially causing unstable performance across different business scenarios and a notable performance gap between the teacher model (dense-feature-enhanced multimodal model DF-X-VLM) and the student model (multimodal-model-only X-VLM), our CPFD leverages confidence scores derived from the teacher model to adaptively mitigate the performance variance of the student model. We conducted extensive offline experiments on five diverse tasks demonstrating that CPFD improves the video classification F1 score by 6.76% compared with the end-to-end multimodal model (X-VLM) and by 2.31% compared with vanilla PFD on average. It reduces the performance gap by 84.6% and achieves results comparable to the teacher model DF-X-VLM. The effectiveness of CPFD is further substantiated by online experiments, and our framework has been deployed in production systems for over a dozen models.
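A minimal sketch of the confidence-aware idea, assuming the standard setup where a per-instance distillation term is weighted by the teacher's softmax confidence: this is an illustration of the mechanism, not the production CPFD objective, and the logits here are random placeholders.

    import torch
    import torch.nn.functional as F

    def confidence_weighted_distillation(student_logits, teacher_logits, labels, tau=2.0):
        """Per-instance KD loss weighted by teacher confidence (illustrative only)."""
        kd = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                      F.softmax(teacher_logits / tau, dim=1),
                      reduction="none").sum(dim=1) * tau * tau   # (B,) per-sample KD term
        confidence = F.softmax(teacher_logits, dim=1).max(dim=1).values   # (B,) teacher confidence
        ce = F.cross_entropy(student_logits, labels)
        return ce + (confidence * kd).mean()             # trust the teacher more when it is confident

    loss = confidence_weighted_distillation(torch.randn(8, 30), torch.randn(8, 30),
                                            torch.randint(0, 30, (8,)))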
https://arxiv.org/abs/2410.03038
The rise of short-form videos on platforms like TikTok has brought new challenges in safeguarding young viewers from inappropriate content. Traditional moderation methods often fall short in handling the vast and rapidly changing landscape of user-generated videos, increasing the risk of children encountering harmful material. This paper introduces TikGuard, a transformer-based deep learning approach aimed at detecting and flagging content unsuitable for children on TikTok. By using a specially curated dataset, TikHarm, and leveraging advanced video classification techniques, TikGuard achieves an accuracy of 86.7%, showing a notable improvement over existing methods in similar contexts. While direct comparisons are limited by the uniqueness of the TikHarm dataset, TikGuard's performance highlights its potential in enhancing content moderation, contributing to a safer online experience for minors. This study underscores the effectiveness of transformer models in video classification and sets a foundation for future research in this area.
https://arxiv.org/abs/2410.00403
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads on these platforms can vary significantly, especially when it comes to parallel processing, which is a critical consideration for edge deployments. To address this, we conduct a comprehensive study comparing the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions. We find that the Neural Processing Unit (NPU) excels in matrix-vector multiplication (58.6% faster) and some neural network tasks (3.2$\times$ faster for video classification and large language models). The GPU outperforms in matrix multiplication (22.6% faster) and LSTM networks (2.7$\times$ faster), while the CPU excels at less parallel operations such as the dot product. NPU-based inference offers a balance of latency and throughput at lower power consumption. GPU-based inference, though more energy-intensive, performs best with large dimensions and batch sizes. We highlight the potential of heterogeneous computing solutions for edge AI, where diverse compute units can be strategically leveraged to boost accurate and real-time inference.
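A minimal timing harness in the spirit of this comparison is shown below, measuring matrix-vector and matrix-matrix latency on the CPU or a GPU if one is available. It is not the paper's benchmark suite, and NPU backends require vendor-specific runtimes that are not covered here.

    import time
    import torch

    def bench(fn, warmup=3, iters=20):
        """Median wall-clock latency of fn() in milliseconds."""
        for _ in range(warmup):
            fn()
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            if torch.cuda.is_available():
                torch.cuda.synchronize()                 # wait for asynchronous GPU kernels
            times.append((time.perf_counter() - t0) * 1e3)
        return sorted(times)[len(times) // 2]

    device = "cuda" if torch.cuda.is_available() else "cpu"
    A = torch.randn(2048, 2048, device=device)
    v = torch.randn(2048, device=device)
    print("matvec ms:", bench(lambda: A @ v))
    print("matmul ms:", bench(lambda: A @ A))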
https://arxiv.org/abs/2409.14803
Event cameras offer low-power visual sensing capabilities ideal for edge-device applications. However, their high event rate, driven by high temporal detail, can be restrictive in terms of bandwidth and computational resources. In edge AI applications, determining the minimum number of events needed for specific tasks allows the event rate to be reduced, improving bandwidth, memory, and processing efficiency. In this paper, we study the effect of event subsampling on the accuracy of event data classification using convolutional neural network (CNN) models. Surprisingly, across various datasets, the number of events per video can be reduced by an order of magnitude with little drop in accuracy, revealing the extent to which we can push the boundaries of the accuracy vs. event rate trade-off. Additionally, we find that lower classification accuracy at high subsampling rates is not solely attributable to information loss due to the subsampling of the events; rather, the training of CNNs can be challenging in highly subsampled scenarios, where the sensitivity to hyperparameters increases. We quantify training instability across multiple event-based classification datasets using a novel metric for evaluating the hyperparameter sensitivity of CNNs in different subsampling settings. Finally, we analyze the weight gradients of the network to gain insight into this instability.
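The subsampling operation itself is simple; the sketch below randomly keeps a fraction of an event stream and accumulates the surviving events into a count frame for a CNN. The (t, x, y, polarity) layout and resolution are common conventions assumed for illustration, not the paper's loader.

    import numpy as np

    def subsample_events(events, factor=10, seed=0):
        """Randomly keep roughly 1/factor of the events (rows of t, x, y, polarity)."""
        rng = np.random.default_rng(seed)
        keep = rng.random(len(events)) < 1.0 / factor
        return events[keep]

    def events_to_frame(events, height=128, width=128):
        """Accumulate event counts into a single 2D frame for a CNN."""
        frame = np.zeros((height, width), np.float32)
        np.add.at(frame, (events[:, 2].astype(int), events[:, 1].astype(int)), 1.0)
        return frame

    events = np.column_stack([np.sort(np.random.rand(100000)),        # t
                              np.random.randint(0, 128, 100000),      # x
                              np.random.randint(0, 128, 100000),      # y
                              np.random.choice([-1, 1], 100000)])     # polarity
    frame = events_to_frame(subsample_events(events, factor=10))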
https://arxiv.org/abs/2409.08953
Given a video with $T$ frames, frame sampling is the task of selecting $N \ll T$ frames so as to maximize the performance of a fixed video classifier. Not only brute-force search but also most existing methods suffer from the vast search space of $\binom{T}{N}$, especially when $N$ gets large. To address this challenge, we introduce a novel perspective that reduces the search space from $O(T^N)$ to $O(T)$. Instead of exploring the entire $O(T^N)$ space, our proposed semi-optimal policy selects the top $N$ frames based on the independently estimated value of each frame using per-frame confidence, significantly reducing the computational complexity. We verify that our semi-optimal policy can efficiently approximate the optimal policy, particularly under practical settings. Additionally, through extensive experiments on various datasets and model architectures, we demonstrate that learning our semi-optimal policy ensures stable and high performance regardless of the size of $N$ and $T$.
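In code, the semi-optimal policy reduces to ranking frames by an independently estimated per-frame confidence and keeping the top $N$, avoiding any search over the $\binom{T}{N}$ subsets. The sketch below uses random scores in place of a learned per-frame scorer.

    import numpy as np

    def semi_optimal_sample(frame_scores, n):
        """Pick the top-n frames by per-frame confidence, preserving temporal order.

        frame_scores: (T,) independently estimated value of each frame.
        The sort costs O(T log T) instead of searching all C(T, n) subsets.
        """
        top = np.argsort(frame_scores)[-n:]              # indices of the n highest-value frames
        return np.sort(top)

    scores = np.random.rand(300)                         # stand-in per-frame confidences
    selected = semi_optimal_sample(scores, n=8)          # frames fed to the fixed classifier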
https://arxiv.org/abs/2409.05260
Spiking Neural Networks (SNNs) have emerged as a compelling, energy-efficient alternative to traditional Artificial Neural Networks (ANNs) for static image tasks such as image classification and segmentation. However, in the more complex video classification domain, SNN-based methods fall considerably short of ANN-based benchmarks due to the challenges in processing dense frame sequences. To bridge this gap, we propose ReSpike, a hybrid framework that synergizes the strengths of ANNs and SNNs to tackle action recognition tasks with high accuracy and low energy cost. By decomposing video clips into spatial and temporal components, i.e., RGB image key frames and event-like residual frames, ReSpike leverages the ANN to learn spatial information and the SNN to learn temporal information. In addition, we propose a multi-scale cross-attention mechanism for effective feature fusion. Compared to state-of-the-art SNN baselines, our ReSpike hybrid architecture demonstrates significant performance improvements (e.g., >30% absolute accuracy improvement on HMDB-51, UCF-101, and Kinetics-400). Furthermore, ReSpike achieves performance comparable to prior ANN approaches while offering a better accuracy-energy tradeoff.
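The spatial/temporal decomposition described above can be sketched as follows: RGB key frames are subsampled for the ANN branch, and event-like residual frames are obtained by thresholding consecutive-frame differences for the SNN branch. The sampling stride and threshold are placeholders, not ReSpike's settings.

    import numpy as np

    def decompose_clip(clip, key_every=8, threshold=20):
        """Split a clip into RGB key frames and event-like residual frames.

        clip: (T, H, W, 3) uint8 video. Residuals are signed frame differences,
        binarized by a magnitude threshold to mimic sparse event data.
        """
        key_frames = clip[::key_every]                   # spatial stream (for the ANN)
        diffs = clip[1:].astype(np.int16) - clip[:-1].astype(np.int16)
        residuals = (np.abs(diffs).max(axis=-1) > threshold).astype(np.float32)
        return key_frames, residuals                     # temporal stream (for the SNN)

    clip = np.random.randint(0, 256, (32, 112, 112, 3), np.uint8)
    keys, residuals = decompose_clip(clip)               # shapes (4, 112, 112, 3) and (31, 112, 112)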
https://arxiv.org/abs/2409.01564