While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
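To make the sampling idea concrete, below is a minimal NumPy sketch of KTS-style change-point detection: frame features define a linear-kernel Gram matrix, and dynamic programming picks segment boundaries that minimize within-segment scatter. The fixed segment count, linear kernel, and function names are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def kts_segments(features: np.ndarray, n_segments: int) -> list:
    """Toy kernel temporal segmentation: split T frame features into
    n_segments change-point-delimited segments by minimising within-segment
    scatter of a linear kernel (a simplified stand-in for full KTS)."""
    T = features.shape[0]
    K = features @ features.T                      # T x T Gram matrix
    diag_cum = np.concatenate(([0.0], np.cumsum(np.diag(K))))
    block = np.zeros((T + 1, T + 1))               # 2-D prefix sums of K
    block[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)

    def cost(i, j):                                # scatter of frames [i, j)
        sum_diag = diag_cum[j] - diag_cum[i]
        sum_block = block[j, j] - block[i, j] - block[j, i] + block[i, i]
        return sum_diag - sum_block / (j - i)

    # dp[m][t] = best cost of covering frames [0, t) with m segments
    dp = np.full((n_segments + 1, T + 1), np.inf)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for m in range(1, n_segments + 1):
        for t in range(m, T + 1):
            for s in range(m - 1, t):
                c = dp[m - 1, s] + cost(s, t)
                if c < dp[m, t]:
                    dp[m, t], back[m, t] = c, s
    # Recover the change points by backtracking
    bounds, t = [], T
    for m in range(n_segments, 0, -1):
        t = int(back[m, t])
        bounds.append(t)
    return sorted(bounds[:-1])                     # drop the leading 0 boundary

# Example: 120 random "frame features" split into 4 segments
print("change points:", kts_segments(np.random.randn(120, 64), n_segments=4))
```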
https://arxiv.org/abs/2309.11569
Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
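A hedged sketch of the zero-shot pipeline described above: textual proxies for sight and hearing are assembled into a prompt and handed to an LLM. The prompt wording and the `llm_complete` callable are placeholders; in the paper the descriptions would come from BLIP-2, Whisper, and ImageBind, and the LLM from GPT-3.5 or Llama2.

```python
# Hedged sketch: zero-shot video classification by prompting an LLM with
# textual proxies for sight and hearing. The caption/transcript strings and
# the `llm_complete` call are placeholders, not the paper's exact pipeline.

def build_prompt(frame_captions, audio_transcript, audio_tags, labels):
    return (
        "You are given textual descriptions of a video.\n"
        f"Visual captions: {'; '.join(frame_captions)}\n"
        f"Speech transcript: {audio_transcript}\n"
        f"Audio tags: {', '.join(audio_tags)}\n"
        f"Choose the single best action label from: {', '.join(labels)}.\n"
        "Answer with the label only."
    )

def classify_video(frame_captions, audio_transcript, audio_tags, labels, llm_complete):
    prompt = build_prompt(frame_captions, audio_transcript, audio_tags, labels)
    answer = llm_complete(prompt).strip().lower()
    # Fall back to the first label if the LLM answers out of vocabulary.
    return next((l for l in labels if l.lower() in answer), labels[0])

if __name__ == "__main__":
    labels = ["playing guitar", "surfing", "typing"]
    fake_llm = lambda prompt: "playing guitar"     # stand-in for a real LLM call
    print(classify_video(["a man strums a guitar on a stage"],
                         "thank you all for coming tonight",
                         ["acoustic guitar", "applause"],
                         labels, fake_llm))
```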
https://arxiv.org/abs/2309.10783
Learning high-quality video representation has shown significant applications in computer vision and remains challenging. Previous work based on masked autoencoders such as ImageMAE and VideoMAE has proven the effectiveness of learning representations in images and videos through a reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly in scenarios where extracting features solely from the visual modality proves challenging, such as when dealing with low-resolution and blurry original videos. Based on this, we propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information. Our approach addresses this challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our video classification results on the UCF101 dataset outperform existing work and reach the state of the art, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.
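The sketch below illustrates the general idea of audio-assisted masked reconstruction, assuming precomputed video patch tokens and a clip-level audio feature: visible tokens plus a projected audio embedding are encoded, and masked tokens are reconstructed from them. Layer sizes, the mask ratio, and the module layout are illustrative, not the AV-MaskEnhancer architecture.

```python
import torch
import torch.nn as nn

class ToyAVMaskedAutoencoder(nn.Module):
    """Illustrative audio-assisted masked autoencoder: visible video tokens are
    encoded together with an audio embedding, and a light decoder reconstructs
    the masked tokens. Sizes and layers are illustrative only."""
    def __init__(self, dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.audio_proj = nn.Linear(128, dim)       # 128-d audio feature assumed
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.head = nn.Linear(dim, dim)             # reconstruct token features

    def forward(self, video_tokens, audio_feat):
        B, N, D = video_tokens.shape
        n_keep = int(N * (1 - self.mask_ratio))
        ids = torch.argsort(torch.rand(B, N, device=video_tokens.device), dim=1)
        keep, masked = ids[:, :n_keep], ids[:, n_keep:]
        visible = torch.gather(video_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        audio = self.audio_proj(audio_feat).unsqueeze(1)        # (B, 1, D)
        enc = self.encoder(torch.cat([visible, audio], dim=1))
        dec_in = torch.cat([enc, self.mask_token.expand(B, masked.shape[1], D)], dim=1)
        dec = self.decoder(dec_in)
        pred = self.head(dec[:, -masked.shape[1]:])             # masked positions
        target = torch.gather(video_tokens, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        return nn.functional.mse_loss(pred, target)

loss = ToyAVMaskedAutoencoder()(torch.randn(2, 196, 256), torch.randn(2, 128))
print(float(loss))
```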
https://arxiv.org/abs/2309.08738
Optimizing video inference efficiency has become increasingly important with the growing demand for video analysis in various fields. Some existing methods achieve high efficiency by explicitly discarding spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with a Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features, refining and updating the features into a high-low resolution video sequence. To process the new sequence, we introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions, while reducing spatial computation costs quadratically by utilizing fewer spatial tokens in low-resolution non-saliency frames. The entire network can be end-to-end optimized via the integration of the differentiable compression module. Experimental results show that our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared to state-of-the-art methods. Code: this https URL
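As a rough illustration of making the compression decision differentiable, the toy module below scores each frame and draws a salient/non-salient choice with Gumbel-softmax; non-salient frames are replaced by a pooled, lower-resolution version of their tokens. In the actual method non-salient frames carry fewer tokens, whereas here they are pooled in place to keep the sketch short; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySaliencyCompressor(nn.Module):
    """Scores each frame and draws a differentiable salient / non-salient
    decision with Gumbel-softmax; non-salient frames get average-pooled
    (lower-resolution) tokens. A toy stand-in for the paper's module."""
    def __init__(self, dim=192):
        super().__init__()
        self.scorer = nn.Linear(dim, 2)            # logits: [non-salient, salient]

    def forward(self, frames):                     # (B, T, P, D) patch tokens
        B, T, P, D = frames.shape
        logits = self.scorer(frames.mean(dim=2))   # (B, T, 2), pooled per frame
        decision = F.gumbel_softmax(logits, tau=1.0, hard=True)  # (B, T, 2)
        salient = decision[..., 1:2].unsqueeze(-1)               # (B, T, 1, 1)
        # Low-resolution version: pool 4 neighbouring tokens into 1, then repeat back
        low = frames.view(B, T, P // 4, 4, D).mean(dim=3)
        low = low.repeat_interleave(4, dim=2)
        return salient * frames + (1 - salient) * low            # (B, T, P, D)

out = ToySaliencyCompressor()(torch.randn(2, 8, 16, 192))
print(out.shape)
```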
https://arxiv.org/abs/2309.08167
Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e. the VGGSound-FSL, UCF-FSL, ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at this https URL.
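A minimal sketch of the cross-modal attention fusion step (the text-to-feature diffusion generator is omitted): audio tokens attend to video tokens and vice versa before pooling. Dimensions and the pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal cross-modal attention block: audio tokens attend to video tokens
    and vice versa, then the two streams are mean-pooled and concatenated.
    A sketch of the fusion step only; the diffusion generator is not shown."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):               # (B, Ta, D), (B, Tv, D)
        a, _ = self.a2v(audio, video, video)       # audio queries attend to video
        v, _ = self.v2a(video, audio, audio)       # video queries attend to audio
        return torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)   # (B, 2D)

fused = CrossModalFusion()(torch.randn(2, 10, 256), torch.randn(2, 16, 256))
print(fused.shape)   # torch.Size([2, 512])
```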
https://arxiv.org/abs/2309.03869
Misinformation on YouTube is a significant concern, necessitating robust detection strategies. In this paper, we introduce a novel methodology for video classification, focusing on the veracity of the content. We convert the conventional video classification task into a text classification task by leveraging the textual content derived from the video transcripts. We employ advanced machine learning techniques like transfer learning to solve the classification challenge. Our approach incorporates two forms of transfer learning: (a) fine-tuning base transformer models such as BERT, RoBERTa, and ELECTRA, and (b) few-shot learning using sentence-transformers MPNet and RoBERTa-large. We apply the trained models to three datasets: (a) YouTube Vaccine-misinformation related videos, (b) YouTube Pseudoscience videos, and (c) Fake-News dataset (a collection of articles). Including the Fake-News dataset extended the evaluation of our approach beyond YouTube videos. Using these datasets, we evaluated the models distinguishing valid information from misinformation. The fine-tuned models yielded a Matthews correlation coefficient > 0.81, accuracy > 0.90, and F1 score > 0.90 in two of three datasets. Interestingly, the few-shot models outperformed the fine-tuned ones by 20% in both accuracy and F1 score for the YouTube Pseudoscience dataset, highlighting the potential utility of this approach -- especially in the context of limited training data.
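For the few-shot route, a plausible minimal setup is to embed transcripts with a sentence-transformer and fit a small classifier on the handful of labelled examples; the sketch below does this with MPNet embeddings and logistic regression. The model name, classifier head, and example transcripts are illustrative, not the paper's exact configuration.

```python
# Sketch of the few-shot route: embed video transcripts with a
# sentence-transformer and fit a small classifier on the few labelled examples.
# The model name and the logistic-regression head are illustrative choices.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-mpnet-base-v2")

train_texts = [
    "vaccines contain microchips that track you",         # misinformation
    "the CDC recommends two doses for full protection",   # valid
]
train_labels = [1, 0]                                      # 1 = misinformation

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(train_texts), train_labels)

test_transcript = "this video explains how mRNA vaccines were tested in trials"
print(clf.predict(encoder.encode([test_transcript])))      # expected: [0]
```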
https://arxiv.org/abs/2307.12155
Early diagnosis of renal cancer can greatly improve the survival rate of patients. Contrast-enhanced ultrasound (CEUS) is a cost-effective and non-invasive imaging technique and has become more and more frequently used for renal tumor diagnosis. However, the classification of benign and malignant renal tumors can still be very challenging due to the highly heterogeneous appearance of cancer and imaging artifacts. Our aim is to detect and classify renal tumors by integrating B-mode and CEUS-mode ultrasound videos. To this end, we propose a novel multi-modal ultrasound video fusion network that can effectively perform multi-modal feature fusion and video classification for renal tumor diagnosis. The attention-based multi-modal fusion module uses cross-attention and self-attention to extract modality-invariant features and modality-specific features in parallel. In addition, we design an object-level temporal aggregation (OTA) module that can automatically filter low-quality features and efficiently integrate temporal information from multiple frames to improve the accuracy of tumor diagnosis. Experimental results on a multicenter dataset show that the proposed framework outperforms the single-modal models and the competing methods. Furthermore, our OTA module achieves higher classification accuracy than the frame-level predictions. Our code is available at this https URL.
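A toy version of the parallel attention-based fusion is sketched below: self-attention branches keep modality-specific B-mode and CEUS features while a cross-attention branch extracts shared ones. Dimensions are illustrative and the object-level temporal aggregation module is omitted.

```python
import torch
import torch.nn as nn

class ToyBModeCEUSFusion(nn.Module):
    """Parallel fusion sketch: self-attention keeps modality-specific features,
    cross-attention extracts shared ones; outputs are concatenated. Dimensions
    are illustrative and the OTA module is not shown."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_c = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, b_mode, ceus):               # (B, N, D) token sequences
        spec_b, _ = self.self_b(b_mode, b_mode, b_mode)
        spec_c, _ = self.self_c(ceus, ceus, ceus)
        shared, _ = self.cross(b_mode, ceus, ceus) # B-mode queries CEUS
        return torch.cat([spec_b, spec_c, shared], dim=-1).mean(dim=1)  # (B, 3D)

out = ToyBModeCEUSFusion()(torch.randn(2, 32, 128), torch.randn(2, 32, 128))
print(out.shape)   # torch.Size([2, 384])
```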
https://arxiv.org/abs/2307.07807
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
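The core packing idea can be sketched in a few lines: variable-length patch-token sequences from images of different resolutions are greedily concatenated up to a token budget, with a mask that blocks attention across images. The greedy strategy and token budget are simplifying assumptions, not NaViT's exact packing algorithm.

```python
import torch

def pack_sequences(token_seqs, max_len):
    """Greedy example packing: concatenate variable-length patch-token sequences
    into packs of at most max_len tokens, returning an attention mask that only
    allows attention within the same image. A toy sketch, not NaViT's packer."""
    packs, current, ids = [], [], []
    for img_id, seq in enumerate(token_seqs):              # seq: (n_i, D)
        if current and sum(s.shape[0] for s in current) + seq.shape[0] > max_len:
            packs.append((current, ids))
            current, ids = [], []
        current.append(seq)
        ids.append(img_id)
    if current:
        packs.append((current, ids))

    batches = []
    for seqs, img_ids in packs:
        tokens = torch.cat(seqs, dim=0)                    # (L, D), L <= max_len
        owner = torch.cat([torch.full((s.shape[0],), i) for i, s in zip(img_ids, seqs)])
        mask = owner[:, None] == owner[None, :]            # intra-image attention only
        batches.append((tokens, mask))
    return batches

# Three "images" with different resolutions -> different numbers of patch tokens
seqs = [torch.randn(n, 64) for n in (49, 100, 196)]
for tokens, mask in pack_sequences(seqs, max_len=256):
    print(tokens.shape, mask.shape)
```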
https://arxiv.org/abs/2307.06304
In the context of label-efficient learning on video data, the distillation method and the structural design of the teacher-student architecture have a significant impact on knowledge distillation. However, the relationship between these factors has been overlooked in previous research. To address this gap, we propose a new weakly supervised learning framework for knowledge distillation in video classification that is designed to improve the efficiency and accuracy of the student model. Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages. We also employ the progressive cascade training method to address the accuracy loss caused by the large capacity gap between the teacher and the student. Additionally, we propose a pseudo-label optimization strategy to improve the initial data label. To optimize the loss functions of different distillation substages during the training process, we introduce a new loss method based on feature distribution. We conduct extensive experiments on both real and simulated data sets, demonstrating that our proposed approach outperforms existing distillation methods in terms of knowledge distillation for video classification tasks. Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
https://arxiv.org/abs/2307.05201
Deep learning algorithms have pushed the boundaries of computer vision research and have depicted commendable performance in a variety of applications. However, training a robust deep neural network necessitates a large amount of labeled training data, acquiring which involves significant time and human effort. This problem is even more serious for an application like video classification, where a human annotator has to watch an entire video end-to-end to furnish a label. Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data; this tremendously reduces the human annotation effort in inducing a machine learning model, as only the few samples that are identified by the algorithm, need to be labeled manually. In this paper, we propose a novel active learning framework for video classification, with the goal of further reducing the labeling onus on the human annotators. Our framework identifies a batch of exemplar videos, together with a set of informative frames for each video; the human annotator needs to merely review the frames and provide a label for each video. This involves much less manual work than watching the complete video to come up with a label. We formulate a criterion based on uncertainty and diversity to identify the informative videos and exploit representative sampling techniques to extract a set of exemplar frames from each video. To the best of our knowledge, this is the first research effort to develop an active learning framework for video classification, where the annotators need to inspect only a few frames to produce a label, rather than watching the end-to-end video.
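A hedged sketch of an uncertainty-plus-diversity acquisition step and of representative frame extraction is given below, using predictive entropy, k-means clusters over video embeddings, and centroid-nearest frames. The concrete scoring and clustering choices are stand-ins for the paper's criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_videos(probs, embeddings, budget):
    """Pick `budget` videos that are both uncertain (high predictive entropy)
    and diverse (one per k-means cluster of the video embeddings). A toy
    stand-in for the paper's acquisition criterion."""
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=1)
    clusters = KMeans(n_clusters=budget, n_init=10).fit_predict(embeddings)
    chosen = []
    for c in range(budget):
        members = np.where(clusters == c)[0]
        chosen.append(int(members[np.argmax(entropy[members])]))  # most uncertain per cluster
    return chosen

def exemplar_frames(frame_feats, n_frames=4):
    """Pick representative frames of one video as the frames closest to the
    k-means centroids of its frame features."""
    km = KMeans(n_clusters=n_frames, n_init=10).fit(frame_feats)
    dists = ((frame_feats[:, None, :] - km.cluster_centers_[None]) ** 2).sum(-1)
    return sorted(int(dists[:, c].argmin()) for c in range(n_frames))

probs = np.random.dirichlet(np.ones(10), size=100)     # model predictions, 100 videos
emb = np.random.randn(100, 32)                         # video embeddings
print(select_videos(probs, emb, budget=5), exemplar_frames(np.random.randn(60, 32)))
```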
https://arxiv.org/abs/2307.05587
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a feature projection module to project the multidimensional features of different modalities into a common space with rich information reserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to unreliable modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.
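The sketch below shows the flavour of the feature-projection idea: each available modality is projected into a common space, weighted by a learned reliability score (a stand-in for the pseudo-supervision), and accumulated by summation, so missing modalities are simply skipped. Dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class CommonSpaceFusion(nn.Module):
    """Project each available modality into a shared space and accumulate by
    summation, weighted by a per-modality reliability score. Missing modalities
    are skipped, so unseen combinations are handled at inference. Sketch only."""
    def __init__(self, dims=None, common=256):
        super().__init__()
        dims = dims or {"video": 512, "audio": 128, "text": 300}
        self.proj = nn.ModuleDict({m: nn.Linear(d, common) for m, d in dims.items()})
        self.reliability = nn.ModuleDict({m: nn.Linear(d, 1) for m, d in dims.items()})

    def forward(self, inputs):                     # dict: modality -> (B, dim), absent if missing
        fused = 0.0
        for m, x in inputs.items():
            w = torch.sigmoid(self.reliability[m](x))          # (B, 1) reliability weight
            fused = fused + w * self.proj[m](x)
        return fused                               # (B, common)

model = CommonSpaceFusion()
print(model({"video": torch.randn(2, 512), "text": torch.randn(2, 300)}).shape)
```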
https://arxiv.org/abs/2306.12795
Breast ultrasound videos contain richer information than ultrasound images; it is therefore more meaningful to develop video models for this diagnosis task. However, the collection of ultrasound video datasets is much harder. In this paper, we explore the feasibility of enhancing the performance of ultrasound video classification using a static image dataset. To this end, we propose KGA-Net and a coherence loss. KGA-Net adopts both video clips and static images to train the network. The coherence loss uses the feature centers generated by the static images to guide the frame attention in the video model. Our KGA-Net boosts performance on the public BUSV dataset by a large margin. The visualization results of frame attention prove the explainability of our method. The code and model weights of our method will be made publicly available.
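A toy rendering of a coherence-style loss is shown below: the attention-weighted clip feature is pulled toward the feature centre of its class, with centres coming from the static-image branch. The cosine form and tensor shapes are illustrative, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def coherence_loss(frame_feats, frame_attn, image_centers, labels):
    """Toy coherence loss: the attention-weighted video feature of each clip is
    pulled toward the feature centre of its class, where centres come from the
    static-image branch. Shapes: frame_feats (B, T, D), frame_attn (B, T),
    image_centers (C, D), labels (B,). Illustrative, not the paper's exact form."""
    attn = torch.softmax(frame_attn, dim=1).unsqueeze(-1)      # (B, T, 1)
    clip_feat = (attn * frame_feats).sum(dim=1)                # (B, D)
    centers = image_centers[labels]                            # (B, D)
    return 1.0 - F.cosine_similarity(clip_feat, centers, dim=-1).mean()

loss = coherence_loss(torch.randn(4, 16, 128), torch.randn(4, 16),
                      torch.randn(2, 128), torch.tensor([0, 1, 0, 1]))
print(float(loss))
```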
https://arxiv.org/abs/2306.06877
Localization of the narrowest position of the vessel and corresponding vessel and remnant vessel delineation in carotid ultrasound (US) are essential for carotid stenosis grading (CSG) in clinical practice. However, the pipeline is time-consuming and tough due to the ambiguous boundaries of plaque and temporal variation. To automatize this procedure, a large number of manual delineations are usually required, which is not only laborious but also not reliable given the annotation difficulty. In this study, we present the first video classification framework for automatic CSG. Our contribution is three-fold. First, to avoid the requirement of laborious and unreliable annotation, we propose a novel and effective video classification network for weakly-supervised CSG. Second, to ease the model training, we adopt an inflation strategy for the network, where pre-trained 2D convolution weights can be adapted into the 3D counterpart in our network. In this way, the existing pre-trained large model can be used as an effective warm start for our network. Third, to enhance the feature discrimination of the video, we propose a novel attention-guided multi-dimension fusion (AMDF) transformer encoder to model and integrate global dependencies within and across spatial and temporal dimensions, where two lightweight cross-dimensional attention mechanisms are designed. Our approach is extensively validated on a large clinically collected carotid US video dataset, demonstrating state-of-the-art performance compared with strong competitors.
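The inflation strategy mentioned above is a standard trick; the sketch below shows the usual I3D-style version, where a pretrained 2D kernel is repeated along the temporal axis and rescaled so a static video produces the same response as the image model. The exact inflation used in the paper may differ.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_k: int = 3) -> nn.Conv3d:
    """Inflate a pretrained 2D convolution into a 3D one by repeating its kernel
    along the temporal axis and dividing by the temporal size, so a constant
    video gives the same response as the image model (I3D-style inflation)."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_k, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(time_k // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_k, 1, 1) / time_k
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

c2 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
c3 = inflate_conv2d(c2)
print(c3(torch.randn(1, 3, 8, 32, 32)).shape)   # (1, 16, 8, 32, 32)
```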
https://arxiv.org/abs/2306.02548
Today, ship hull inspection, including the examination of the external coating, detection of defects, and other types of external degradation such as corrosion and marine growth, is conducted underwater by means of Remotely Operated Vehicles (ROVs). The inspection process consists of manual video analysis, which is a time-consuming and labor-intensive process. To address this, we propose an automatic video analysis system using deep learning and computer vision to improve upon existing methods that only consider spatial information on individual frames in underwater ship hull video inspection. By exploring the benefits of adding temporal information and analyzing frame-based classifiers, we propose a multi-label video classification model that exploits the self-attention mechanism of transformers to capture spatiotemporal attention in consecutive video frames. Our proposed method has demonstrated promising results and can serve as a benchmark for future research and development in underwater video inspection applications.
https://arxiv.org/abs/2305.17338
Online video platforms receive hundreds of hours of uploads every minute, making manual content moderation impossible. Unfortunately, the most vulnerable consumers of malicious video content are children from ages 1-5 whose attention is easily captured by bursts of color and sound. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos, but include scary and disgusting characters, violent motions, loud music, and disturbing noises. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content on their platform, but these videos often go undetected by current content moderation tools that are focused on removing pornographic or copyrighted content. This paper introduces our toolkit Malicious or Benign for promoting research on automated content moderation of children's videos. We present 1) a customizable annotation tool for videos, 2) a new dataset with difficult-to-detect test cases of malicious content and 3) a benchmark suite of state-of-the-art video classification models.
https://arxiv.org/abs/2305.15551
We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model & task scaling. We conduct extensive empirical studies about IMP and reveal the following key insights: 1) performing gradient descent updates by alternating on diverse heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding. 2) model sparsification with MoE on a single modality-agnostic encoder substantially improves the performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigating the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including image classification, video classification, image-text, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L focusing on video tasks that achieves new state-of-the-art in zero-shot video classification. Our model achieves 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700 zero-shot classification accuracy, improving the previous state-of-the-art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.
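A toy rendering of the alternating-gradient-descent idea follows: each optimisation step updates a shared encoder on one task objective drawn in rotation rather than summing all losses. MoE routing and real input resolutions are not shown, and the per-task losses are placeholders.

```python
import itertools
import torch
import torch.nn as nn

# Toy alternating-gradient-descent loop: each step updates the shared encoder
# on ONE objective, cycling through tasks instead of summing all losses.
# The per-task losses below are random stand-ins; MoE routing is not shown.
encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

def image_cls_loss(model):
    return nn.functional.cross_entropy(model(torch.randn(8, 64)), torch.randint(0, 10, (8,)))

def video_cls_loss(model):
    feats = model(torch.randn(8, 4, 64)).mean(dim=1)          # pool 4 "frames"
    return nn.functional.cross_entropy(feats, torch.randint(0, 10, (8,)))

def contrastive_loss(model):
    a, b = model(torch.randn(8, 64)), model(torch.randn(8, 64))
    return nn.functional.cross_entropy(a @ b.t(), torch.arange(8))

tasks = itertools.cycle([image_cls_loss, video_cls_loss, contrastive_loss])
for step in range(6):
    loss = next(tasks)(encoder)                                # one task per step
    opt.zero_grad(); loss.backward(); opt.step()
    print(step, float(loss))
```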
https://arxiv.org/abs/2305.06324
Hate speech has become one of the most significant issues in modern society, having implications in both the online and the offline world. Due to this, hate speech research has recently gained a lot of traction. However, most of the work has primarily focused on text media, with relatively little work on images and even less on videos. Thus, early-stage automated video moderation techniques are needed to handle the videos that are being uploaded to keep the platform safe and healthy. With a view to detect and remove hateful content from the video sharing platforms, our work focuses on hate video detection using multi-modalities. To this end, we curate ~43 hours of videos from BitChute and manually annotate them as hate or non-hate, along with the frame spans which could explain the labelling decision. To collect the relevant videos we harnessed search keywords from hate lexicons. We observe various cues in images and audio of hateful videos. Further, we build deep learning multi-modal models to classify the hate videos and observe that using all the modalities of the videos improves the overall hate speech detection performance (accuracy=0.798, macro F1-score=0.790) by ~5.7% compared to the best uni-modal model in terms of macro F1 score. In summary, our work takes the first step toward understanding and modeling hateful videos on video hosting platforms such as BitChute.
https://arxiv.org/abs/2305.03915
In this paper, we present a deep learning based multimodal system for classifying daily life videos. To train the system, we propose a two-phase training strategy. In the first training phase (Phase I), we extract the audio and visual (image) data from the original video. We then train the audio data and the visual data with independent deep learning based models. After the training processes, we obtain audio embeddings and visual embeddings by extracting feature maps from the pre-trained deep learning models. In the second training phase (Phase II), we train a fusion layer to combine the audio/visual embeddings and a dense layer to classify the combined embedding into target daily scenes. Our extensive experiments, which were conducted on the benchmark dataset of DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) 2021 Task 1B Development, achieved the best classification accuracies of 80.5%, 91.8%, and 95.3% with only audio data, only visual data, and both audio and visual data, respectively. The highest classification accuracy of 95.3% presents an improvement of 17.9% compared with the DCASE baseline and is highly competitive with state-of-the-art systems.
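Phase II is the simplest part to sketch: precomputed audio and visual embeddings are concatenated, passed through a fusion layer, and classified by a dense head. The embedding sizes, hidden width, and class count below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PhaseTwoFusion(nn.Module):
    """Phase-II sketch: concatenate precomputed audio and visual embeddings,
    fuse with one hidden layer, and classify into scene classes. Embedding
    sizes and the number of classes are illustrative."""
    def __init__(self, audio_dim=1024, visual_dim=2048, n_classes=10):
        super().__init__()
        self.fusion = nn.Sequential(nn.Linear(audio_dim + visual_dim, 512),
                                    nn.ReLU(), nn.Dropout(0.2))
        self.head = nn.Linear(512, n_classes)

    def forward(self, audio_emb, visual_emb):
        return self.head(self.fusion(torch.cat([audio_emb, visual_emb], dim=-1)))

model = PhaseTwoFusion()
logits = model(torch.randn(4, 1024), torch.randn(4, 2048))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (4,)))
print(logits.shape, float(loss))
```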
https://arxiv.org/abs/2305.01476
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it.
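A minimal sketch of contrastive learning with extra hard negatives is given below: each video is scored against its own caption, the other captions in the batch, and K hard-negative captions (in the paper, LLM-generated verb-altered rewrites). The calibration strategy and verb-phrase alignment loss are omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(video_emb, pos_text_emb, hard_neg_emb, tau=0.07):
    """InfoNCE-style loss where each video is matched against its own caption,
    the other captions in the batch, and K extra hard-negative captions.
    Shapes: video_emb (B, D), pos_text_emb (B, D), hard_neg_emb (B, K, D).
    A sketch only, not the paper's full VFC objective."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(pos_text_emb, dim=-1)
    hn = F.normalize(hard_neg_emb, dim=-1)
    batch_logits = v @ t.t() / tau                              # (B, B)
    hard_logits = torch.einsum("bd,bkd->bk", v, hn) / tau       # (B, K)
    logits = torch.cat([batch_logits, hard_logits], dim=1)      # (B, B + K)
    labels = torch.arange(v.shape[0])                           # positive = own caption
    return F.cross_entropy(logits, labels)

loss = contrastive_with_hard_negatives(torch.randn(4, 256), torch.randn(4, 256),
                                        torch.randn(4, 3, 256))
print(float(loss))
```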
https://arxiv.org/abs/2304.06708
Human visual recognition is a sparse process, where only a few salient visual cues are attended to rather than traversing every detail uniformly. However, most current vision networks follow a dense paradigm, processing every single visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate humans' sparse visual recognition in an end-to-end manner. SparseFormer learns to represent images using a highly limited number of tokens (down to 49) in the latent space with a sparse feature sampling procedure instead of processing dense units in the original pixel space. Therefore, SparseFormer circumvents most of dense operations on the image space and has much lower computational costs. Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models while offering better accuracy-throughput tradeoff. Moreover, the design of our network can be easily extended to the video classification with promising performance at lower computational costs. We hope that our work can provide an alternative way for visual modeling and inspire further research on sparse neural architectures. The code will be publicly available at this https URL
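The latent-token idea can be caricatured as follows: a small, fixed set of learned tokens cross-attends to patch features once, and all further computation happens on those few tokens. SparseFormer's adjustable sparse feature sampling is simplified to plain cross-attention here, and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ToyLatentTokenRecognizer(nn.Module):
    """A handful of learned latent tokens cross-attend to patch features and are
    then pooled for classification, so the deep computation scales with the 49
    latent tokens rather than with image resolution. SparseFormer's adjustable
    sparse feature sampling is simplified to plain cross-attention."""
    def __init__(self, dim=192, n_latents=49, n_classes=1000):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, n_latents, dim) * 0.02)
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patch_feats):                # (B, N_patches, D), e.g. N = 196
        q = self.latents.expand(patch_feats.shape[0], -1, -1)
        x, _ = self.cross(q, patch_feats, patch_feats)   # read from the image once
        x = self.block(x)                                # refine the 49 tokens only
        return self.head(x.mean(dim=1))

logits = ToyLatentTokenRecognizer()(torch.randn(2, 196, 192))
print(logits.shape)   # torch.Size([2, 1000])
```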
https://arxiv.org/abs/2304.03768