Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially when an immense volume of video content is being constantly generated. Traditional methods require video decompression to extract pixel-level features like color, texture, and motion, thereby increasing computational and storage demands. Moreover, these methods often suffer from performance degradation on low-quality videos. We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream decoding. To validate our approach, we built a comprehensive dataset comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning 11 distinct categories. Our evaluations indicate precision, accuracy, and recall rates consistently above 80%, many exceeding 90%, and some reaching 99%. The algorithm operates approximately 15,000 times faster than real time for 30fps videos, outperforming the traditional Dynamic Time Warping (DTW) algorithm by seven orders of magnitude.
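To make the core idea concrete — classifying from compressed-stream statistics without decoding any pixels — here is a hedged sketch that demuxes packets with PyAV and derives simple per-packet features. The feature set and classifier are our assumptions for illustration, not the paper's exact algorithm.

```python
# Hedged sketch: video classification from bitstream statistics only, assuming
# PyAV for demuxing (no pixel decoding) and scikit-learn for the classifier.
import av
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bitstream_features(path: str) -> np.ndarray:
    """Per-packet sizes and keyframe flags, collected without decoding frames."""
    sizes, keyframes = [], []
    with av.open(path) as container:
        stream = container.streams.video[0]
        for packet in container.demux(stream):
            if packet.size == 0:        # skip the trailing flush packet
                continue
            sizes.append(packet.size)
            keyframes.append(float(packet.is_keyframe))
    sizes = np.asarray(sizes, dtype=np.float64)
    return np.array([sizes.mean(), sizes.std(), sizes.min(), sizes.max(),
                     np.median(sizes), np.mean(keyframes)])  # keyframe density

# X = np.stack([bitstream_features(p) for p in video_paths]); y = category_labels
# clf = RandomForestClassifier().fit(X, y)
```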
https://arxiv.org/abs/2403.08580
Developing techniques to analyze and detect animal behavior is crucial for the livestock sector, as it enables monitoring of stress and animal welfare and contributes to decision making on the farm. Applications built on such techniques can assist breeders in making decisions to improve production performance and reduce costs, since animal behavior analysis performed by humans is time-consuming and prone to error. Aggressiveness in pigs is an example of a behavior that is studied in order to reduce its impact through animal classification and identification. However, this process is laborious and susceptible to errors, which can be reduced through automation by visually classifying videos captured in a controlled environment. The captured videos can be used for training and, as a result, for classification through computer vision and artificial intelligence, employing neural network techniques. The main techniques utilized in this study are transformer variants, STAM, TimeSformer, and ViViT, as well as convolution-based techniques such as ResNet3D2, ResNet(2+1)D, and CnnLstm. These techniques were employed for pig video classification with the objective of identifying aggressive and non-aggressive behaviors. The various techniques were compared to analyze the contribution of transformers, in addition to the effectiveness of convolutional techniques, in video classification. Performance was evaluated using accuracy, precision, and recall. TimeSformer showed the best results in video classification, with a median accuracy of 0.729.
https://arxiv.org/abs/2403.08528
Previous face forgery detection methods mainly focus on appearance features, which can be easily defeated by sophisticated manipulation. Since the majority of current face manipulation methods generate fake faces from a single frame, without taking frame consistency and coordination into consideration, artifacts in frame sequences are more effective cues for face forgery detection. However, current sequence-based face forgery detection methods use general video classification networks directly, which discard the motion information that is especially discriminative for face manipulation detection. To this end, we propose an effective sequence-based forgery detection framework built on an existing video classification method. To make the motion features more expressive for manipulation detection, we propose an alternative motion consistency block to replace the original motion features module. To make the learned features more generalizable, we propose an auxiliary anomaly detection block. With these two specially designed improvements, we make a general video classification network achieve promising results on three popular face forgery datasets.
https://arxiv.org/abs/2403.05172
An object handover between a robot and a human is a coordinated action which is prone to failure for reasons such as miscommunication, incorrect actions and unexpected object properties. Existing works on handover failure detection and prevention focus on preventing failures due to object slip or external disturbances. However, there is a lack of datasets and evaluation methods that consider unpreventable failures caused by the human participant. To address this deficit, we present the multimodal Handover Failure Detection dataset, which consists of failures induced by the human participant, such as ignoring the robot or not releasing the object. We also present two baseline methods for handover failure detection: (i) a video classification method using 3D CNNs and (ii) a temporal action segmentation approach which jointly classifies the human action, robot action and overall outcome of the action. The results show that video is an important modality, but using force-torque data and gripper position helps improve failure detection and action segmentation accuracy.
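A minimal sketch of baseline (i) follows, assuming a torchvision 3D CNN (r3d_18) with its classification head replaced; the outcome classes and clip shape are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of a 3D-CNN handover-outcome classifier (baseline (i) above).
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

num_outcomes = 3  # e.g., success / robot ignored / object not released (assumed)
model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, num_outcomes)

clip = torch.randn(2, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
logits = model(clip)                      # (2, num_outcomes)
```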
https://arxiv.org/abs/2402.18319
High-quality and consistent annotations are fundamental to the successful development of robust machine learning models. Traditional data annotation methods are resource-intensive and inefficient, often leading to a reliance on third-party annotators who are not domain experts. Hard samples, which are usually the most informative for model training, tend to be difficult to label accurately and consistently without business context. Such samples can arise unpredictably during the annotation process, requiring a variable number of iterations and rounds of feedback and leading to unforeseen expenses and time commitments to guarantee quality. We posit that more direct involvement of domain experts, using a human-in-the-loop system, can resolve many of these practical challenges. We propose a novel framework, which we call Video Annotator (VA), for annotating, managing, and iterating on video classification datasets. Our approach offers a new paradigm for an end-user-centered model development process, enhancing the efficiency, usability, and effectiveness of video classifiers. Uniquely, VA allows for a continuous annotation process, seamlessly integrating data collection and model training. We leverage the zero-shot capabilities of vision-language foundation models combined with active learning techniques, and demonstrate that VA enables the efficient creation of high-quality models. VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline across a wide-ranging assortment of tasks. We release a dataset with 153k labels across 56 video understanding tasks annotated by three professional video editors using VA, and also release code to replicate our experiments at: this http URL.
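To illustrate the annotate-train-select loop such systems use, here is a hedged sketch of one uncertainty-sampling round over precomputed clip embeddings; in a VA-style setup the embeddings would come from a frozen vision-language model, and the linear probe and selection rule below are our assumptions.

```python
# Hedged sketch of one active-learning round over precomputed clip embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(embeddings, labeled_idx, labels, budget=20):
    """Fit a light probe on current labels, then pick the least certain clips."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings[labeled_idx], labels)
    probs = clf.predict_proba(embeddings)[:, 1]
    uncertainty = -np.abs(probs - 0.5)          # highest near the decision boundary
    pool = np.setdiff1d(np.arange(len(embeddings)), labeled_idx)
    return pool[np.argsort(uncertainty[pool])[-budget:]], clf
```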
https://arxiv.org/abs/2402.06560
As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. They typically still require backpropagating gradients throughout the model, however, meaning that their training-time and memory costs do not decrease as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone. As a result, our method is efficient not only in terms of parameters, but also in training time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark, and we further show that we outperform prior works with respect to training time and memory usage too. We further demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification, without any intricate model parallelism. Here, with the same GPU and less training time, we outperform both a prior adapter-based method, which could only scale to a 1-billion-parameter backbone, and full finetuning of a smaller backbone.
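The key mechanism — no gradients flowing through the backbone — can be sketched as follows. We use a frozen torchvision ResNet-18 and its final pooled features purely for illustration; the paper's parallel network reads the backbone's intermediate features, which this sketch simplifies away.

```python
# Hedged sketch: the backbone runs under no_grad, so no autograd graph (and no
# gradient memory) is built for it; only the lightweight head is trained.
import torch
import torch.nn as nn
from torchvision.models import resnet18

num_classes = 10                       # assumed
backbone = resnet18(weights=None)
backbone.fc = nn.Identity()            # expose 512-d pooled features
for p in backbone.parameters():
    p.requires_grad_(False)
backbone.eval()

head = nn.Sequential(nn.LayerNorm(512), nn.Linear(512, 256), nn.GELU(),
                     nn.Linear(256, num_classes))
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

x, y = torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,))
with torch.no_grad():                  # backbone forward only; graph-free
    feats = backbone(x)                # (8, 512)
loss = nn.functional.cross_entropy(head(feats), y)
opt.zero_grad(); loss.backward(); opt.step()   # gradients stop at `feats`
```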
https://arxiv.org/abs/2402.02887
While short-form videos are reshaping the entire social media landscape, experts are exceedingly worried about their depressive impact on viewers, as evidenced by medical studies. To prevent widespread consequences, platforms are eager to predict these videos' impact on viewers' mental health. They can then take intervention measures, such as revising recommendation algorithms and displaying viewer discretion advisories. Nevertheless, applicable predictive methods lack relevance to well-established medical knowledge, which outlines clinically proven external and environmental factors of depression. To account for such medical knowledge, we resort to an emergent methodological discipline, seeded Neural Topic Models (NTMs). However, existing seeded NTMs suffer from the limitations of single-origin topics, unknown topic sources, unclear seed supervision, and suboptimal convergence. To address these challenges, we develop a novel Knowledge-guided Multimodal NTM to predict a short-form video's depressive impact on viewers. Extensive empirical analyses using TikTok and Douyin datasets prove that our method outperforms state-of-the-art benchmarks. Our method also discovers medically relevant topics from videos that are linked to depressive impact. We contribute to IS with a novel video analytics method that is generalizable to other video classification problems. Practically, our method can help platforms understand videos' mental impacts, thus adjusting recommendations and video topic disclosure.
https://arxiv.org/abs/2402.10045
In recent years, researchers have combined audio and video signals to deal with challenges where actions are not well represented or captured by visual cues alone. However, how to effectively leverage the two modalities remains under active exploration. In this work, we develop a multiscale multimodal Transformer (MMT) that leverages hierarchical representation learning. In particular, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer [43]. To learn a discriminative cross-modality fusion, we further design multimodal supervised contrastive objectives, an audio-video contrastive loss (AVC) and an intra-modal contrastive loss (IMC), that robustly align the two modalities. MMT surpasses previous state-of-the-art approaches by 7.3% and 2.1% top-1 accuracy on Kinetics-Sounds and VGGSound, respectively, without external training data. Moreover, the proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets, while being about 3% more efficient in terms of FLOPs and 9.8% more efficient in terms of GPU memory usage.
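For intuition, a minimal symmetric InfoNCE-style audio-video contrastive loss is sketched below. The paper's AVC/IMC objectives are supervised variants (same-class samples also count as positives), so this pairing-only version is a simplification, with the temperature chosen as an assumption.

```python
# Hedged sketch of an audio-video contrastive alignment loss (simplified AVC).
import torch
import torch.nn.functional as F

def audio_video_contrastive(a, v, temperature=0.07):
    """a, v: (B, D) audio and video embeddings of the same clips, row-aligned."""
    a, v = F.normalize(a, dim=-1), F.normalize(v, dim=-1)
    logits = a @ v.t() / temperature                 # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric: audio->video and video->audio matching
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```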
https://arxiv.org/abs/2401.04023
Audio and video are the two most common modalities on mainstream media platforms such as YouTube. To learn from multimodal videos effectively, in this work we propose a novel audio-video recognition approach termed the audio-video Transformer (AVT), which leverages the effective spatio-temporal representation of the video Transformer to improve action recognition accuracy. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%, and surpasses a previous state-of-the-art video Transformer [25] by 10% on VGGSound by leveraging the audio signal. Compared to a previous state-of-the-art multimodal method, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves accuracy by 3.8% on Epic-Kitchens-100.
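The bottleneck idea can be sketched as follows: a handful of shared tokens first read both modalities, then each modality reads the summary back, so cross-modal attention cost scales with the small bottleneck rather than the full audio-by-video token product. The dimensions and the two-step attention layout below are our assumptions, not AVT's exact module.

```python
# Hedged sketch of audio-video bottleneck fusion with shared fusion tokens.
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    def __init__(self, dim=256, n_bottleneck=4, n_heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim) * 0.02)
        self.collect = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):
        both = torch.cat([audio_tokens, video_tokens], dim=1)
        b = self.bottleneck.expand(both.size(0), -1, -1)
        b, _ = self.collect(b, both, both)        # bottleneck summarizes both
        fused, _ = self.broadcast(both, b, b)     # tokens read the summary back
        return fused

out = BottleneckFusion()(torch.randn(2, 50, 256), torch.randn(2, 196, 256))
```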
https://arxiv.org/abs/2401.04154
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the large language model, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval, and can be linked with LLMs to create multi-modal dialogue systems. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at this https URL.
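Zero-shot image/video classification with such contrastive vision-language models follows the familiar embed-and-compare recipe; the sketch below assumes hypothetical `encode_frames`/`encode_text` calls standing in for the real model API, with mean-pooling over frames as a simplifying assumption.

```python
# Hedged sketch of zero-shot video classification via text-video similarity.
import torch
import torch.nn.functional as F

def zero_shot_classify(frame_embeds: torch.Tensor, text_embeds: torch.Tensor):
    """frame_embeds: (T, D) per-frame embeddings; text_embeds: (C, D) class prompts."""
    video = F.normalize(frame_embeds.mean(dim=0), dim=-1)   # mean-pool over frames
    text = F.normalize(text_embeds, dim=-1)
    return (text @ video).softmax(dim=-1)                   # (C,) class probabilities

# probs = zero_shot_classify(encode_frames(video_path),     # hypothetical calls
#                            encode_text([f"a video of {c}" for c in class_names]))
```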
https://arxiv.org/abs/2312.14238
Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at this https URL.
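The optimal-transport self-labeling step can be sketched with a few Sinkhorn-Knopp iterations, which push pseudo-labels toward a balanced cluster assignment; the uniform prior and hyperparameters here are assumptions rather than the paper's exact choices.

```python
# Hedged sketch of Sinkhorn-based pseudo-labeling for snippet-cluster assignment.
import torch

@torch.no_grad()
def sinkhorn_pseudo_labels(logits, n_iters=3, eps=0.05):
    """logits: (N, K) snippet-to-cluster scores -> (N, K) balanced soft labels."""
    Q = torch.exp(logits / eps).t()               # (K, N) unnormalized plan
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):                      # alternate row/column scaling
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # each cluster gets mass 1/K
        Q /= Q.sum(dim=0, keepdim=True); Q /= N   # each snippet gets mass 1/N
    return (Q * N).t()                            # rows sum to 1

labels = sinkhorn_pseudo_labels(torch.randn(32, 8))   # 32 snippets, 8 clusters
```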
https://arxiv.org/abs/2312.14138
Video panoptic segmentation requires consistently segmenting (for both 'thing' and 'stuff' classes) and tracking objects in a video over time. In this work, we present MaXTron, a general framework that exploits Mask XFormer with Trajectory Attention to tackle the task. MaXTron enriches an off-the-shelf mask transformer by leveraging trajectory attention. The deployed mask transformer takes as input a short clip consisting of only a few frames and predicts the clip-level segmentation. To enhance the temporal consistency, MaXTron employs within-clip and cross-clip tracking modules, efficiently utilizing trajectory attention. Originally designed for video classification, trajectory attention learns to model the temporal correspondences between neighboring frames and aggregates information along the estimated motion paths. However, it is nontrivial to directly extend trajectory attention to per-pixel dense prediction tasks due to its quadratic dependency on input size. To alleviate this issue, we propose to adapt the trajectory attention for both the dense pixel features and the object queries, aiming to improve the short-term and long-term tracking results, respectively. Particularly, in our within-clip tracking module, we propose axial-trajectory attention, which effectively computes the trajectory attention for tracking dense pixels sequentially along the height- and width-axes. The axial decomposition significantly reduces the computational complexity for dense pixel features. In our cross-clip tracking module, since the object queries in the mask transformer are learned to encode the object information, we are able to capture long-term temporal connections by applying trajectory attention to the object queries, which learn to track each object across different clips. Without bells and whistles, MaXTron demonstrates state-of-the-art performance on video segmentation benchmarks.
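The axial decomposition can be illustrated compactly: instead of attending over all T·H·W tokens at once, attend over (time, height) for each column and then over (time, width) for each row. The sketch below is a simplified stand-in for axial-trajectory attention, not the exact module, with the dimensions assumed.

```python
# Hedged sketch of axial space-time attention over (T, H) then (T, W).
import torch
import torch.nn as nn

class AxialSpaceTimeAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn_h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                      # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        h = x.permute(0, 3, 1, 2, 4).reshape(B * W, T * H, C)  # per-column tokens
        h, _ = self.attn_h(h, h, h)
        x = h.reshape(B, W, T, H, C).permute(0, 2, 3, 1, 4)
        w = x.permute(0, 2, 1, 3, 4).reshape(B * H, T * W, C)  # per-row tokens
        w, _ = self.attn_w(w, w, w)
        return w.reshape(B, H, T, W, C).permute(0, 2, 1, 3, 4)

y = AxialSpaceTimeAttention()(torch.randn(1, 4, 8, 8, 64))     # (1, 4, 8, 8, 64)
```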
https://arxiv.org/abs/2311.18537
Deep learning-based models are at the forefront of most driver observation benchmarks due to their remarkable accuracies but are also associated with high computational costs. This is challenging, as resources are often limited in real-world driving scenarios. This paper introduces a lightweight framework for resource-efficient driver activity recognition. The framework enhances 3D MobileNet, a neural architecture optimized for speed in video classification, by incorporating knowledge distillation and model quantization to balance model accuracy and computational efficiency. Knowledge distillation helps maintain accuracy while reducing the model size by leveraging soft labels from a larger teacher model (I3D), instead of relying solely on original ground truth data. Model quantization significantly lowers memory and computation demands by using lower precision integers for model weights and activations. Extensive testing on a public dataset for in-vehicle monitoring during autonomous driving demonstrates that this new framework achieves a threefold reduction in model size and a 1.4-fold improvement in inference time, compared to an already optimized architecture. The code for this study is available at this https URL.
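The two ingredients named above can be sketched as follows: a standard soft-target distillation loss and post-training dynamic quantization. The temperature, mixing weight, and toy student below are assumptions, not the paper's configuration.

```python
# Hedged sketch: Hinton-style distillation loss plus dynamic int8 quantization.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T     # soft teacher targets
    hard = F.cross_entropy(student_logits, targets)    # ground-truth labels
    return alpha * soft + (1 - alpha) * hard

student = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 10))
# int8 weights for Linear layers; activations quantized on the fly at inference
quantized = torch.quantization.quantize_dynamic(student, {nn.Linear},
                                                dtype=torch.qint8)
```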
https://arxiv.org/abs/2311.05970
The proliferation of deepfake videos, synthetic media produced through advanced Artificial Intelligence techniques, has raised significant concerns across various sectors, encompassing realms such as politics, entertainment, and security. In response, this research introduces an innovative and streamlined model designed to adeptly classify deepfake videos generated by five distinct encoders. Our approach not only achieves state-of-the-art performance but also optimizes computational resources. At its core, our solution employs part of a VGG19bn as a backbone to efficiently extract features, a strategy proven effective in image-related tasks. We integrate a Capsule Network coupled with a Spatial-Temporal attention mechanism to bolster the model's classification capabilities while conserving resources. This combination captures intricate hierarchies among features, facilitating robust identification of deepfake attributes. Delving into the intricacies of our innovation, we introduce an existing video-level fusion technique that artfully capitalizes on temporal attention mechanisms. This mechanism serves to handle concatenated feature vectors, capitalizing on the intrinsic temporal dependencies embedded within deepfake videos. By aggregating insights across frames, our model gains a holistic comprehension of video content, resulting in more precise predictions. Experimental results on an extensive benchmark dataset of deepfake videos called DFDM showcase the efficacy of our proposed method. Notably, our approach achieves up to a 4 percent improvement in accurately categorizing deepfake videos compared to baseline models, all while demanding fewer computational resources.
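The video-level fusion step — weighting per-frame feature vectors with temporal attention before aggregating them into one clip vector — can be sketched minimally as below; this is a simplified stand-in for the fusion described above, with dimensions assumed.

```python
# Hedged sketch of temporal-attention pooling over per-frame features.
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats):                  # (B, T, D)
        w = self.score(frame_feats).softmax(dim=1)   # (B, T, 1) frame weights
        return (w * frame_feats).sum(dim=1)          # (B, D) video-level vector

pooled = TemporalAttentionPool()(torch.randn(2, 16, 256))   # (2, 256)
```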
https://arxiv.org/abs/2311.03782
Video domain generalization aims to learn generalizable video classification models for unseen target domains by training in a source domain. A critical challenge of video domain generalization is to defend against the heavy reliance on domain-specific cues extracted from the source domain when recognizing target videos. To this end, we propose to perceive diverse spatial-temporal cues in videos, aiming to discover potential domain-invariant cues in addition to domain-specific ones. We contribute a novel model named the Spatial-Temporal Diversification Network (STDN), which improves diversity along both the spatial and temporal dimensions of video data. First, our STDN discovers various types of spatial cues within individual frames by spatial grouping. Then, our STDN explicitly models spatial-temporal dependencies between video contents at multiple space-time scales through spatial-temporal relation modeling. Extensive experiments on three benchmarks of different types demonstrate the effectiveness and versatility of our approach.
https://arxiv.org/abs/2310.17942
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
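As a hedged illustration of KTS-style adaptive sampling, the sketch below segments per-frame features with kernel change-point detection — using the `ruptures` library as a stand-in for the paper's KTS step — and samples one clip center per semantically coherent segment.

```python
# Hedged sketch: adaptive clip sampling from kernel change-point segments.
import numpy as np
import ruptures as rpt

def adaptive_clip_centers(frame_feats: np.ndarray, n_segments: int):
    """frame_feats: (T, D) per-frame embeddings -> one center frame per segment."""
    bkps = rpt.KernelCPD(kernel="rbf").fit(frame_feats).predict(
        n_bkps=n_segments - 1)                    # breakpoint list ends at T
    bounds = [0] + bkps
    return [(lo + hi) // 2 for lo, hi in zip(bounds[:-1], bounds[1:])]

centers = adaptive_clip_centers(np.random.randn(300, 128), n_segments=5)
```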
https://arxiv.org/abs/2309.11569
Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
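The in-context recipe can be sketched as prompt assembly: frame captions and an audio transcript act as textual proxies for sight and hearing, and the LLM picks one label. `llm_complete` below is a hypothetical stand-in for any chat-completion call, and the prompt wording is our assumption.

```python
# Hedged sketch of zero-shot video classification via an LLM prompt.
def classify_video(frame_captions, transcript, labels):
    prompt = (
        "Frame descriptions:\n- " + "\n- ".join(frame_captions) +
        f"\n\nAudio transcript:\n{transcript}\n\n"
        f"Which one of these actions is shown? Options: {', '.join(labels)}.\n"
        "Answer with the label only."
    )
    return llm_complete(prompt).strip()   # hypothetical LLM call

# classify_video(["a person holds a guitar", "fingers strum the strings"],
#                "soft acoustic chords", ["PlayingGuitar", "Typing", "Knitting"])
```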
https://arxiv.org/abs/2309.10783
Learning high-quality video representations has significant applications in computer vision and remains challenging. Previous work based on masked autoencoders, such as ImageMAE and VideoMAE, has proven the effectiveness of learning representations in images and videos through a reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly in scenarios where extracting features solely from the visual modality proves challenging, such as when dealing with low-resolution and blurry original videos. Based on this, we propose AV-MaskEnhancer for learning high-quality video representations by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our results on the video classification task on the UCF101 dataset outperform existing work and reach the state of the art, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.
https://arxiv.org/abs/2309.08738
Optimizing video inference efficiency has become increasingly important with the growing demand for video analysis in various fields. Some existing methods achieve high efficiency by explicitly discarding spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with a Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features, refining and updating the features into a high-low resolution video sequence. To process the new sequence, we introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions, while reducing spatial computation costs quadratically by utilizing fewer spatial tokens in low-resolution non-saliency frames. The entire network can be end-to-end optimized via the integration of the differentiable compression module. Experimental results show that our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared to state-of-the-art methods. Code: this https URL
https://arxiv.org/abs/2309.08167
Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data, with sound and visual information, has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e., the VGGSound-FSL, UCF-FSL, and ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at this https URL.
https://arxiv.org/abs/2309.03869