Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embedding models such as VLM2Vec, E5-V, and GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multimodal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering, spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.
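Unified multimodal embedding models of this kind are typically trained with an in-batch contrastive objective over query-target pairs. The sketch below is a minimal InfoNCE-style loss in PyTorch, not the VLM2Vec-V2 training code; tensor names and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def infonce_loss(query_emb: torch.Tensor, target_emb: torch.Tensor,
                 temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of query_emb matches row i of target_emb.

    query_emb:  (B, D) embeddings of queries (text, image, video, or visual document).
    target_emb: (B, D) embeddings of the corresponding candidates.
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature                      # (B, B) cosine-similarity logits
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# toy usage with random embeddings
loss = infonce_loss(torch.randn(8, 512), torch.randn(8, 512))
```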
https://arxiv.org/abs/2507.04590
Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained with a classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance.
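As a rough illustration of the fusion strategy described above (GRU sequence encoding plus cross-modal attention), the following PyTorch module is a minimal sketch; the dimensions, pooling, and head are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Encode video frame features with a GRU, then attend over text tokens."""
    def __init__(self, video_dim=512, text_dim=768, hidden=256, num_classes=2):
        super().__init__()
        self.video_gru = nn.GRU(video_dim, hidden, batch_first=True)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, video_dim), text_feats: (B, L, text_dim)
        v, _ = self.video_gru(video_feats)            # (B, T, hidden)
        t = self.text_proj(text_feats)                # (B, L, hidden)
        fused, _ = self.cross_attn(v, t, t)           # video queries attend to text
        return self.head(fused.mean(dim=1))           # pool over time, classify

logits = CrossModalFusion()(torch.randn(4, 16, 512), torch.randn(4, 10, 768))
```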
https://arxiv.org/abs/2507.03531
We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8x fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.
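The core alignment step can be sketched with plain DTW over a frame-by-sub-action cosine-similarity matrix. The snippet below is an illustrative reimplementation under that assumption (with a simple path-length normalization), not the authors' code.

```python
import numpy as np

def dtw_alignment_score(frame_emb: np.ndarray, subaction_emb: np.ndarray) -> float:
    """Align T frame embeddings with K ordered sub-action embeddings via DTW.

    Both inputs are L2-normalized, shapes (T, D) and (K, D).
    Returns a similarity-style score along the optimal monotonic alignment path.
    """
    sim = frame_emb @ subaction_emb.T                   # (T, K) cosine similarities
    cost = 1.0 - sim
    T, K = cost.shape
    acc = np.full((T + 1, K + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, K + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return 1.0 - acc[T, K] / (T + K)                    # crude length normalization

# the class whose sub-action sequence aligns best (highest score) wins
frames = np.random.randn(32, 512); frames /= np.linalg.norm(frames, axis=1, keepdims=True)
subs = np.random.randn(5, 512); subs /= np.linalg.norm(subs, axis=1, keepdims=True)
score = dtw_alignment_score(frames, subs)
```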
https://arxiv.org/abs/2506.22967
In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a variable-length temporal sequence of embeddings, which we call a multivariate time series (MTS). An MTS naturally preserves temporal order and accommodates variable video durations. We then learn per-timestep, per-feature weights over the encoded MTS frames, allowing us to account for variations in feature importance over time. We introduce a new neural network architecture inspired by traditional time series alignment algorithms for this learning task. Our evaluation demonstrates that DejaVid substantially improves the performance of a state-of-the-art large encoder, achieving leading Top-1 accuracy of 77.2% on Something-Something V2, 89.1% on Kinetics-400, and 88.6% on HMDB51, while adding fewer than 1.8% additional learnable parameters and requiring less than 3 hours of training time. Our code is available at this https URL.
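A stripped-down way to picture the per-timestep, per-feature weighting is a learnable weight matrix with the same shape as the embedding time series, normalized over time and used for pooling. DejaVid's actual architecture is alignment-inspired and richer; the PyTorch sketch below only illustrates the weighting idea.

```python
import torch
import torch.nn as nn

class WeightedMTSHead(nn.Module):
    """Aggregate a (T, D) multivariate time series of clip embeddings with
    learnable per-timestep, per-feature weights, then classify."""
    def __init__(self, max_steps: int, feat_dim: int, num_classes: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(max_steps, feat_dim))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, mts: torch.Tensor) -> torch.Tensor:
        # mts: (B, T, D) with T <= max_steps (variable-length videos can be padded/masked)
        T = mts.size(1)
        w = torch.softmax(self.weights[:T], dim=0)     # normalize over time, per feature
        pooled = (mts * w).sum(dim=1)                  # (B, D)
        return self.classifier(pooled)

logits = WeightedMTSHead(max_steps=64, feat_dim=768, num_classes=174)(torch.randn(2, 40, 768))
```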
https://arxiv.org/abs/2506.12585
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: this https URL.
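A minimal sketch of how an audio-assisted pseudo-label could drive a few test-time adaptation steps on one sample is given below; the audio-to-video label mapping is shown as a hard-coded dictionary standing in for the LLM-generated mapping, and the early-stopping rule is a simplified stand-in for the paper's flexible adaptation cycle.

```python
import torch
import torch.nn.functional as F

# Hypothetical mapping produced offline by an LLM: audio class name -> video label index.
audio_to_video_label = {"dog_bark": 3, "guitar": 17, "engine": 42}

def adapt_on_sample(video_model, optimizer, clip, audio_class, max_iters=4, tol=1e-3):
    """Run a few adaptation steps on one test clip using its audio-assisted pseudo-label.

    Stops early when the loss change drops below `tol`, approximating a per-sample
    flexible adaptation cycle. clip: (1, C, T, H, W).
    """
    pseudo = torch.tensor([audio_to_video_label[audio_class]])
    prev_loss = None
    for _ in range(max_iters):
        loss = F.cross_entropy(video_model(clip), pseudo)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if prev_loss is not None and abs(prev_loss - loss.item()) < tol:
            break
        prev_loss = loss.item()
    return video_model

# toy usage with a stub "video model"
stub = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 32 * 32, 50))
opt = torch.optim.SGD(stub.parameters(), lr=1e-3)
adapt_on_sample(stub, opt, torch.randn(1, 3, 8, 32, 32), "engine")
```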
https://arxiv.org/abs/2506.12481
Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.
https://arxiv.org/abs/2506.05543
Evaluating the robustness of video classification models is very challenging, especially when compared to image-based models. The added temporal dimension significantly increases complexity and computational cost. One of the key challenges is to keep the perturbations to a minimum while still inducing misclassification. In this work, we propose a multi-agent reinforcement learning approach (spatial and temporal) that cooperatively learns to identify the given video's sensitive spatial and temporal regions. The agents consider temporal coherence in generating fine perturbations, leading to a more effective and visually imperceptible attack. Our method outperforms the state-of-the-art solutions on the Lp metric and the average number of queries. Our method enables custom distortion types, making the robustness evaluation more relevant to the use case. We extensively evaluate four popular models for video action recognition on two popular datasets, HMDB-51 and UCF-101.
https://arxiv.org/abs/2506.05431
This paper presents a deep learning-based framework for classifying forestry operations from dashcam video footage. Focusing on four key work elements - crane-out, cutting-and-to-processing, driving, and processing - the approach employs a 3D ResNet-50 architecture implemented with PyTorchVideo. Trained on a manually annotated dataset of field recordings, the model achieves strong performance, with a validation F1 score of 0.88 and precision of 0.90. These results underscore the effectiveness of spatiotemporal convolutional networks for capturing both motion patterns and appearance in real-world forestry environments. The system integrates standard preprocessing and augmentation techniques to improve generalization, but overfitting is evident, highlighting the need for more training data and better class balance. Despite these challenges, the method demonstrates clear potential for reducing the manual workload associated with traditional time studies, offering a scalable solution for operational monitoring and efficiency analysis in forestry. This work contributes to the growing application of AI in natural resource management and sets the foundation for future systems capable of real-time activity recognition in forest machinery. Planned improvements include dataset expansion, enhanced regularization, and deployment trials on embedded systems for in-field use.
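A minimal fine-tuning sketch for a PyTorchVideo 3D ResNet-50 ("slow_r50") with a four-class head is shown below; the head attribute path and hyperparameters are assumptions based on the public model zoo, not the paper's training setup.

```python
import torch
import torch.nn as nn

# Kinetics-pretrained 3D ResNet-50 ("slow_r50") from the PyTorchVideo model zoo;
# requires the pytorchvideo package, and the head attribute path may vary by version.
model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)

NUM_CLASSES = 4  # crane-out, cutting-and-to-processing, driving, processing
old_head = model.blocks[-1].proj                          # final Linear of the classification head
model.blocks[-1].proj = nn.Linear(old_head.in_features, NUM_CLASSES)

# standard fine-tuning step on (B, C, T, H, W) clips
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
clips, labels = torch.randn(2, 3, 8, 224, 224), torch.tensor([0, 2])
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```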
https://arxiv.org/abs/2505.24375
Event cameras offer high temporal resolution and power efficiency, making them well-suited for edge AI applications. However, their high event rates present challenges for data transmission and processing. Subsampling methods provide a practical solution, but their effect on downstream visual tasks remains underexplored. In this work, we systematically evaluate six hardware-friendly subsampling methods using convolutional neural networks for event video classification on various benchmark datasets. We hypothesize that events from high-density regions carry more task-relevant information and are therefore better suited for subsampling. To test this, we introduce a simple causal density-based subsampling method, demonstrating improved classification accuracy in sparse regimes. Our analysis further highlights key factors affecting subsampling performance, including sensitivity to hyperparameters and failure cases in scenarios with large event count variance. These findings provide insights into hardware-efficient subsampling strategies that balance data efficiency and task accuracy. The code for this paper will be released at: this https URL.
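The density hypothesis can be illustrated with a simple causal scoring rule: each event is scored by how many earlier events fall inside a small spatio-temporal neighbourhood, and the highest-scoring events are kept. The numpy sketch below is an offline simplification with illustrative window and radius values, not the paper's implementation.

```python
import numpy as np

def causal_density_subsample(events, keep_budget, window_us=10_000, radius=3):
    """events: (N, 4) array of (t, x, y, polarity), sorted by time.
    Scores each event by the number of earlier events within `window_us` microseconds
    and `radius` pixels (causal: only the past is used), then keeps the top scorers."""
    t, x, y = events[:, 0], events[:, 1], events[:, 2]
    scores = np.zeros(len(events))
    start = 0
    for i in range(len(events)):
        while t[start] < t[i] - window_us:                # slide the causal time window
            start += 1
        near = (np.abs(x[start:i] - x[i]) <= radius) & (np.abs(y[start:i] - y[i]) <= radius)
        scores[i] = near.sum()
    keep = np.argsort(-scores)[:keep_budget]
    return events[np.sort(keep)]                           # preserve temporal order

# toy usage on synthetic events
ev = np.stack([np.sort(np.random.randint(0, 100_000, 1000)),
               np.random.randint(0, 128, 1000),
               np.random.randint(0, 128, 1000),
               np.random.randint(0, 2, 1000)], axis=1)
sub = causal_density_subsample(ev, keep_budget=200)
```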
https://arxiv.org/abs/2505.21187
The Equine Facial Action Coding System (EquiFACS) enables the systematic annotation of facial movements through distinct Action Units (AUs). It serves as a crucial tool for assessing affective states in horses by identifying subtle facial expressions associated with discomfort. However, the field of horse affective state assessment is constrained by the scarcity of annotated data, as manually labelling facial AUs is both time-consuming and costly. To address this challenge, automated annotation systems are essential for leveraging existing datasets and improving affective states detection tools. In this work, we study different methods for specific ear AU detection and localization from horse videos. We leverage past works on deep learning-based video feature extraction combined with recurrent neural networks for the video classification task, as well as a classic optical flow based approach. We achieve 87.5% classification accuracy of ear movement presence on a public horse video dataset, demonstrating the potential of our approach. We discuss future directions to develop these systems, with the aim of bridging the gap between automated AU detection and practical applications in equine welfare and veterinary diagnostics. Our code will be made publicly available at this https URL.
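For the classic optical-flow branch, a minimal ear-movement presence check can threshold Farneback flow magnitudes inside an ear ROI, as sketched below with OpenCV; the thresholds and ROI handling are illustrative assumptions.

```python
import cv2
import numpy as np

def ear_movement_present(gray_frames, roi, flow_thresh=1.5, frac_thresh=0.05):
    """gray_frames: list of grayscale frames; roi: (x, y, w, h) box around the ear.
    Flags movement when enough ROI pixels exceed a flow-magnitude threshold in any
    consecutive frame pair; both thresholds are illustrative."""
    x, y, w, h = roi
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev[y:y + h, x:x + w], curr[y:y + h, x:x + w],
                                            None, 0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=2)           # per-pixel flow magnitude
        if (magnitude > flow_thresh).mean() > frac_thresh:
            return True
    return False

# toy usage with synthetic frames
frames = [np.random.randint(0, 255, (240, 320), dtype=np.uint8) for _ in range(5)]
moved = ear_movement_present(frames, roi=(100, 60, 64, 64))
```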
https://arxiv.org/abs/2505.03554
Despite the demonstrated effectiveness of transformer models in NLP and in image and video classification, the available tools for extracting features from captured IoT network flow packets capture neither sequential nor spatial patterns, which limits the application of transformer models. This work introduces a novel preprocessing method to adapt transformer models, the vision transformer (ViT) in particular, for IoT botnet attack detection using network flow packets. The approach involves feature extraction from .pcap files and transforming each instance into a 1-channel 2D image shape, enabling ViT-based classification. Also, the ViT model was enhanced to allow the use of any classifier besides the Multilayer Perceptron (MLP) deployed in the original ViT paper. Models including the conventional feed-forward Deep Neural Network (DNN), LSTM, and Bidirectional LSTM (BLSTM) demonstrated competitive performance in terms of precision, recall, and F1-score for multiclass attack detection when evaluated on two IoT attack datasets.
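The preprocessing step of turning a flow record's feature vector into a 1-channel 2D image can be sketched as min-max scaling followed by padding and reshaping; the image side length below is an assumption, not the paper's chosen resolution.

```python
import numpy as np

def flow_features_to_image(features, side=16):
    """Turn one flow record's numeric feature vector (e.g. extracted from a .pcap)
    into a 1-channel side x side image for a ViT-style classifier.
    Values are min-max scaled to [0, 1]; the vector is zero-padded or truncated."""
    v = np.asarray(features, dtype=np.float32)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)
    v = np.pad(v, (0, max(0, side * side - v.size)))[:side * side]
    return v.reshape(1, side, side)                        # (channels=1, H, W)

img = flow_features_to_image(np.random.rand(115))          # 115 flow features -> (1, 16, 16)
```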
https://arxiv.org/abs/2504.18781
We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
https://arxiv.org/abs/2504.13181
In this study, hypertension is utilized as an indicator of individual vascular damage. This damage can be identified through machine learning techniques, providing an early risk marker for potential major cardiovascular events and offering valuable insights into the overall arterial condition of individual patients. To this end, the VideoMAE deep learning model, originally developed for video classification, was adapted by finetuning for application in the domain of ultrasound imaging. The model was trained and tested using a dataset comprising over 31,000 carotid sonography videos sourced from the Gutenberg Health Study (15,010 participants), one of the largest prospective population health studies. This adaptation facilitates the classification of individuals as hypertensive or non-hypertensive (75.7% validation accuracy), functioning as a proxy for detecting visual arterial damage. We demonstrate that our machine learning model effectively captures visual features that provide valuable insights into an individual's overall cardiovascular health.
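A minimal sketch of adapting VideoMAE to a binary classification task with the Hugging Face transformers API follows; the checkpoint is the public base model, and the input shape and label names are illustrative rather than the study's configuration.

```python
import torch
from transformers import VideoMAEForVideoClassification

# Adapt a video-pretrained VideoMAE backbone to a binary task
# (hypertensive vs. non-hypertensive); this is the public base checkpoint,
# not the authors' ultrasound-finetuned weights.
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base", num_labels=2,
    id2label={0: "non-hypertensive", 1: "hypertensive"},
)

pixel_values = torch.randn(1, 16, 3, 224, 224)   # (batch, frames, channels, H, W)
labels = torch.tensor([1])
out = model(pixel_values=pixel_values, labels=labels)
out.loss.backward()                               # fine-tune end to end (or freeze the encoder)
```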
https://arxiv.org/abs/2504.06680
Integrating vision models into large language models (LLMs) has sparked significant interest in creating vision-language foundation models, especially for video understanding. Recent methods often utilize memory banks to handle untrimmed videos for video-level understanding. However, they typically compress visual memory using similarity-based greedy approaches, which can overlook the contextual importance of individual tokens. To address this, we introduce an efficient LLM adapter designed for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. Our framework leverages scorer networks to selectively compress the visual memory bank and filter spatial tokens based on relevance, using a differentiable Top-K operator for end-to-end training. Across three key video-level understanding tasks (untrimmed video classification, video question answering, and video captioning), our method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. The code will be available soon on GitHub.
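One common way to build a differentiable Top-K operator is a straight-through relaxation: hard selection in the forward pass, softmax gradients in the backward pass. The sketch below shows that construction; the paper's exact operator and scorer networks may differ.

```python
import torch

def differentiable_topk_mask(scores: torch.Tensor, k: int, tau: float = 0.1) -> torch.Tensor:
    """Straight-through relaxation of a Top-K selection mask.

    Forward pass: hard 0/1 mask keeping the k highest-scoring tokens.
    Backward pass: gradients flow through a softmax over the scores.
    scores: (B, N) relevance scores from a scorer network.
    """
    soft = torch.softmax(scores / tau, dim=-1)                 # differentiable surrogate
    idx = scores.topk(k, dim=-1).indices
    hard = torch.zeros_like(scores).scatter_(-1, idx, 1.0)     # hard Top-K mask
    return hard + soft - soft.detach()                          # straight-through estimator

scores = torch.randn(2, 100, requires_grad=True)
mask = differentiable_topk_mask(scores, k=10)                   # (2, 100), k ones per row
selected = mask.unsqueeze(-1) * torch.randn(2, 100, 768)        # keep only relevant tokens
selected.sum().backward()                                       # gradients reach the scorer
```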
https://arxiv.org/abs/2504.05491
Deep learning models have achieved remarkable success in computer vision but remain vulnerable to adversarial attacks, particularly in black-box settings where model details are unknown. Existing adversarial attack methods (even those that work with key frames) often treat video data as simple vectors, ignoring their inherent multi-dimensional structure, and require a large number of queries, making them inefficient and detectable. In this paper, we propose TenAd, a novel tensor-based low-rank adversarial attack that leverages the multi-dimensional properties of video data by representing videos as fourth-order tensors. By exploiting low-rank structure, our method significantly reduces the search space and the number of queries needed to generate adversarial examples in black-box settings. Experimental results on standard video classification datasets demonstrate that TenAd effectively generates imperceptible adversarial perturbations while achieving higher attack success rates and query efficiency compared to state-of-the-art methods. Our approach outperforms existing black-box adversarial attacks in terms of success rate, query efficiency, and perturbation imperceptibility, highlighting the potential of tensor-based methods for adversarial attacks on video models.
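The low-rank idea can be sketched by parameterizing the video perturbation as a CP-style sum of per-mode factors, so a black-box search only has to optimize the factor entries. The PyTorch snippet below is an illustration of that parameterization, not the TenAd algorithm itself.

```python
import torch

def low_rank_video_perturbation(ft, fc, fh, fw):
    """Rank-R perturbation for a (T, C, H, W) video built as a CP-style sum of
    outer products of per-mode factors: ft (R,T), fc (R,C), fh (R,H), fw (R,W)."""
    return torch.einsum("rt,rc,rh,rw->tchw", ft, fc, fh, fw)

T, C, H, W, R = 16, 3, 112, 112, 2
factors = [torch.randn(R, d) * 0.01 for d in (T, C, H, W)]
delta = low_rank_video_perturbation(*factors)
adv_video = (torch.rand(T, C, H, W) + delta).clamp(0, 1)   # perturbed clip, valid pixel range
# A black-box search now only optimizes R * (T + C + H + W) factor entries
# instead of all T * C * H * W pixel values.
```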
https://arxiv.org/abs/2504.01228
We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects) and leverage them to examine representation biases across three dimensions: 1) concept bias - determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias - assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
https://arxiv.org/abs/2503.18637
Computer-aided pathology detection algorithms for video-based imaging modalities must accurately interpret complex spatiotemporal information by integrating findings across multiple frames. Current state-of-the-art methods operate by classifying on video sub-volumes (tubelets), but they often lose global spatial context by focusing only on local regions within detection ROIs. Here we propose a lightweight framework for tubelet-based object detection and video classification that preserves both global spatial context and fine spatiotemporal features. To address the loss of global context, we embed tubelet location, size, and confidence as inputs to the classifier. Additionally, we use ROI-aligned feature maps from a pre-trained detection model, leveraging learned feature representations to increase the receptive field and reduce computational complexity. Our method is efficient, with the spatiotemporal tubelet classifier comprising only 0.4M parameters. We apply our approach to detect and classify lung consolidation and pleural effusion in ultrasound videos. Five-fold cross-validation on 14,804 videos from 828 patients shows our method outperforms previous tubelet-based approaches and is suited for real-time workflows.
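A minimal sketch of a tubelet classifier that restores global context by concatenating normalized location, size, and detector confidence with temporally encoded ROI-aligned features is shown below; the layer sizes and metadata layout are assumptions, not the paper's 0.4M-parameter design.

```python
import torch
import torch.nn as nn

class TubeletClassifier(nn.Module):
    """Classify a tubelet from ROI-aligned feature maps plus its global context:
    normalized location, size, and detector confidence (layout is illustrative)."""
    def __init__(self, feat_dim=256, meta_dim=5, hidden=128, num_classes=3):
        super().__init__()
        self.temporal = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden + meta_dim, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, num_classes))

    def forward(self, roi_feats, meta):
        # roi_feats: (B, T, feat_dim) pooled ROI-aligned features per frame
        # meta:      (B, meta_dim) = [cx, cy, w, h, confidence], normalized to [0, 1]
        _, h = self.temporal(roi_feats)                 # h: (1, B, hidden)
        return self.head(torch.cat([h[-1], meta], dim=-1))

logits = TubeletClassifier()(torch.randn(2, 8, 256), torch.rand(2, 5))
```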
https://arxiv.org/abs/2503.17475
Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at this https URL.
https://arxiv.org/abs/2503.14966
Automatic video activity recognition is crucial across numerous domains like surveillance, healthcare, and robotics. However, recognizing human activities from video data becomes challenging when training and test data stem from diverse domains. Domain generalization, adapting to unforeseen domains, is thus essential. This paper focuses on office activity recognition amidst environmental variability. We propose three pre-processing techniques applicable to any video encoder, enhancing robustness against environmental variations. Our study showcases the efficacy of MViT, a leading state-of-the-art video classification model, and other video encoders combined with our techniques, outperforming state-of-the-art domain adaptation methods. Our approach significantly boosts accuracy, precision, recall and F1 score on unseen domains, emphasizing its adaptability in real-world scenarios with diverse video data sources. This method lays a foundation for more reliable video activity recognition systems across heterogeneous data domains.
https://arxiv.org/abs/2503.12678
Object detection in videos plays a crucial role in advancing applications such as public safety and anomaly detection. Existing methods have explored different techniques, including CNN, deep learning, and Transformers, for object detection and video classification. However, detecting tiny objects, e.g., guns, in videos remains challenging due to their small scale and varying appearances in complex scenes. Moreover, existing video analysis models for classification or detection often perform poorly in real-world gun detection scenarios due to limited labeled video datasets for training. Thus, developing efficient methods for effectively capturing tiny object features and designing models capable of accurate gun detection in real-world videos is imperative. To address these challenges, we make three original contributions in this paper. First, we conduct an empirical study of several existing video classification and object detection methods to identify guns in videos. Our extensive analysis shows that these methods may not accurately detect guns in videos. Second, we propose a novel two-stage gun detection method. In stage 1, we train an image-augmented model to effectively classify "Gun" videos. To make the detection more precise and efficient, stage 2 employs an object detection model to locate the exact region of the gun within video frames for videos classified as "Gun" by stage 1. Third, our experimental results demonstrate that the proposed domain-specific method achieves significant performance improvements and enhances efficiency compared with existing techniques. We also discuss challenges and future research directions in gun detection tasks in computer vision.
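The two-stage pipeline can be sketched as a simple control flow: run the video-level classifier first and invoke the frame-level detector only on clips flagged as containing a gun. The function below uses placeholder model callables and an illustrative threshold.

```python
def two_stage_gun_detection(video_frames, clip_classifier, gun_detector, clip_threshold=0.5):
    """Stage 1: a video-level classifier flags whether the clip contains a gun.
    Stage 2: only flagged clips are passed to a frame-level object detector,
    which localizes gun regions. Both models are placeholders for trained networks."""
    gun_prob = clip_classifier(video_frames)           # scalar probability for the whole clip
    if gun_prob < clip_threshold:
        return []                                      # skip detection on "no gun" videos
    detections = []
    for i, frame in enumerate(video_frames):
        boxes = gun_detector(frame)                    # [(x1, y1, x2, y2, score), ...]
        detections.extend((i, *box) for box in boxes)
    return detections

# toy usage with stub models
dets = two_stage_gun_detection([None, None],
                               clip_classifier=lambda v: 0.9,
                               gun_detector=lambda f: [(10, 20, 50, 80, 0.7)])
```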
https://arxiv.org/abs/2503.06317