We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
https://arxiv.org/abs/2504.13181
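The training signal at the heart of PE is plain contrastive vision-language alignment. As a rough illustration of that family of objectives (a generic CLIP-style symmetric InfoNCE sketch, not PE's actual recipe; the embedding dimension and temperature are placeholders):

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    A generic CLIP-style objective, not PE's exact training recipe.
    """
    image_emb = F.normalize(image_emb, dim=-1)       # cosine similarity via dot product
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # and each caption to its image
    return 0.5 * (loss_i2t + loss_t2i)

# toy usage with random embeddings
print(clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```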
In this study, hypertension is utilized as an indicator of individual vascular damage. This damage can be identified through machine learning techniques, providing an early risk marker for potential major cardiovascular events and offering valuable insights into the overall arterial condition of individual patients. To this end, the VideoMAE deep learning model, originally developed for video classification, was adapted by finetuning for application in the domain of ultrasound imaging. The model was trained and tested using a dataset comprising over 31,000 carotid sonography videos sourced from the Gutenberg Health Study (15,010 participants), one of the largest prospective population health studies. This adaptation facilitates the classification of individuals as hypertensive or non-hypertensive (75.7% validation accuracy), functioning as a proxy for detecting visual arterial damage. We demonstrate that our machine learning model effectively captures visual features that provide valuable insights into an individual's overall cardiovascular health.
https://arxiv.org/abs/2504.06680
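For readers wanting to reproduce the general setup, a minimal sketch of fine-tuning a VideoMAE backbone with a two-class head (hypertensive vs. non-hypertensive) via the Hugging Face transformers API is below; the checkpoint name, clip length, and hyperparameters are illustrative assumptions, not the study's actual configuration:

```python
import torch
from transformers import VideoMAEForVideoClassification

# Two-class head: hypertensive vs. non-hypertensive. The checkpoint below is a
# publicly available generic backbone used as a stand-in; the study's actual
# weights, preprocessing, and hyperparameters are assumptions here.
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One toy training step on a fake clip: (batch, frames, channels, height, width).
pixel_values = torch.randn(2, 16, 3, 224, 224)
labels = torch.tensor([0, 1])

outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```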
Integrating vision models into large language models (LLMs) has sparked significant interest in creating vision-language foundation models, especially for video understanding. Recent methods often utilize memory banks to handle untrimmed videos for video-level understanding. However, they typically compress visual memory using similarity-based greedy approaches, which can overlook the contextual importance of individual tokens. To address this, we introduce an efficient LLM adapter designed for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. Our framework leverages scorer networks to selectively compress the visual memory bank and filter spatial tokens based on relevance, using a differentiable Top-K operator for end-to-end training. Across three key video-level understanding tasks (untrimmed video classification, video question answering, and video captioning), our method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. The code will be available soon on GitHub.
https://arxiv.org/abs/2504.05491
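The key mechanism is a scorer network whose Top-K selection stays differentiable for end-to-end training. One common way to approximate such an operator is a straight-through relaxation, sketched below (a generic construction; the paper's actual differentiable Top-K operator and scorer architecture may differ):

```python
import torch
import torch.nn as nn

class StraightThroughTopK(nn.Module):
    """Score tokens, hard-select the top-k, but route gradients through the
    soft scores (a generic relaxation; the paper's operator may differ)."""

    def __init__(self, dim, k):
        super().__init__()
        self.k = k
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1))

    def forward(self, tokens):                       # tokens: (batch, n_tokens, dim)
        scores = self.scorer(tokens).squeeze(-1)     # (batch, n_tokens) relevance
        soft = torch.softmax(scores, dim=-1)
        top_idx = scores.topk(self.k, dim=-1).indices
        hard = torch.zeros_like(soft).scatter_(-1, top_idx, 1.0)
        mask = hard + soft - soft.detach()           # forward: hard 0/1, backward: soft
        return tokens * mask.unsqueeze(-1), mask

tokens = torch.randn(2, 196, 768, requires_grad=True)
selected, mask = StraightThroughTopK(dim=768, k=32)(tokens)
selected.sum().backward()                            # gradients reach the scorer
print(int(mask[0].sum()))                            # exactly 32 tokens kept per sample
```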
Deep learning models have achieved remarkable success in computer vision but remain vulnerable to adversarial attacks, particularly in black-box settings where model details are unknown. Existing adversarial attack methods (even those that work with key frames) often treat video data as simple vectors, ignoring their inherent multi-dimensional structure, and require a large number of queries, making them inefficient and detectable. In this paper, we propose TenAd, a novel tensor-based low-rank adversarial attack that leverages the multi-dimensional properties of video data by representing videos as fourth-order tensors. By exploiting this low-rank structure, our method significantly reduces the search space and the number of queries needed to generate adversarial examples in black-box settings. Experimental results on standard video classification datasets demonstrate that TenAd effectively generates imperceptible adversarial perturbations while achieving higher attack success rates and query efficiency compared to state-of-the-art methods. Our approach outperforms existing black-box adversarial attacks in terms of success rate, query efficiency, and perturbation imperceptibility, highlighting the potential of tensor-based methods for adversarial attacks on video models.
https://arxiv.org/abs/2504.01228
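To make the low-rank idea concrete, the sketch below parameterizes a perturbation on a (frames, height, width, channels) video as a rank-R CP-style sum of outer products and runs a toy random-search query loop; the factorization choice, search strategy, and the `query_model` callable are all assumptions, not TenAd's actual algorithm:

```python
import torch

def low_rank_perturbation(factors):
    """Rank-R CP-style perturbation for a (T, H, W, C) video.
    factors: matrices of shapes (R,T), (R,H), (R,W), (R,C)."""
    a, b, c, d = factors
    return torch.einsum('rt,rh,rw,rc->thwc', a, b, c, d)

def random_search_attack(video, label, query_model, rank=2, steps=100,
                         eps=4 / 255, sigma=0.01):
    """Toy black-box loop: propose small factor updates and keep those that
    lower the victim model's confidence in the true label. `query_model` is a
    hypothetical callable returning that confidence."""
    factors = [torch.zeros(rank, n) for n in video.shape]
    best = query_model(video, label)
    for _ in range(steps):
        trial = [f + sigma * torch.randn_like(f) for f in factors]
        delta = low_rank_perturbation(trial).clamp(-eps, eps)
        score = query_model((video + delta).clamp(0, 1), label)
        if score < best:                              # smaller true-class score = stronger attack
            best, factors = score, trial
    return (video + low_rank_perturbation(factors).clamp(-eps, eps)).clamp(0, 1)

# toy usage: a random "video" and a dummy stand-in for the black-box model
video = torch.rand(16, 32, 32, 3)                     # (frames, height, width, channels)
dummy_model = lambda v, y: torch.sigmoid(v.mean())    # placeholder victim score
adv = random_search_attack(video, label=0, query_model=dummy_model, steps=20)
print(adv.shape, float((adv - video).abs().max()))
```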
We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g. only objects) and leverage them to examine representation biases across three dimensions: 1) concept bias - determining if a specific concept (e.g., objects) alone suffice for prediction; 2) temporal bias - assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
https://arxiv.org/abs/2503.18637
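As a toy version of the concept-bias check (an approximation only: UTD relies on VLM/LLM-generated descriptions and a more careful protocol), one can test whether an object-only description already reveals the label via text-embedding similarity to the class names:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative concept-bias probe (an approximation; UTD uses VLM/LLM descriptions
# and a more careful protocol): if the label is recoverable from an object-only
# description, the sample does not really test temporal understanding.
model = SentenceTransformer("all-MiniLM-L6-v2")        # any sentence encoder works here

class_names = ["playing guitar", "mowing the lawn", "swimming"]
object_only_desc = "a person, a guitar, a chair, a living room"   # hypothetical filtered description

class_emb = model.encode(class_names, convert_to_tensor=True)
desc_emb = model.encode(object_only_desc, convert_to_tensor=True)

scores = util.cos_sim(desc_emb, class_emb)[0]
print(class_names[int(scores.argmax())])               # objects alone already reveal the class
```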
Computer-aided pathology detection algorithms for video-based imaging modalities must accurately interpret complex spatiotemporal information by integrating findings across multiple frames. Current state-of-the-art methods operate by classifying on video sub-volumes (tubelets), but they often lose global spatial context by focusing only on local regions within detection ROIs. Here we propose a lightweight framework for tubelet-based object detection and video classification that preserves both global spatial context and fine spatiotemporal features. To address the loss of global context, we embed tubelet location, size, and confidence as inputs to the classifier. Additionally, we use ROI-aligned feature maps from a pre-trained detection model, leveraging learned feature representations to increase the receptive field and reduce computational complexity. Our method is efficient, with the spatiotemporal tubelet classifier comprising only 0.4M parameters. We apply our approach to detect and classify lung consolidation and pleural effusion in ultrasound videos. Five-fold cross-validation on 14,804 videos from 828 patients shows our method outperforms previous tubelet-based approaches and is suited for real-time workflows.
https://arxiv.org/abs/2503.17475
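A minimal sketch of the idea of feeding tubelet geometry alongside ROI-aligned features is shown below; the layer sizes, the GRU temporal summarizer, and the fusion scheme are assumptions rather than the authors' exact 0.4M-parameter head:

```python
import torch
import torch.nn as nn

class TubeletClassifier(nn.Module):
    """Classify a tubelet from ROI-aligned detector features plus its geometry.

    roi_feats: (batch, n_frames, feat_dim) ROI-aligned features from a frozen detector
    geometry:  (batch, 5) normalized [cx, cy, w, h, detection_confidence]
    Sizes and the GRU summarizer are assumptions, not the authors' exact head.
    """
    def __init__(self, feat_dim=256, hidden=128, n_classes=3):
        super().__init__()
        self.geo_embed = nn.Linear(5, hidden)            # re-inject global spatial context
        self.temporal = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, roi_feats, geometry):
        _, h = self.temporal(roi_feats)                  # summarize the tubelet over time
        fused = torch.cat([h[-1], self.geo_embed(geometry)], dim=-1)
        return self.head(fused)

logits = TubeletClassifier()(torch.randn(4, 8, 256), torch.rand(4, 5))
print(logits.shape)                                      # torch.Size([4, 3])
```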
Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at this https URL.
https://arxiv.org/abs/2503.14966
Automatic video activity recognition is crucial across numerous domains like surveillance, healthcare, and robotics. However, recognizing human activities from video data becomes challenging when training and test data stem from diverse domains. Domain generalization, adapting to unforeseen domains, is thus essential. This paper focuses on office activity recognition amidst environmental variability. We propose three pre-processing techniques applicable to any video encoder, enhancing robustness against environmental variations. Our study shows that MViT, a leading state-of-the-art video classification model, and other video encoders, when combined with our techniques, outperform state-of-the-art domain adaptation methods. Our approach significantly boosts accuracy, precision, recall, and F1 score on unseen domains, emphasizing its adaptability in real-world scenarios with diverse video data sources. This method lays a foundation for more reliable video activity recognition systems across heterogeneous data domains.
https://arxiv.org/abs/2503.12678
Object detection in videos plays a crucial role in advancing applications such as public safety and anomaly detection. Existing methods have explored different techniques, including CNN, deep learning, and Transformers, for object detection and video classification. However, detecting tiny objects, e.g., guns, in videos remains challenging due to their small scale and varying appearances in complex scenes. Moreover, existing video analysis models for classification or detection often perform poorly in real-world gun detection scenarios due to limited labeled video datasets for training. Thus, developing efficient methods for effectively capturing tiny object features and designing models capable of accurate gun detection in real-world videos is imperative. To address these challenges, we make three original contributions in this paper. First, we conduct an empirical study of several existing video classification and object detection methods to identify guns in videos. Our extensive analysis shows that these methods may not accurately detect guns in videos. Second, we propose a novel two-stage gun detection method. In stage 1, we train an image-augmented model to effectively classify "Gun" videos. To make the detection more precise and efficient, stage 2 employs an object detection model to locate the exact region of the gun within video frames for videos classified as "Gun" by stage 1. Third, our experimental results demonstrate that the proposed domain-specific method achieves significant performance improvements and enhances efficiency compared with existing techniques. We also discuss challenges and future research directions in gun detection tasks in computer vision.
https://arxiv.org/abs/2503.06317
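The two-stage pipeline can be summarized in a few lines; the classifier and detector below are hypothetical placeholders standing in for the paper's trained models:

```python
def two_stage_gun_detection(video_frames, video_classifier, object_detector,
                            threshold=0.5):
    """Stage 1: classify the whole video as 'Gun' / 'No gun'.
    Stage 2: only for 'Gun' videos, localize the gun in each frame.
    Both models are hypothetical placeholders for the paper's trained models."""
    gun_prob = video_classifier(video_frames)             # stage-1 video-level score
    if gun_prob < threshold:
        return {"label": "No gun", "boxes": []}

    boxes = []
    for t, frame in enumerate(video_frames):              # stage-2 per-frame localization
        for x1, y1, x2, y2, score in object_detector(frame):
            if score >= threshold:
                boxes.append({"frame": t, "box": (x1, y1, x2, y2), "score": score})
    return {"label": "Gun", "probability": gun_prob, "boxes": boxes}

# toy usage with dummy stand-ins for the two models
result = two_stage_gun_detection(
    video_frames=[None] * 3,
    video_classifier=lambda frames: 0.9,
    object_detector=lambda frame: [(10, 10, 50, 50, 0.8)])
print(result["label"], len(result["boxes"]))              # Gun 3
```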
The intersection of medical imaging and artificial intelligence has become an important research direction in intelligent medical treatment, particularly in the analysis of medical images using deep learning for clinical diagnosis. Despite these advances, existing keyframe classification methods lack extraction of time-series features, while ultrasound video classification based on three-dimensional convolution requires uniform frame numbers across patients, resulting in poor feature extraction efficiency and model classification performance. This study proposes a novel video classification method based on CNN and LSTM, introducing NLP's long- and short-sentence processing scheme into video classification for the first time. The method reduces CNN-extracted image features to 1x512 dimensions, followed by sorting and compressing the feature vectors for LSTM training. Specifically, feature vectors are sorted by each patient's video frame count and padded with zeros to form variable-length batches, with the invalid padding values compressed away before LSTM training to conserve computing resources. Experimental results demonstrate that our variable-frame CNN-LSTM method outperforms other approaches across all metrics, showing improvements of 3-6% in F1 score and 1.5% in specificity compared to keyframe methods. The variable-frame CNN-LSTM also achieves better accuracy and precision than the equal-frame CNN-LSTM. These findings validate the effectiveness of our approach in classifying variable-frame ultrasound videos and suggest potential applications in other medical imaging modalities.
https://arxiv.org/abs/2502.11481
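The sort-pad-compress step maps naturally onto PyTorch's packed sequences; a minimal sketch under that assumption (generic mechanics, not the paper's exact model) is:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Per-frame 512-d CNN features of different lengths are padded into one batch,
# then the padding is "compressed away" with pack_padded_sequence so the LSTM
# skips the zeros. Generic PyTorch mechanics; the paper's model details may differ.
feats = [torch.randn(n, 512) for n in (37, 24, 11)]       # three patients, different frame counts
lengths = torch.tensor([f.size(0) for f in feats])        # already sorted, longest first

padded = pad_sequence(feats, batch_first=True)            # (3, 37, 512), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)

lstm = nn.LSTM(input_size=512, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, 2)

_, (h_n, _) = lstm(packed)                                # padding never enters the LSTM
logits = classifier(h_n[-1])                              # one prediction per video
print(logits.shape)                                       # torch.Size([3, 2])
```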
In this study, we tackle industry challenges in video content classification by exploring and optimizing GPT-based models for zero-shot classification across seven critical categories of video quality. We contribute a novel approach to improving GPT's performance through prompt optimization and policy refinement, demonstrating that simplifying complex policies significantly reduces false negatives. Additionally, we introduce a new decomposition-aggregation-based prompt engineering technique, which outperforms traditional single-prompt methods. These experiments, conducted on real industry problems, show that thoughtful prompt design can substantially enhance GPT's performance without additional finetuning, offering an effective and scalable solution for improving video classification systems across various domains in industry.
https://arxiv.org/abs/2502.09573
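A schematic of the decomposition-aggregation idea is below; the sub-questions, the aggregation rule, and the `ask_llm` helper are hypothetical stand-ins, since the paper's real policies and prompts are not reproduced here:

```python
# Schematic decomposition-aggregation prompting. The sub-questions, aggregation
# rule, and `ask_llm` helper are hypothetical stand-ins, not the paper's policies.

SUB_QUESTIONS = [
    "Is the video blurry or out of focus? Answer yes or no.",
    "Does the video show heavy compression or flickering artifacts? Answer yes or no.",
    "Is the main subject cropped out or badly framed? Answer yes or no.",
]

def ask_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to a GPT-style model; returns a
    canned answer here so the sketch runs without an API key."""
    return "no"

def classify_video_quality(video_description: str) -> str:
    # Decompose: one narrow question per call instead of a single long policy prompt.
    answers = [ask_llm(f"{q}\n\nVideo: {video_description}") for q in SUB_QUESTIONS]
    # Aggregate: a simple rule over the per-question verdicts.
    violations = sum(a.strip().lower().startswith("yes") for a in answers)
    return "low_quality" if violations >= 1 else "acceptable"

print(classify_video_quality("a steady, well-lit cooking clip"))   # acceptable
```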
Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
https://arxiv.org/abs/2502.02118
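The core of residual quantization is easy to sketch: each codebook quantizes what the previous stages left unexplained. The code below is a generic RQ illustration with random codebooks, not BRIDLE's trained tokenizer:

```python
import torch

def residual_quantize(x, codebooks):
    """Residual quantization with a stack of codebooks.

    x:         (batch, dim) latent vectors
    codebooks: list of (codebook_size, dim) tensors
    Each stage quantizes what the previous stages failed to explain, giving a
    coarse-to-fine hierarchy. A generic RQ sketch, not BRIDLE's trained tokenizer.
    """
    residual, quantized, codes = x, torch.zeros_like(x), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)    # nearest code per vector
        chosen = cb[idx]
        quantized = quantized + chosen
        residual = residual - chosen                      # pass the leftover to the next stage
        codes.append(idx)
    return quantized, torch.stack(codes, dim=-1)          # reconstruction + per-stage tokens

x = torch.randn(4, 64)
codebooks = [torch.randn(256, 64) for _ in range(3)]      # three hierarchical codebooks
xq, codes = residual_quantize(x, codebooks)
print(xq.shape, codes.shape)                              # torch.Size([4, 64]) torch.Size([4, 3])
```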
Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embeddings with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (a support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities, while the support-set cannot be tuned. To this end, we draw on the strengths of both and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts queried from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to temporal prediction consistency in a self-supervised manner to mine pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and offers good interpretability for the support-set dilation and erosion.
https://arxiv.org/abs/2502.00426
We introduce VIBA, a novel approach for explainable video classification by adapting Information Bottlenecks for Attribution (IBA) to video sequences. While most traditional explainability methods are designed for image models, our IBA framework addresses the need for explainability in temporal models used for video analysis. To demonstrate its effectiveness, we apply VIBA to video deepfake detection, testing it on two architectures: the Xception model for spatial features and a VGG11-based model for capturing motion dynamics through optical flow. Using a custom dataset that reflects recent deepfake generation techniques, we adapt IBA to create relevance and optical flow maps, visually highlighting manipulated regions and motion inconsistencies. Our results show that VIBA generates temporally and spatially consistent explanations, which align closely with human annotations, thus providing interpretability for video classification and particularly for deepfake detection.
https://arxiv.org/abs/2501.16889
Document-level Urdu Sentiment Analysis (SA) is a challenging Natural Language Processing (NLP) task, as it deals with large documents in a resource-poor language. In large documents, there are ample amounts of words that exhibit different viewpoints. Deep learning (DL) models comprise complex neural network architectures that have the ability to learn diverse features of the data to classify various sentiments. Besides audio, image, and video classification, DL algorithms are now extensively used in text-based classification problems. To explore powerful DL techniques for Urdu SA, we apply five DL architectures: Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network (CNN), Convolutional Neural Network with Bidirectional Long Short-Term Memory (CNN-BiLSTM), Bidirectional Encoder Representations from Transformers (BERT), and a proposed DL hybrid model that integrates BiLSTM with a Single-Layer Multi-Filter Convolutional Neural Network (BiLSTM-SLMFCNN). The proposed and baseline techniques are applied to the Urdu Customer Support data set and the IMDB Urdu movie review data set using pretrained Urdu word embeddings that are suitable for SA at the document level. The results of these techniques are evaluated, and our proposed model outperforms all other DL techniques for Urdu SA. BiLSTM-SLMFCNN outperformed the baseline DL models and achieved 83%, 79%, 83%, and 94% accuracy on the small, medium, and large-sized IMDB Urdu movie review data sets and the Urdu Customer Support data set, respectively.
https://arxiv.org/abs/2501.17175
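A sketch of one plausible way to combine a BiLSTM with a single-layer multi-filter CNN for document-level sentiment is given below; the layer ordering, filter sizes, and dimensions are assumptions, as the abstract does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMMultiFilterCNN(nn.Module):
    """One plausible BiLSTM + single-layer multi-filter CNN hybrid for
    document-level sentiment. Layer ordering, filter sizes, and dimensions are
    assumptions; only the combination of components follows the abstract."""

    def __init__(self, vocab_size=30000, emb_dim=300, hidden=128,
                 filter_sizes=(2, 3, 4), n_filters=100, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)    # pretrained Urdu embeddings would be loaded here
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden, n_filters, k) for k in filter_sizes)
        self.fc = nn.Linear(n_filters * len(filter_sizes), n_classes)

    def forward(self, token_ids):                         # (batch, seq_len)
        h, _ = self.bilstm(self.embed(token_ids))         # (batch, seq_len, 2*hidden)
        h = h.transpose(1, 2)                             # Conv1d wants (batch, channels, seq)
        pooled = [F.relu(conv(h)).max(dim=-1).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=-1))         # single multi-filter layer, then classify

tokens = torch.randint(0, 30000, (4, 200))                # a batch of tokenized documents
print(BiLSTMMultiFilterCNN()(tokens).shape)               # torch.Size([4, 2])
```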
With the advent of the COVID-19 pandemic, ultrasound imaging has emerged as a promising technique for COVID-19 detection, due to its non-invasive nature, affordability, and portability. In response, researchers have focused on developing AI-based scoring systems to provide real-time diagnostic support. However, the limited size and lack of proper annotation in publicly available ultrasound datasets pose significant challenges for training a robust AI model. This paper proposes MeDiVLAD, a novel pipeline to address the above issue for multi-level lung-ultrasound (LUS) severity scoring. In particular, we leverage self-knowledge distillation to pretrain a vision transformer (ViT) without labels and aggregate frame-level features via dual-level VLAD aggregation. We show that with minimal finetuning, MeDiVLAD outperforms conventional fully-supervised methods in both frame- and video-level scoring, while offering classification reasoning of exceptional quality. This superior performance enables key applications such as the automatic identification of critical lung pathology areas and provides a robust solution for broader medical video classification tasks.
https://arxiv.org/abs/2501.12524
The rapid advancement of video generation models has made it increasingly challenging to distinguish AI-generated videos from real ones. This issue underscores the urgent need for effective AI-generated video detectors to prevent the dissemination of false information through such videos. However, the development of high-performance generative video detectors is currently impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection. To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Cross Source and Cross Generator: the cross-source construction mitigates the interference of video content with detection, while the cross-generator split ensures diversity in video attributes between the training and test sets, preventing them from being overly similar. 2) State-of-the-Art Video Generators: the dataset includes videos from 8 state-of-the-art AI video generators, ensuring that it covers the latest advancements in the field of video generation. 3) Rich Semantics: the videos in GenVidBench are analyzed from multiple dimensions and classified into various semantic categories based on their content. This classification ensures that the dataset is not only large but also diverse, aiding in the development of more generalized and effective detection models. We conduct a comprehensive evaluation of different advanced video generators and present a challenging setting. Additionally, we present rich experimental results including advanced video classification models as baselines. With GenVidBench, researchers can efficiently develop and evaluate AI-generated video detection models. Datasets and code are available at this https URL.
https://arxiv.org/abs/2501.11340
The increasing availability of traffic videos operating on a 24/7/365 time scale has great potential to increase the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras under a 24/7/365 working protocol remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow into a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation and to enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at this https URL.
https://arxiv.org/abs/2501.10604
In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time, in an explainable and programmatic way, to connect state-of-the-art learning-based vision and language models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich, and relevant textual descriptions of videos collected from a variety of datasets, using both standard metrics (e.g., Bleu, ROUGE) and the modern LLM-as-a-Jury approach.
https://arxiv.org/abs/2501.08460
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at this https URL
https://arxiv.org/abs/2501.05453
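The pre-training objective itself is ordinary next-token prediction over visual tokens. A compact sketch (placeholder tokenizer output, vocabulary size, and model dimensions; not Toto's architecture) looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Next-token prediction over a flat sequence of discrete visual tokens.
# Vocabulary size, tokenizer, and model dimensions are placeholders, not Toto's.
vocab_size, dim, seq_len = 8192, 512, 256

embed = nn.Embedding(vocab_size, dim)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
head = nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (2, seq_len))        # visual tokens from some tokenizer
causal = torch.triu(torch.full((seq_len - 1, seq_len - 1), float("-inf")), diagonal=1)

hidden = backbone(embed(tokens[:, :-1]), mask=causal)       # each position sees only the past
logits = head(hidden)                                       # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
print(float(loss))                                          # roughly log(vocab_size) at init
```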