AI-assisted ultrasound video diagnosis presents new opportunities to enhance the efficiency and accuracy of medical imaging analysis. However, existing research remains limited in terms of dataset diversity, diagnostic performance, and clinical applicability. In this study, we propose \textbf{Auto-US}, an intelligent diagnosis agent that integrates ultrasound video data with clinical diagnostic text. To support this, we constructed the \textbf{CUV Dataset} of 495 ultrasound videos spanning five categories and three organs, aggregated from multiple open-access sources. We developed \textbf{CTU-Net}, which achieves state-of-the-art performance in ultrasound video classification, reaching an accuracy of 86.73\%. Furthermore, by incorporating large language models, Auto-US is capable of generating clinically meaningful diagnostic suggestions. The final diagnostic scores for each case exceeded 3 out of 5 and were validated by professional clinicians. These results demonstrate the effectiveness and clinical potential of Auto-US in real-world ultrasound applications. Code and data are available at: this https URL.
https://arxiv.org/abs/2511.07748
Spiking Neural Networks (SNNs) are a promising, energy-efficient alternative to standard Artificial Neural Networks (ANNs) and are particularly well-suited to spatio-temporal tasks such as keyword spotting and video classification. However, SNNs have a much lower arithmetic intensity than ANNs and are therefore not well-matched to standard accelerators like GPUs and TPUs. Field Programmable Gate Arrays (FPGAs) are designed for such memory-bound workloads, and here we develop a novel, fully-programmable RISC-V-based system-on-chip (FeNN-DMA), tailored to simulating SNNs on modern UltraScale+ FPGAs. We show that FeNN-DMA has comparable resource usage and energy requirements to state-of-the-art fixed-function SNN accelerators, yet it is capable of simulating much larger and more complex models. Using this functionality, we demonstrate state-of-the-art classification accuracy on the Spiking Heidelberg Digits and Neuromorphic MNIST tasks.
https://arxiv.org/abs/2511.00732
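The FeNN-DMA abstract above centers on simulating SNNs efficiently; as a point of reference for what such a simulation kernel computes, here is a minimal leaky integrate-and-fire (LIF) update in Python. The leak factor, threshold, and reset scheme are illustrative defaults, not the paper's configuration.

```python
import numpy as np

def lif_step(v, spikes_in, weights, alpha=0.9, v_thresh=1.0):
    """One timestep of a leaky integrate-and-fire layer.

    v         : (n_out,) membrane potentials
    spikes_in : (n_in,) binary input spikes for this timestep
    weights   : (n_out, n_in) synaptic weights
    alpha     : membrane leak factor, roughly exp(-dt/tau)
    v_thresh  : firing threshold
    """
    v = alpha * v + weights @ spikes_in                  # leak, then integrate synaptic input
    spikes_out = (v >= v_thresh).astype(np.float32)      # fire where the threshold is crossed
    v = np.where(spikes_out > 0, 0.0, v)                 # reset fired neurons to zero
    return v, spikes_out

# Toy run: 100 timesteps of sparse random input spikes.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.3, size=(16, 64))
v = np.zeros(16)
for _ in range(100):
    v, out = lif_step(v, (rng.random(64) < 0.1).astype(np.float32), w)
```

The sparse, event-driven nature of these updates is what makes the workload memory-bound rather than compute-bound, which is the paper's motivation for an FPGA design.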
Large Vision Language Models have recently seen wide application in several sports use-cases. Most of these works target a limited subset of popular sports such as soccer, cricket, and basketball, focusing on generative tasks like visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) to a very niche but hugely popular dance sport: breakdance. Our results show that video encoder models continue to outperform state-of-the-art Video Language Models on prediction tasks. We provide insights on how to choose the encoder model and a thorough analysis of the workings of a finetuned decoder model for breakdance video classification.
https://arxiv.org/abs/2510.20287
Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.
https://arxiv.org/abs/2510.14713
Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a non-differentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not input-adaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.
https://arxiv.org/abs/2510.12376
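To make the input-adaptive subsampling idea above concrete, here is a rough sketch of an attention-guided frame selector: a small scoring head ranks frames per input and only the top-k are kept. The scorer architecture and the hard top-k are illustrative assumptions, not the authors' exact module; a real training setup would need a differentiable relaxation (e.g. Gumbel-softmax or straight-through estimation) of the selection step.

```python
import torch
import torch.nn as nn

class AttentionSubsampler(nn.Module):
    """Scores frames with a small head and keeps the top-k per input (inference-style sketch)."""
    def __init__(self, feat_dim: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim), per-frame features
        scores = self.scorer(frames).squeeze(-1)                       # (batch, num_frames)
        idx = scores.topk(self.k, dim=1).indices.sort(dim=1).values    # keep temporal order
        return torch.gather(frames, 1, idx.unsqueeze(-1).expand(-1, -1, frames.size(-1)))

x = torch.randn(2, 40, 256)                  # 40 candidate frames per clip
kept = AttentionSubsampler(256, k=8)(x)      # (2, 8, 256): an input-adaptive subset
```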
Recently, pre-trained state space models have shown great potential for video classification: they sequentially compress visual tokens in videos with linear complexity, thereby improving the processing efficiency of video data while maintaining high performance. To apply such powerful pre-trained models to downstream tasks, prompt learning has been proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video, limiting the effective propagation of spatial information within a video frame and of temporal information between frames in the state compression model, as well as the extraction of discriminative information. To tackle this issue, we propose a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatio-temporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate spatial key information within each frame, and an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, our SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that our SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the fine-tuning parameter overhead.
https://arxiv.org/abs/2510.12160
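The SSP description above is high-level; as a hedged illustration of the general mechanism it builds on (learnable prompt tokens attached to frozen-backbone token sequences, with some information shared across frames), consider the sketch below. The intra-/inter-frame aggregation is reduced to a mean-based update and is an assumption for illustration only, not the paper's IFG/IFS modules.

```python
import torch
import torch.nn as nn

class FramePrompts(nn.Module):
    """Learnable per-frame prompt tokens plus a shared inter-frame prompt updated
    from frame summaries (a simplified stand-in for intra-/inter-frame prompting)."""
    def __init__(self, dim: int, prompts_per_frame: int = 4):
        super().__init__()
        self.intra = nn.Parameter(torch.randn(prompts_per_frame, dim) * 0.02)
        self.inter = nn.Parameter(torch.randn(1, dim) * 0.02)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames, tokens, dim) from a frozen backbone
        b, f, t, d = frame_tokens.shape
        intra = self.intra.expand(b, f, -1, -1)                               # per-frame prompts
        inter = self.inter.expand(b, f, 1, -1) + frame_tokens.mean(dim=2, keepdim=True)
        return torch.cat([inter, intra, frame_tokens], dim=2)                 # prompts prepended

tokens = torch.randn(2, 8, 49, 384)
augmented = FramePrompts(384)(tokens)   # (2, 8, 1 + 4 + 49, 384)
```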
Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
https://arxiv.org/abs/2510.04772
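Of the aggregation schemes the challenge compared, FedAvg is the simplest; a minimal sketch of its weighted parameter averaging is shown below. This is the generic FedAvg rule, not the challenge's exact training loop, and the client objects in the commented usage are hypothetical.

```python
from typing import Dict, List
import torch

def fedavg(client_states: List[Dict[str, torch.Tensor]],
           client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Aggregate client model states, weighting each by its local dataset size."""
    total = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        global_state[name] = sum(
            (n / total) * state[name].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Each round: clients fine-tune locally, send their state_dicts, the server averages them.
# new_global = fedavg([c1.state_dict(), c2.state_dict()], client_sizes=[120, 180])
```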
This article is inspired by video classification technology: if a user behavior subspace is viewed as a frame image, then consecutive frame images form a video. Following this novel idea, a model for spammer detection based on user videoization, called UVSD, is proposed. Firstly, a user2pixel algorithm for user pixelization is proposed. Considering the adversarial behavior of user stances, each user is viewed as a pixel and the stance is quantified as the pixel's RGB value. Secondly, a behavior2image algorithm is proposed for transforming user behavior subspaces into frame images. Low-rank dense vectorization of subspace user relations is performed using representation learning, while cutting and diffusion algorithms are introduced to complete the frame imageization. Finally, user behavior videos are constructed based on temporal features, and a video classification algorithm is then applied to identify spammers. Experiments using publicly available datasets, i.e., WEIBO and TWITTER, show an advantage of the UVSD model over state-of-the-art methods.
https://arxiv.org/abs/2510.06233
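A hedged sketch of the user-videoization idea described above: each user becomes a pixel whose RGB encodes a quantified stance, and one behavior subspace becomes one frame. The stance-to-RGB mapping, the square frame layout, and the stance range are illustrative assumptions, not the paper's exact user2pixel/behavior2image algorithms.

```python
import numpy as np

def user2pixel(stance: float) -> np.ndarray:
    """Map a stance score in [-1, 1] to an RGB pixel (red = against, blue = for).
    The colour coding here is an illustrative choice."""
    s = (np.clip(stance, -1.0, 1.0) + 1.0) / 2.0
    return np.array([255 * (1 - s), 0, 255 * s], dtype=np.uint8)

def behavior2image(user_stances: np.ndarray, side: int) -> np.ndarray:
    """Lay one behavior subspace (a vector of user stances) out as a side x side frame."""
    pixels = np.stack([user2pixel(s) for s in user_stances[: side * side]])
    return pixels.reshape(side, side, 3)

# A user behavior "video" is then the stack of frames over consecutive time windows,
# which a standard video classifier can consume.
frames = np.stack([behavior2image(np.random.uniform(-1, 1, 64), side=8) for _ in range(16)])
print(frames.shape)  # (16, 8, 8, 3)
```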
Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. Such methods may not provide indicators that all the involved modalities are effectively aligned. As a result, some modalities may not be aligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should provide additional information that the model fails to exploit. In this paper, we present TRIANGLE: TRI-modAl Neural Geometric LEarning, a novel similarity measure that is computed directly in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers or pairwise similarities. When incorporated into contrastive losses in place of cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling while yielding interpretable alignment rationales. Extensive evaluation on tri-modal tasks such as video-text and audio-text retrieval and audio-video classification demonstrates that TRIANGLE achieves state-of-the-art results across different datasets, improving the performance of cosine-based methods by up to 9 points of Recall@1.
https://arxiv.org/abs/2509.24734
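The core quantity in the TRIANGLE abstract is the area of the triangle spanned by three modality embeddings. One standard way to compute that area in arbitrary dimension is via the Gram determinant of two edge vectors, sketched below; the normalization and the way the area would enter a contrastive loss are assumptions, and the paper's exact formulation may differ.

```python
import torch

def triangle_area(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Area of the triangle with vertices a, b, c in R^d, batched over dim 0.

    Uses area = 0.5 * sqrt(|u|^2 |v|^2 - (u . v)^2) with u = b - a, v = c - a,
    i.e. the Gram-determinant form of the cross-product magnitude.
    """
    u, v = b - a, c - a
    uu = (u * u).sum(-1)
    vv = (v * v).sum(-1)
    uv = (u * v).sum(-1)
    return 0.5 * torch.sqrt(torch.clamp(uu * vv - uv * uv, min=1e-12))

# Smaller area suggests tighter three-way alignment, so a contrastive loss could use,
# for example, sim = -triangle_area(video_emb, text_emb, audio_emb) in place of cosine.
v, t, a = (torch.nn.functional.normalize(torch.randn(4, 512), dim=-1) for _ in range(3))
print(triangle_area(v, t, a))
```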
Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., 'bow', 'mount', 'shoot') that reoccur across time - forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code available at this http URL.
https://arxiv.org/abs/2509.20899
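A minimal concept-bottleneck head for video helps make the three perspectives in the MoTIF abstract concrete: per-frame concept scores give local relevance, their profile over time gives the temporal view, and the classifier weights over pooled scores give global importance. The mean pooling and linear layers below are simplifications for illustration, not MoTIF's transformer-based design.

```python
import torch
import torch.nn as nn

class VideoConceptBottleneck(nn.Module):
    def __init__(self, feat_dim: int, num_concepts: int, num_classes: int):
        super().__init__()
        self.to_concepts = nn.Linear(feat_dim, num_concepts)   # interpretable bottleneck
        self.classifier = nn.Linear(num_concepts, num_classes)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, frames, feat_dim)
        local = torch.sigmoid(self.to_concepts(frame_feats))   # (B, T, C): per-frame concept scores
        global_scores = local.mean(dim=1)                      # (B, C): video-level concept activation
        logits = self.classifier(global_scores)
        return logits, local, global_scores

model = VideoConceptBottleneck(feat_dim=768, num_concepts=32, num_classes=10)
logits, local, global_scores = model(torch.randn(2, 16, 768))
# local[:, :, k] traces concept k over time; classifier.weight gives global concept importance.
```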
Cardiac amyloidosis (CA) is a rare cardiomyopathy, with typical abnormalities in clinical measurements from echocardiograms such as reduced global longitudinal strain of the myocardium. An alternative approach for detecting CA is via neural networks, using video classification models such as convolutional neural networks. These models process entire video clips, but provide no assurance that classification is based on clinically relevant features known to be associated with CA. An alternative paradigm for disease classification is to apply models to quantitative features such as strain, ensuring that the classification relates to clinically relevant features. Drawing inspiration from this approach, we explicitly constrain a transformer model to the anatomical region where many known CA abnormalities occur -- the myocardium, which we embed as a set of deforming points and corresponding sampled image patches into input tokens. We show that our anatomical constraint can also be applied to the popular self-supervised learning masked autoencoder pre-training, where we propose to mask and reconstruct only anatomical patches. We show that by constraining both the transformer and pre-training task to the myocardium where CA imaging features are localized, we achieve increased performance on a CA classification task compared to full video transformers. Our model provides an explicit guarantee that the classification is focused on only anatomical regions of the echo, and enables us to visualize transformer attention scores over the deforming myocardium.
https://arxiv.org/abs/2509.19691
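A hedged sketch of the anatomical constraint described above: image patches sampled at tracked (deforming) myocardial points are embedded together with their coordinates and fed to a standard transformer encoder, so attention only ever sees anatomical tokens. Patch size, the coordinate encoding, and the encoder configuration are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MyocardiumTokenizer(nn.Module):
    """Embed (patch, point-coordinate) pairs sampled along the myocardium into tokens."""
    def __init__(self, patch_px: int = 16, dim: int = 256):
        super().__init__()
        self.patch_embed = nn.Linear(patch_px * patch_px, dim)
        self.coord_embed = nn.Linear(3, dim)          # (x, y, t) of each deforming point
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)

    def forward(self, patches: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_points, patch_px*patch_px), coords: (batch, n_points, 3)
        tokens = self.patch_embed(patches) + self.coord_embed(coords)
        return self.encoder(tokens)                   # attention is restricted to anatomical tokens

out = MyocardiumTokenizer()(torch.randn(2, 64, 256), torch.rand(2, 64, 3))
```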
Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches integrating information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video's category. Specifically, they generally treat all content uniformly, instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. Then, a complementary weight graph assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model is state-of-the-art in hateful video classification and has strong explainability. Code is available: this https URL.
https://arxiv.org/abs/2509.13515
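To make the instance-weighting idea concrete, here is a simplified, non-graph version: per-instance features receive importance weights from a small scoring head, and the weighted pooling produces the video-level prediction while the weights highlight the offending instances. The actual model uses dual graph streams; this is only the aggregation skeleton, included as an assumption for illustration.

```python
import torch
import torch.nn as nn

class WeightedInstanceClassifier(nn.Module):
    """Score instances, softmax the scores into importance weights, classify the pooled feature."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, instances: torch.Tensor):
        # instances: (batch, n_instances, dim), e.g. fused multimodal segment features
        weights = torch.softmax(self.scorer(instances).squeeze(-1), dim=1)   # (B, N)
        pooled = (weights.unsqueeze(-1) * instances).sum(dim=1)              # (B, dim)
        return self.head(pooled), weights    # weights indicate which instances drove the label

logits, w = WeightedInstanceClassifier(512, 2)(torch.randn(3, 10, 512))
```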
Federated learning (FL) allows multiple entities to train a shared model collaboratively. Its core, privacy-preserving principle is that participants only exchange model updates, such as gradients, and never their raw, sensitive data. This approach is fundamental for applications in domains where privacy and confidentiality are important. However, the security of this very mechanism is threatened by gradient inversion attacks, which can reverse-engineer private training data directly from the shared gradients, defeating the purpose of FL. While the impact of these attacks is known for image, text, and tabular data, their effect on video data remains an unexamined area of research. This paper presents the first analysis of video data leakage in FL using gradient inversion attacks. We evaluate two common video classification approaches: one employing pre-trained feature extractors and another that processes raw video frames with simple transformations. Our initial results indicate that the use of feature extractors offers greater resilience against gradient inversion attacks. We also demonstrate that image super-resolution techniques can enhance the frames extracted through gradient inversion attacks, enabling attackers to reconstruct higher-quality videos. Our experiments validate this across scenarios where the attacker has access to zero, one, or more reference frames from the target environment. We find that although feature extractors make attacks more challenging, leakage is still possible if the classifier lacks sufficient complexity. We, therefore, conclude that video data leakage in FL is a viable threat, and the conditions under which it occurs warrant further investigation.
https://arxiv.org/abs/2509.09742
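For readers unfamiliar with the attack itself, the classic gradient-matching formulation (in the style of "Deep Leakage from Gradients") is sketched below: a dummy input and label are optimized until their gradients match the gradients shared in FL. This is the generic attack skeleton, not the paper's video-specific pipeline or super-resolution step.

```python
import torch
import torch.nn.functional as F

def gradient_inversion(model, shared_grads, input_shape, num_classes, steps=200, lr=0.1):
    """Recover a training example by matching gradients of a dummy (x, y) to shared_grads."""
    dummy_x = torch.randn(input_shape, requires_grad=True)        # e.g. (1, 3, 224, 224)
    dummy_y = torch.randn(1, num_classes, requires_grad=True)     # soft label, optimized too
    opt = torch.optim.Adam([dummy_x, dummy_y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(dummy_x), F.softmax(dummy_y, dim=-1))
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        match = sum(((g - sg) ** 2).sum() for g, sg in zip(grads, shared_grads))
        match.backward()
        opt.step()
    return dummy_x.detach(), dummy_y.detach()

# Usage (hypothetical): shared_grads would be the gradients a client uploads for one batch.
# x_rec, y_rec = gradient_inversion(victim_model, shared_grads, (1, 3, 224, 224), num_classes=10)
```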
The rapid growth of visual content consumption across platforms necessitates automated video classification for age-suitability standards like the MPAA rating system (G, PG, PG-13, R). Traditional methods struggle with large labeled data requirements, poor generalization, and inefficient feature learning. To address these challenges, we employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. By combining CNNs for spatial features, LSTMs for temporal modeling, and attention mechanisms for dynamic frame prioritization, the model excels in fine-grained borderline distinctions, such as differentiating PG-13 and R-rated content. We evaluate the model's performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture. To ensure practical application, the model is deployed as a web application for real-time MPAA rating classification, offering an efficient solution for automated content compliance across streaming platforms.
https://arxiv.org/abs/2509.06826
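Of the three contrastive losses compared in the abstract, NT-Xent is the most widely used; a compact batched version is sketched here. This is the generic SimCLR-style formulation with temperature as a free parameter, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent over a batch of positive pairs (z1[i], z2[i]); all other samples are negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d)
    sim = z @ z.t() / temperature                             # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim.masked_fill_(mask, float('-inf'))                     # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```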
Video processing is generally divided into two main categories: processing of the entire video, which typically yields optimal classification outcomes, and real-time processing, where the objective is to make a decision as promptly as possible. Although the models dedicated to processing entire videos are typically well-defined and clearly presented in the literature, this is not the case for online processing, where a plethora of hand-devised methods exist. To address this issue, we present PrAViC, a novel, unified, and theoretically-based adaptation framework for tackling the online classification problem in video data. The initial phase of our study establishes a mathematical background for the classification of sequential data with the potential to make a decision at an early stage. This allows us to construct a natural function that encourages the model to return a result much faster. The subsequent phase presents a straightforward and readily implementable method for adapting offline models to the online setting using recurrent operations. Finally, PrAViC is evaluated against existing state-of-the-art offline and online models and datasets, showing that it enables the network to significantly reduce the time required to reach classification decisions while maintaining, or even enhancing, accuracy.
https://arxiv.org/abs/2406.11443
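PrAViC's exact early-decision objective is not reproduced here; as a hedged illustration of the general idea (a recurrent classifier that may halt early, trained with a penalty on the expected exit time), consider the following sketch. The GRU head, the halting parameterization, and the penalty weight are all assumptions.

```python
import torch
import torch.nn as nn

class OnlineClassifier(nn.Module):
    """GRU classifier that emits class logits and a halting probability at every frame."""
    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_classes)
        self.halt = nn.Linear(hidden, 1)

    def forward(self, frames: torch.Tensor):
        h, _ = self.rnn(frames)                                # (B, T, hidden)
        return self.cls(h), torch.sigmoid(self.halt(h)).squeeze(-1)

def early_exit_penalty(halt_probs: torch.Tensor) -> torch.Tensor:
    """Expected exit step under per-frame halting probabilities (lower = earlier decisions)."""
    B, T = halt_probs.shape
    not_halted = torch.cumprod(1 - halt_probs, dim=1)
    survive = torch.cat([torch.ones(B, 1), not_halted[:, :-1]], dim=1)
    exit_dist = survive * halt_probs                           # weight on exiting at step t
    steps = torch.arange(1, T + 1, dtype=halt_probs.dtype)
    return (exit_dist * steps).sum(dim=1).mean()

logits, halts = OnlineClassifier(512, 5)(torch.randn(2, 30, 512))
loss = nn.functional.cross_entropy(logits[:, -1], torch.tensor([1, 3])) + 0.1 * early_exit_penalty(halts)
```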
The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.
https://arxiv.org/abs/2508.08141
The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at this https URL.
https://arxiv.org/abs/2508.06382
The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrated that time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at this https URL.
https://arxiv.org/abs/2508.04900
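The trimming step described above is straightforward in practice; a minimal sketch using ffmpeg from Python is shown below. The file names and the (start, end) annotation format are hypothetical, and stream-copy cutting is only keyframe-accurate.

```python
import subprocess
from pathlib import Path

def trim_hateful_segments(video: str, segments: list, out_dir: str) -> None:
    """Cut each annotated (start, end) span, in seconds, into its own clip via ffmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, (start, end) in enumerate(segments):
        out = Path(out_dir) / f"{Path(video).stem}_seg{i:02d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", video, "-ss", str(start), "-to", str(end),
             "-c", "copy", str(out)],
            check=True,
        )

# Hypothetical annotation: the hateful spans of one video, in seconds.
# trim_hateful_segments("video_0123.mp4", [(12.0, 18.5), (40.2, 55.0)], "trimmed/")
```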
Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or as one of five Offensive categories (Hateful, Insulting, Sexual, Violence, Self-Harm), along with explicit target-victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff's alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at this https URL.
https://arxiv.org/abs/2508.01712
Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at this https URL.
https://arxiv.org/abs/2507.14959
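A hedged sketch of the per-frame adapter selection described in the Polymorph abstract: given the currently active labels, choose a small set of class-subset adapters that covers them, then apply their low-rank deltas on top of the shared backbone weights. The greedy cover, the additive composition of deltas, and the adapter names are illustrative assumptions.

```python
import torch
from typing import Dict, List, Set, Tuple

def select_adapters(active_labels: Set[int], adapter_classes: Dict[str, Set[int]]) -> List[str]:
    """Greedy set cover: pick adapters until every active label is covered."""
    chosen, remaining = [], set(active_labels)
    while remaining:
        name = max(adapter_classes, key=lambda a: len(adapter_classes[a] & remaining))
        if not adapter_classes[name] & remaining:
            break                                   # no adapter covers the rest
        chosen.append(name)
        remaining -= adapter_classes[name]
    return chosen

def compose_lora(base_w: torch.Tensor, loras: List[Tuple[torch.Tensor, torch.Tensor]]) -> torch.Tensor:
    """Apply the selected low-rank updates: W = W0 + sum_i B_i @ A_i."""
    return base_w + sum(B @ A for B, A in loras)

# Hypothetical adapter bank: each adapter specializes in a co-occurring label subset.
adapters = {"people": {0, 1}, "vehicles": {2, 3}, "animals": {4}}
print(select_adapters({1, 3}, adapters))            # e.g. ['people', 'vehicles']
W0 = torch.randn(512, 512)
lora_bank = {n: (torch.randn(512, 8) * 0.01, torch.randn(8, 512) * 0.01) for n in adapters}
W = compose_lora(W0, [lora_bank[n] for n in select_adapters({1, 3}, adapters)])
```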