Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
https://arxiv.org/abs/2601.15780
Humans often experience not just a single basic emotion at a time, but rather a blend of several emotions with varying salience. Despite the importance of such blended emotions, most video-based emotion recognition approaches are designed to recognize single emotions only. The few approaches that have attempted to recognize blended emotions typically cannot assess the relative salience of the emotions within a blend. This limitation largely stems from the lack of datasets containing a substantial number of blended emotion samples annotated with relative salience. To address this shortcoming, we introduce BLEMORE, a novel dataset for multimodal (video, audio) blended emotion recognition that includes information on the relative salience of each emotion within a blend. BLEMORE comprises over 3,000 clips from 58 actors performing 6 basic emotions and 10 distinct blends, where each blend has 3 different salience configurations (50/50, 70/30, and 30/70). Using this dataset, we conduct extensive evaluations of state-of-the-art video classification approaches on two blended emotion prediction tasks: (1) predicting the presence of emotions in a given sample, and (2) predicting the relative salience of emotions in a blend. Our results show that unimodal classifiers achieve up to 29% presence accuracy and 13% salience accuracy on the validation set, while multimodal methods yield clear improvements, with ImageBind + WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy. On the held-out test set, the best models achieve 33% presence accuracy (VideoMAEv2 + HuBERT) and 18% salience accuracy (HiCMAE). In sum, the BLEMORE dataset provides a valuable resource for advancing research on emotion recognition systems that account for the complexity and significance of blended emotion expressions.
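As an illustration of the two evaluation tasks, here is a minimal numpy sketch of how presence and salience accuracy could be scored, assuming each label is the set of emotions present plus a salience split. The function names and the label encoding are hypothetical, not BLEMORE's official evaluation code.

```python
import numpy as np

# Hypothetical scoring sketch: presence accuracy counts a sample correct when
# the predicted set of emotions matches the ground truth; salience accuracy
# additionally requires the predicted split (50/50, 70/30, 30/70) to be exact.
def presence_accuracy(pred, true):
    return float(np.mean([set(p) == set(t) for p, t in zip(pred, true)]))

def salience_accuracy(pred, true):
    # each entry: (dominant, secondary, split), e.g. ("anger", "fear", "70/30")
    return float(np.mean([p == t for p, t in zip(pred, true)]))

true_e = [("anger", "fear"), ("joy",), ("sadness", "surprise")]
pred_e = [("fear", "anger"), ("joy",), ("sadness",)]
pa = presence_accuracy(pred_e, true_e)   # 2 of 3 sets match

true_s = [("anger", "fear", "70/30"), ("sadness", "surprise", "50/50")]
pred_s = [("anger", "fear", "30/70"), ("sadness", "surprise", "50/50")]
sa = salience_accuracy(pred_s, true_s)   # 1 of 2 exact
print(pa, sa)
```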
https://arxiv.org/abs/2601.13225
Human action recognition has become an important research focus in computer vision due to the wide range of applications in which it is used. 3D ResNet-based CNN models, particularly MC3, R3D, and R(2+1)D, use different convolutional filters to extract spatiotemporal features. This paper investigates the impact of reducing the knowledge captured from temporal data while increasing the resolution of the frames. To set up this experiment, we first created designs similar to the three originals, but with a dropout layer added before the final classifier. We then developed ten new versions of each of these three designs. The variants include special attention blocks within their architecture, such as the convolutional block attention module (CBAM) and temporal convolution networks (TCN), in addition to multi-headed and channel attention mechanisms, the purpose being to observe how much each of these blocks influences the performance of the restricted-temporal models. Testing all the models on UCF101 showed an accuracy of 88.98% for the variant with multi-headed attention added to the modified R(2+1)D. This paper demonstrates the significance of the missing temporal features for the performance of the newly created, increased-resolution models. The variants behaved differently in class-level accuracy, despite their enhancements contributing similarly to overall performance.
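To illustrate the kind of attention block the variants add, here is a minimal numpy sketch of squeeze-and-excitation-style channel attention over a spatiotemporal feature map. It is a generic stand-in under assumed shapes, not the paper's exact CBAM or multi-headed blocks, and the random weights stand in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation-style channel attention on a (C, T, H, W)
    spatiotemporal feature map: pool, gate, and rescale each channel."""
    squeeze = x.mean(axis=(1, 2, 3))             # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)       # reduction MLP + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # per-channel sigmoid gate
    return x * gate[:, None, None, None]

C = 8
x = rng.normal(size=(C, 4, 7, 7))        # toy clip features
w1 = 0.1 * rng.normal(size=(C // 2, C))  # random stand-ins for learned weights
w2 = 0.1 * rng.normal(size=(C, C // 2))
y = channel_attention(x, w1, w2)
print(y.shape)
```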
https://arxiv.org/abs/2601.10854
The classification of microscopy videos capturing complex cellular behaviors is crucial for understanding and quantifying the dynamics of biological processes over time. However, it remains a frontier in computer vision, requiring approaches that effectively model the shape and motion of objects without rigid boundaries, extract hierarchical spatiotemporal features from entire image sequences rather than static frames, and account for multiple objects within the field of view. To this end, we organized the Cell Behavior Video Classification Challenge (CBVCC), benchmarking 35 methods based on three approaches: classification of tracking-derived features, end-to-end deep learning architectures to directly learn spatiotemporal features from the entire video sequence without explicit cell tracking, or ensembling tracking-derived with image-derived features. We discuss the results achieved by the participants and compare the potential and limitations of each approach, serving as a basis to foster the development of computer vision methods for studying cellular dynamics.
https://arxiv.org/abs/2601.10250
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining performance similar to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE, and POPEv2, despite having only 1.6B parameters.
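The embedding-prediction objective can be sketched in a few lines of numpy: rather than token-level cross-entropy, the predictor is scored by cosine distance to the target-text embedding. This is a conceptual sketch under assumed shapes, not VL-JEPA's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def jepa_loss(predicted, target):
    """Embedding-space objective: 1 - cosine similarity between the
    predicted embedding and the (frozen) target-text embedding."""
    p, t = l2_normalize(predicted), l2_normalize(target)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))

pred = rng.normal(size=(4, 16))                 # predictor outputs, 4 videos
close = pred + 0.01 * rng.normal(size=(4, 16))  # near-perfect targets
far = rng.normal(size=(4, 16))                  # unrelated targets
print(jepa_loss(pred, close), jepa_loss(pred, far))
```

Good predictions give a loss near zero; unrelated targets give a loss near one, which is what lets the model learn semantics without modeling surface-level token variability.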
https://arxiv.org/abs/2512.10942
Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointly addresses four core face forgery detection tasks within a single model, i.e., image and video classification, spatial localization, and temporal localization. Our architecture consists of three principal components: (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations from both image and video inputs, (2) a cross-task interaction module with learnable queries that dynamically captures inter-task dependencies through attention-based reasoning, and (3) lightweight decoding heads that transform refined representations into corresponding predictions for all FFD tasks. Extensive experiments demonstrate OmniFD's advantage over task-specific models. Its unified design leverages multi-task learning to capture generalized representations across tasks, especially enabling fine-grained knowledge transfer that facilitates other tasks. For example, video classification accuracy improves by 4.63% when image data are incorporated. Furthermore, by unifying images, videos, and the four tasks within one framework, OmniFD achieves superior performance across diverse benchmarks with high efficiency and scalability, e.g., reducing model parameters by 63% and training time by 50%. It establishes a practical and generalizable solution for comprehensive face forgery detection in real-world applications. The source code is made available at this https URL.
https://arxiv.org/abs/2512.01128
Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical "rationale gap", where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.
https://arxiv.org/abs/2511.15923
AI-assisted ultrasound video diagnosis presents new opportunities to enhance the efficiency and accuracy of medical imaging analysis. However, existing research remains limited in terms of dataset diversity, diagnostic performance, and clinical applicability. In this study, we propose Auto-US, an intelligent diagnosis agent that integrates ultrasound video data with clinical diagnostic text. To support this, we constructed the CUV Dataset of 495 ultrasound videos spanning five categories and three organs, aggregated from multiple open-access sources. We developed CTU-Net, which achieves state-of-the-art performance in ultrasound video classification, reaching an accuracy of 86.73%. Furthermore, by incorporating large language models, Auto-US is capable of generating clinically meaningful diagnostic suggestions. The final diagnostic scores for each case exceeded 3 out of 5 and were validated by professional clinicians. These results demonstrate the effectiveness and clinical potential of Auto-US in real-world ultrasound applications. Code and data are available at: this https URL.
https://arxiv.org/abs/2511.07748
Spiking Neural Networks (SNNs) are a promising, energy-efficient alternative to standard Artificial Neural Networks (ANNs) and are particularly well-suited to spatio-temporal tasks such as keyword spotting and video classification. However, SNNs have a much lower arithmetic intensity than ANNs and are therefore not well-matched to standard accelerators like GPUs and TPUs. Field Programmable Gate Arrays (FPGAs) are designed for such memory-bound workloads, and here we develop a novel, fully-programmable RISC-V-based system-on-chip (FeNN-DMA), tailored to simulating SNNs on modern UltraScale+ FPGAs. We show that FeNN-DMA has comparable resource usage and energy requirements to state-of-the-art fixed-function SNN accelerators, yet it is capable of simulating much larger and more complex models. Using this functionality, we demonstrate state-of-the-art classification accuracy on the Spiking Heidelberg Digits and Neuromorphic MNIST tasks.
https://arxiv.org/abs/2511.00732
Large Vision Language Models have recently seen wide application in sports use-cases. Most of these works target a limited subset of popular sports such as soccer, cricket, and basketball, focusing on generative tasks like visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) to a niche but hugely popular dance sport: breakdance. Our results show that video encoder models continue to outperform state-of-the-art video language models on prediction tasks. We provide insights on how to choose an encoder model and a thorough analysis of the workings of a finetuned decoder model for breakdance video classification.
https://arxiv.org/abs/2510.20287
Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.
https://arxiv.org/abs/2510.14713
Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a non-differentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not input-adaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.
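A toy numpy sketch of input-adaptive frame subsampling with the Gumbel-softmax trick: selection weights depend on per-frame scores (a stand-in for the attention module), so different inputs yield different soft selections. The paper's module is learned end-to-end inside a network; this only illustrates the selection mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    """Soft sample from a categorical over frames via Gumbel noise."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel noise
    z = (logits + g) / tau
    z = z - z.max()                                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def subsample_frames(frames, scores, k, tau=0.5):
    """Select k frames as soft mixtures weighted by input-dependent scores,
    so the selection adapts to each input clip."""
    picked, logits = [], scores.astype(float).copy()
    for _ in range(k):
        w = gumbel_softmax(logits, tau)
        picked.append((w[:, None] * frames).sum(axis=0))  # soft frame pick
        logits[int(np.argmax(w))] = -1e9                  # avoid re-picking
    return np.stack(picked)

T, D, k = 16, 32, 4
frames = rng.normal(size=(T, D))
scores = rng.normal(size=T)   # stand-in for attention scores over frames
out = subsample_frames(frames, scores, k)
print(out.shape)
```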
https://arxiv.org/abs/2510.12376
Recently, pre-trained state space models have shown great potential for video classification; they sequentially compress visual tokens in videos with linear complexity, thereby improving the processing efficiency of video data while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning has been proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video, limiting the effective propagation of spatial information within a video frame and of temporal information between frames in the state compression model, as well as the extraction of discriminative information. To tackle the above issue, we propose a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatiotemporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate spatial key information within each frame. Besides, an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, our SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that our SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuning parameters.
https://arxiv.org/abs/2510.12160
Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
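One of the evaluated aggregation schemes, FedAvg, is simple enough to sketch exactly: each parameter tensor is averaged across clients, weighted by the number of local training samples, so no raw video data ever leaves a center. The dict-of-arrays model representation below is an illustrative simplification.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: average each parameter tensor across clients, weighted by
    the number of local training samples on each client."""
    total = float(sum(client_sizes))
    return {name: sum((n / total) * w[name]
                      for w, n in zip(client_weights, client_sizes))
            for name in client_weights[0]}

# Two toy clients sharing a model with one parameter tensor "w".
c1 = {"w": np.array([1.0, 2.0])}
c2 = {"w": np.array([3.0, 4.0])}
g = fedavg([c1, c2], client_sizes=[100, 300])  # client 2 has 3x the data
print(g["w"])
```

FedMedian replaces the weighted mean with a coordinate-wise median to resist outlier updates, which is one axis along which the submitted approaches differed.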
https://arxiv.org/abs/2510.04772
This article is inspired by video classification technology. If the user behavior subspace is viewed as a frame image, consecutive frame images can be viewed as a video. Following this novel idea, a model for spammer detection based on user videoization, called UVSD, is proposed. Firstly, a user2pixel algorithm for user pixelization is proposed. Considering the adversarial behavior of user stances, the user is viewed as a pixel, and the stance is quantified as the pixel's RGB. Secondly, a behavior2image algorithm is proposed for transforming the user behavior subspace into frame images. Low-rank dense vectorization of subspace user relations is performed using representation learning, while cutting and diffusion algorithms are introduced to complete the conversion into frame images. Finally, user behavior videos are constructed based on temporal features, and a video classification algorithm is applied to identify the spammers. Experiments using publicly available datasets, i.e., WEIBO and TWITTER, show an advantage of the UVSD model over state-of-the-art methods.
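The user2pixel idea can be illustrated with a tiny sketch mapping a signed stance score to an RGB value; the specific color mapping below is hypothetical, chosen only to show a stance quantified as a pixel.

```python
import numpy as np

def user2pixel(stance):
    """Hypothetical encoding: map a stance score in [-1, 1] to an RGB pixel,
    red for 'against', blue for 'support' (the paper's mapping may differ)."""
    s = float(np.clip(stance, -1.0, 1.0))
    return (int(round(255 * max(-s, 0.0))), 0, int(round(255 * max(s, 0.0))))

# A grid of such pixels per time step forms a frame; frames over time, a video.
print(user2pixel(-1.0), user2pixel(0.5))
```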
https://arxiv.org/abs/2510.06233
Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. Such methods may not provide indicators that all the involved modalities are effectively aligned. As a result, some modalities may not be aligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should provide additional information that the model fails to exploit. In this paper, we present TRIANGLE: TRI-modAl Neural Geometric LEarning, a novel similarity measure computed directly in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers or pairwise similarities. When incorporated in contrastive losses in place of cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling, while yielding interpretable alignment rationales. Extensive evaluation on three-modal tasks such as video-text and audio-text retrieval and audio-video classification demonstrates that TRIANGLE achieves state-of-the-art results across different datasets, improving on cosine-based methods by up to 9 points of Recall@1.
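The geometric core of a triangle-area similarity can be sketched with the Gram-determinant identity, which gives the area spanned by three embedding points in any dimension. In TRIANGLE's spirit, a smaller area for a matched video-audio-text triplet indicates better joint alignment; the exact normalization used in the paper may differ.

```python
import numpy as np

def triangle_area(a, b, c):
    """Area of the triangle spanned by three embedding points in R^d,
    via the Gram-determinant identity (valid in any dimension d)."""
    u, v = b - a, c - a
    gram = np.dot(u, u) * np.dot(v, v) - np.dot(u, v) ** 2
    return 0.5 * np.sqrt(max(gram, 0.0))

# A 3-4 right triangle embedded in R^5 has area 6, and collinear
# (i.e. maximally "aligned") points collapse to area 0.
a = np.zeros(5)
b = np.zeros(5); b[0] = 3.0
c = np.zeros(5); c[1] = 4.0
area = triangle_area(a, b, c)
print(area)
```

Unlike averaging the three pairwise cosine similarities, this single scalar vanishes only when all three embeddings are jointly degenerate, which is what makes it a genuinely tri-modal alignment signal.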
https://arxiv.org/abs/2509.24734
Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., 'bow', 'mount', 'shoot') that reoccur across time - forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code available at this http URL.
https://arxiv.org/abs/2509.20899
Cardiac amyloidosis (CA) is a rare cardiomyopathy, with typical abnormalities in clinical measurements from echocardiograms such as reduced global longitudinal strain of the myocardium. An alternative approach for detecting CA is via neural networks, using video classification models such as convolutional neural networks. These models process entire video clips, but provide no assurance that classification is based on clinically relevant features known to be associated with CA. An alternative paradigm for disease classification is to apply models to quantitative features such as strain, ensuring that the classification relates to clinically relevant features. Drawing inspiration from this approach, we explicitly constrain a transformer model to the anatomical region where many known CA abnormalities occur, the myocardium, embedding it into input tokens as a set of deforming points and corresponding sampled image patches. We show that our anatomical constraint can also be applied to the popular self-supervised masked autoencoder pre-training, where we propose to mask and reconstruct only anatomical patches. We show that by constraining both the transformer and the pre-training task to the myocardium, where CA imaging features are localized, we achieve increased performance on a CA classification task compared to full video transformers. Our model provides an explicit guarantee that the classification is focused on only anatomical regions of the echo, and enables us to visualize transformer attention scores over the deforming myocardium.
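The anatomically constrained masking strategy can be sketched as MAE-style random masking restricted to a set of myocardium patch indices; `anatomical_mask`, its arguments, and the index layout are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def anatomical_mask(num_patches, myocardium_idx, mask_ratio=0.75):
    """MAE-style random masking restricted to anatomical patches: only
    indices belonging to the myocardium are candidates for masking."""
    idx = np.asarray(myocardium_idx)
    n_mask = int(round(mask_ratio * len(idx)))
    masked = rng.choice(idx, size=n_mask, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked] = True
    return mask

# 196 patches per frame; suppose patches 40..79 cover the myocardium.
mask = anatomical_mask(196, myocardium_idx=range(40, 80))
print(mask.sum())
```

Because background patches are never masked or reconstructed, the pre-training signal is concentrated on the region where CA imaging features are localized.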
https://arxiv.org/abs/2509.19691
Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches integrating information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video's category. Specifically, they generally treat all content uniformly, instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. Then, a complementary weight graph assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model is state-of-the-art in hateful video classification and has strong explainability. Code is available: this https URL.
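The weighting idea, a few hateful instances dominating the video label, can be sketched as a softmax-weighted combination of instance scores; this toy version omits the dual-stream graph neural network that actually produces the features and weights.

```python
import numpy as np

def video_score(instance_scores, importance):
    """Softmax-weighted combination of instance-level hate scores, so a few
    strongly weighted instances can dominate the video-level label."""
    w = np.exp(importance - importance.max())
    w /= w.sum()
    return float(np.dot(w, instance_scores))

scores = np.array([0.05, 0.10, 0.95, 0.08])  # one strongly hateful instance
weights = np.array([0.1, 0.2, 3.0, 0.1])     # importance highlights it
weighted = video_score(scores, weights)
uniform = float(scores.mean())               # treating all content uniformly
print(weighted, uniform)
```

The uniform average dilutes the single hateful instance toward a benign score, while the importance weighting keeps it decisive, which is exactly the failure mode of uniform treatment the abstract describes.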
https://arxiv.org/abs/2509.13515
Federated learning (FL) allows multiple entities to train a shared model collaboratively. Its core, privacy-preserving principle is that participants only exchange model updates, such as gradients, and never their raw, sensitive data. This approach is fundamental for applications in domains where privacy and confidentiality are important. However, the security of this very mechanism is threatened by gradient inversion attacks, which can reverse-engineer private training data directly from the shared gradients, defeating the purpose of FL. While the impact of these attacks is known for image, text, and tabular data, their effect on video data remains an unexamined area of research. This paper presents the first analysis of video data leakage in FL using gradient inversion attacks. We evaluate two common video classification approaches: one employing pre-trained feature extractors and another that processes raw video frames with simple transformations. Our initial results indicate that the use of feature extractors offers greater resilience against gradient inversion attacks. We also demonstrate that image super-resolution techniques can enhance the frames extracted through gradient inversion attacks, enabling attackers to reconstruct higher-quality videos. Our experiments validate this across scenarios where the attacker has access to zero, one, or more reference frames from the target environment. We find that although feature extractors make attacks more challenging, leakage is still possible if the classifier lacks sufficient complexity. We, therefore, conclude that video data leakage in FL is a viable threat, and the conditions under which it occurs warrant further investigation.
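Why sharing gradients can leak inputs is easiest to see for a bias-equipped linear layer, where a well-known closed-form inversion recovers the input exactly from the shared gradients. The paper's video setting is far harder, but the mechanism is the same in spirit; the toy shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Victim: one linear layer with bias, square loss on a single private sample.
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
x_true = rng.normal(size=5)   # the private input
y = rng.normal(size=3)

r = W @ x_true + b - y        # residual of 0.5 * ||W x + b - y||^2
grad_W = np.outer(r, x_true)  # dL/dW, shared with the server in FL
grad_b = r                    # dL/db, shared with the server in FL

# Attacker: since dL/dW = dL/db * x^T, the private input follows in
# closed form from the shared gradients alone.
x_rec = grad_W.T @ grad_b / np.dot(grad_b, grad_b)
print(np.abs(x_rec - x_true).max())
```

For deeper models the attacker instead optimizes a dummy input so that its gradients match the shared ones, which is the iterative gradient inversion the paper evaluates on video frames.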
https://arxiv.org/abs/2509.09742