It's no secret that video has become the primary way we share information online. That's why demand has surged for algorithms that can analyze and understand video content, a trend that will only continue as video dominates the digital landscape. These algorithms extract and classify relevant features from video and use them to describe the events and objects it contains. Deep neural networks have shown encouraging results in feature extraction and video description. This paper explores the spatiotemporal features found in videos and recent advances in deep neural networks for video understanding. We review the main trends in video understanding models and their architectures, the main open problems, and some of the solutions proposed for them. We also review and compare the major video understanding and action recognition datasets.
https://arxiv.org/abs/2502.07277
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
https://arxiv.org/abs/2502.07001
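As a rough illustration of the probing setup described above, the sketch below assumes a generic PyTorch denoiser; the call signature `model(noisy, t)`, the noise schedule, and the feature dimensions are placeholders, not the paper's actual models or protocol. It adds noise at a chosen timestep, grabs intermediate activations with a forward hook, and pools them for a linear probe.

```python
# Hedged sketch: probing intermediate diffusion features with a linear head.
import torch
import torch.nn as nn

def extract_features(model: nn.Module, layer: nn.Module, x: torch.Tensor,
                     t: int, num_timesteps: int = 1000) -> torch.Tensor:
    """Noise the input at timestep t, run the denoiser, and capture the
    activations of `layer` via a forward hook (assumes the layer returns a tensor)."""
    feats = {}
    handle = layer.register_forward_hook(
        lambda m, inp, out: feats.update(h=out.detach()))
    alpha = 1.0 - t / num_timesteps          # toy linear schedule, for illustration
    noisy = alpha ** 0.5 * x + (1 - alpha) ** 0.5 * torch.randn_like(x)
    with torch.no_grad():
        model(noisy, torch.tensor([t]))      # assumed signature: (sample, timestep)
    handle.remove()
    h = feats["h"]
    return h.flatten(2).mean(-1)             # global-average-pool spatial/temporal dims

# Downstream, a frozen-feature linear probe, e.g. for action recognition:
probe = nn.Linear(1024, 400)                 # feature dim and class count are placeholders
```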
Human-In-The-Loop (HITL) frameworks are integral to many real-world computer vision systems, enabling human operators to make informed decisions with AI assistance. Conformal Predictions (CP), which provide label sets with rigorous guarantees on ground truth inclusion probabilities, have recently gained traction as a valuable tool in HITL settings. One key application area is video surveillance, closely associated with Human Action Recognition (HAR). This study explores the application of CP on top of state-of-the-art HAR methods that utilize extensively pre-trained Vision-Language Models (VLMs). Our findings reveal that CP can significantly reduce the average number of candidate classes without modifying the underlying VLM. However, these reductions often result in distributions with long tails. To address this, we introduce a method based on tuning the temperature parameter of the VLMs to minimize these tails without requiring additional calibration data. Our code is available on GitHub at this https URL.
https://arxiv.org/abs/2502.06631
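For readers unfamiliar with conformal prediction, here is a minimal sketch of split CP layered on softmax scores, with a temperature applied to the logits. The nonconformity score and calibration recipe below are the textbook split-CP ones, not necessarily the exact variant or tuning objective used in the paper.

```python
# Hedged sketch: split conformal prediction over (possibly temperature-scaled) logits.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def conformal_sets(cal_logits, cal_labels, test_logits, alpha=0.1, T=1.0):
    """Prediction sets with ~(1 - alpha) coverage, score = 1 - p(true class)."""
    p_cal = softmax(cal_logits, T)
    scores = 1.0 - p_cal[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    q_hat = np.quantile(scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0),
                        method="higher")
    p_test = softmax(test_logits, T)
    return [np.flatnonzero(1.0 - p <= q_hat) for p in p_test]
```

Sweeping T and picking the value that shrinks the largest sets (e.g. a high percentile of set sizes) is one way to trim the long tail without collecting new calibration data, since the same logits are reused at every temperature.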
Transformers have demonstrated remarkable performance in skeleton-based human action recognition, yet their quadratic computational complexity remains a bottleneck for real-world applications. To mitigate this, linear attention mechanisms have been explored but struggle to capture the hierarchical structure of skeleton data. Meanwhile, the Poincaré model, as a typical hyperbolic geometry, offers a powerful framework for modeling hierarchical structures but lacks well-defined operations for existing mainstream linear attention. In this paper, we propose HyLiFormer, a novel hyperbolic linear attention Transformer tailored for skeleton-based action recognition. Our approach incorporates a Hyperbolic Transformation with Curvatures (HTC) module to map skeleton data into hyperbolic space and a Hyperbolic Linear Attention (HLA) module for efficient long-range dependency modeling. Theoretical analysis and extensive experiments on NTU RGB+D and NTU RGB+D 120 datasets demonstrate that HyLiFormer significantly reduces computational complexity while preserving model accuracy, making it a promising solution for efficiency-critical applications.
https://arxiv.org/abs/2502.05869
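The two ingredients named above can be illustrated separately: an exponential map onto the Poincaré ball and a kernel feature-map linear attention whose cost is linear in sequence length. This is only a sketch of the general idea, not the paper's HTC/HLA operations; curvature, feature map, and shapes are illustrative.

```python
# Hedged sketch: Poincare-ball projection plus O(N) kernel linear attention.
import torch
import torch.nn.functional as F

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Map tangent vectors at the origin onto the Poincare ball of curvature c."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)

def linear_attention(q, k, v):
    """Linear-complexity attention with the elu(x)+1 feature map (B x N x D)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)            # sum_j phi(k_j) v_j^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

# e.g. project per-joint tokens into hyperbolic space, then attend:
x = torch.randn(2, 25 * 64, 32)                        # (batch, joints*frames, dim)
out = linear_attention(expmap0(x), expmap0(x), x)
```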
We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, and object locations, and are primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.
https://arxiv.org/abs/2502.04144
Action recognition in dark, low-light (under-exposed) or noisy videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes MD-BERT, a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames to address these challenges. We introduce the Dynamic Feature Fusion (DFF) module, extending existing attentional fusion methods to a three-stream setting, thereby capturing fine-grained and global contextual information across different brightness and contrast enhancements. The fused spatiotemporal features are then processed by a BERT-based temporal model, which leverages its bidirectional self-attention to effectively capture long-range dependencies and contextual relationships across frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods, establishing a new state-of-the-art performance. Ablation studies further highlight the individual contributions of each input stream and the effectiveness of the proposed DFF and BERT modules. The official website of this work is available at: this https URL
https://arxiv.org/abs/2502.03724
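The two complementary pre-processing streams named in the abstract are standard image operations; a minimal sketch of applying them to a dark frame is shown below (the fusion module and BERT head are not shown, and the synthetic frame and gamma value are placeholders).

```python
# Hedged sketch: gamma correction and luminance histogram equalization as input streams.
import cv2
import numpy as np

def gamma_correct(frame: np.ndarray, gamma: float = 0.45) -> np.ndarray:
    """Brighten a uint8 BGR frame with a gamma look-up table (gamma < 1 brightens)."""
    lut = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    return cv2.LUT(frame, lut)

def hist_equalize(frame: np.ndarray) -> np.ndarray:
    """Equalize the luminance channel only, leaving chroma untouched."""
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    ycrcb[..., 0] = cv2.equalizeHist(ycrcb[..., 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

frame = np.full((224, 224, 3), 20, dtype=np.uint8)     # stand-in for a dark frame
streams = [frame, gamma_correct(frame), hist_equalize(frame)]  # three input streams
```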
Contrastive language-image pretraining (CLIP) has significantly advanced image-based vision learning. A pressing question subsequently arises: how can we effectively adapt CLIP to the video domain? Recent studies have focused on adjusting either the textual or the visual branch of CLIP for action recognition. However, we argue that adaptations of both branches are crucial. In this paper, we propose CLAVER: a Contrastive Language-Action Video Learner, designed to shift CLIP's focus from the alignment of static visual objects and concrete nouns to the alignment of dynamic action behaviors and abstract verbs. Specifically, we introduce a novel Kronecker mask attention for temporal modeling. Our tailored Kronecker mask offers three benefits: 1) it expands the temporal receptive field for each token, 2) it serves as an effective spatiotemporal heterogeneity inductive bias, mitigating the issue of spatiotemporal homogenization, and 3) it can be seamlessly plugged into transformer-based models. Regarding the textual branch, we leverage large language models to generate diverse, sentence-level and semantically rich interpretive prompts of actions, which shift the model's focus towards verb comprehension. Extensive experiments on various benchmarks and learning scenarios demonstrate the superiority and generality of our approach. The code will be available soon.
https://arxiv.org/abs/2502.03549
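To make the Kronecker-structured mask concrete, here is a toy construction: a temporal band pattern Kronecker-multiplied with a full spatial block, giving every token a widened temporal receptive field. This is one plausible reading of the idea for illustration only, not CLAVER's actual mask; the frame/patch counts and band width are made up.

```python
# Hedged sketch: a Kronecker-structured spatio-temporal attention mask.
import torch

T, S, window = 8, 49, 2                                  # frames, patches per frame, band width
t_idx = torch.arange(T)
temporal = ((t_idx[:, None] - t_idx[None, :]).abs() <= window).float()  # T x T band
spatial = torch.ones(S, S)                               # full spatial attention
mask = torch.kron(temporal, spatial).bool()              # (T*S) x (T*S), frame-major tokens

# Use as an additive bias inside any transformer attention layer:
attn_bias = torch.zeros(T * S, T * S).masked_fill(~mask, float("-inf"))
```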
The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
https://arxiv.org/abs/2502.03459
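The skeleton-language alignment described above is, at its core, CLIP-style contrastive training between a skeleton encoder and text embeddings. The sketch below is that generic symmetric InfoNCE objective, not the authors' SkeletonCLIP implementation; encoders and the temperature are assumptions.

```python
# Hedged sketch: CLIP-style contrastive loss between skeleton and text embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(skel_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched (skeleton, caption) pairs in a batch."""
    skel = F.normalize(skel_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = skel @ text.t() / temperature               # B x B similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```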
Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both biases. We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to the sampled static clip (where all the frames are the same) and aims to make its class probabilities uniform using a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss for regularization against the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 protocol for bias evaluation which provides a shortcut to predict actions and does not provide an accurate measure of the debiasing capability of a model. We address this issue by proposing more fine-grained segmentation boundaries for the actor, where our method also outperforms existing approaches. Project Page: this https URL
https://arxiv.org/abs/2502.00156
Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and outperforms state-of-the-art methods, such as ActionFormer and WiFiTAD. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation models pre-training, synthetic data generation, and more.
https://arxiv.org/abs/2501.19034
In real-world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Besides, the recent advent of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired works leveraging these models to enhance adaptation and generalization performances or adapting them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up-to-date literature at this https URL.
https://arxiv.org/abs/2501.18592
This paper presents the first-rank solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at the International Conference on Pattern Recognition (ICPR) 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes, collected from multi-modal sources. The proposed approach is built upon the Temporal Shift Module (TSM), a technique aimed at efficiently capturing temporal dynamics in video data, incorporating multiple data input types. Our strategy included transfer learning to leverage pre-trained models, followed by meticulous fine-tuning on the challenge's specific dataset to optimize performance for the 20 action classes. We carefully selected a backbone network to balance computational efficiency and recognition accuracy, and further refined the model using an ensemble technique that integrates outputs from different modalities. This ensemble approach proved crucial in boosting the overall performance. Our solution achieved a perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across 20 classes. Our code is available online at this https URL.
https://arxiv.org/abs/2501.17550
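The late-fusion ensemble mentioned above amounts to combining per-modality class scores; a minimal sketch is below. Equal weighting and the softmax-averaging scheme are illustrative assumptions, not the winning entry's exact recipe.

```python
# Hedged sketch: averaging per-modality softmax outputs and taking the argmax.
import numpy as np

def ensemble_predict(modality_logits, weights=None):
    """modality_logits: list of (N, C) score arrays, one per modality."""
    probs = []
    for z in modality_logits:
        e = np.exp(z - z.max(axis=1, keepdims=True))
        probs.append(e / e.sum(axis=1, keepdims=True))
    weights = weights or [1.0 / len(probs)] * len(probs)
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(axis=1)
```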
We present RASO, a foundation model designed to Recognize Any Surgical Object, offering robust open-set recognition capabilities across a broad range of surgical procedures and object classes, in both surgical images and videos. RASO leverages a novel weakly-supervised learning framework that generates tag-image-text pairs automatically from large-scale unannotated surgical lecture videos, significantly reducing the need for manual annotations. Our scalable data generation pipeline covers 2,200 surgical procedures and produces 3.6 million tag annotations across 2,066 unique surgical tags. Our experiments show that RASO achieves improvements of 2.9 mAP, 4.5 mAP, 10.6 mAP, and 7.2 mAP on four standard surgical benchmarks respectively in zero-shot settings, and surpasses state-of-the-art models in supervised surgical action recognition tasks. We will open-source our code, model, and dataset to facilitate further research.
https://arxiv.org/abs/2501.15326
Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, and surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or the movements of other people. Additionally, while current Transformer-based pose estimation methods have demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we try to tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, constraining the model to focus solely on regions centered on the target person's body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves 84.8 mean Average Precision (mAP) on the challenging wrist joint, significantly outperforming the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.
https://arxiv.org/abs/2501.14439
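To give a feel for the coarse-to-fine token refinement idea, the toy sketch below keeps only visual tokens whose patch falls inside a person region, then inside keypoint neighbourhoods. It is purely illustrative: the paper's bilayer module is learned, and the grid size, box, and keypoint windows here are made up.

```python
# Hedged sketch: coarse-to-fine selection of visual tokens around a person and keypoints.
import torch

def select_tokens(tokens: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """tokens: (H*W, D); keep_mask: (H, W) bool -> kept tokens."""
    return tokens[keep_mask.flatten()]

H = W = 24
tokens = torch.randn(H * W, 256)
person_mask = torch.zeros(H, W, dtype=torch.bool)
person_mask[4:20, 8:18] = True                           # coarse stage: person box
coarse = select_tokens(tokens, person_mask)

kpt_mask = torch.zeros(H, W, dtype=torch.bool)
for y, x in [(6, 12), (10, 9), (10, 16)]:                # fine stage: keypoint windows
    kpt_mask[max(y - 2, 0):y + 3, max(x - 2, 0):x + 3] = True
fine = select_tokens(tokens, person_mask & kpt_mask)
```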
Recent advancements in multi-view action recognition have largely relied on Transformer-based models. While effective and adaptable, these models often require substantial computational resources, especially in scenarios with multiple views and multiple temporal sequences. Addressing this limitation, this paper introduces the MV-GMN model, a state-space model specifically designed to efficiently aggregate multi-modal data (RGB and skeleton), multi-view perspectives, and multi-temporal information for action recognition with reduced computational complexity. The MV-GMN model employs an innovative Multi-View Graph Mamba network comprising a series of MV-GMN blocks. Each block includes a proposed Bidirectional State Space Block and a GCN module. The Bidirectional State Space Block introduces four scanning strategies, including view-prioritized and time-prioritized approaches. The GCN module leverages rule-based and KNN-based methods to construct the graph network, effectively integrating features from different viewpoints and temporal instances. Demonstrating its efficacy, MV-GMN outperforms state-of-the-art methods on several datasets, achieving notable accuracies of 97.3% and 96.7% on the NTU RGB+D 120 dataset in cross-subject and cross-view scenarios, respectively. MV-GMN also surpasses Transformer-based baselines while requiring only linear inference complexity, underscoring the model's ability to reduce computational load and enhance the scalability and applicability of multi-view action recognition technologies.
https://arxiv.org/abs/2501.13829
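The KNN-based graph construction mentioned above can be sketched as a plain nearest-neighbour adjacency over per-joint features, which a GCN can then consume. This is the generic construction, not MV-GMN's full block; k and the distance metric are assumptions.

```python
# Hedged sketch: build a symmetric KNN adjacency matrix over skeleton joints.
import torch

def knn_adjacency(joint_feats: torch.Tensor, k: int = 4) -> torch.Tensor:
    """joint_feats: (V, D) per-joint features -> (V, V) 0/1 adjacency."""
    dist = torch.cdist(joint_feats, joint_feats)          # pairwise distances
    dist.fill_diagonal_(float("inf"))                     # exclude self-loops here
    idx = dist.topk(k, largest=False).indices             # k nearest joints per joint
    adj = torch.zeros(len(joint_feats), len(joint_feats))
    adj.scatter_(1, idx, 1.0)
    return torch.maximum(adj, adj.t())                    # symmetrize
```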
Human Action Recognition (HAR) is a challenging domain in computer vision, involving recognizing complex patterns by analyzing the spatiotemporal dynamics of individuals' movements in videos. These patterns arise in sequential data, such as video frames, which are often essential to accurately distinguish actions that would be ambiguous in a single image. HAR has garnered considerable interest due to its broad applicability, ranging from robotics and surveillance systems to sports motion analysis, healthcare, and the burgeoning field of autonomous vehicles. While several taxonomies have been proposed to categorize HAR approaches in surveys, they often overlook hybrid methodologies and fail to demonstrate how different models incorporate various architectures and modalities. In this comprehensive survey, we present the novel SMART-Vision taxonomy, which illustrates how innovations in deep learning for HAR complement one another, leading to hybrid approaches beyond traditional categories. Our survey provides a clear roadmap from foundational HAR works to current state-of-the-art systems, highlighting emerging research directions and addressing unresolved challenges in discussion sections for architectures within the HAR domain. We provide details of the research datasets that various approaches use to measure and compare the performance of HAR approaches. We also explore the rapidly emerging field of Open-HAR systems, which challenge HAR systems by presenting samples from unknown, novel classes at test time.
https://arxiv.org/abs/2501.13066
In this paper, we address the issue of static bias in zero-shot action recognition. Action recognition models need to represent the action itself, not the appearance. However, some fully-supervised works show that models often rely on static appearances, such as the background and objects, rather than human actions. This issue, known as static bias, has not been investigated in the zero-shot setting. Although CLIP-based zero-shot models are now common, it remains unclear whether they sufficiently focus on human actions, as CLIP primarily captures appearance features related to language. In this paper, we investigate the influence of static bias in zero-shot action recognition with CLIP-based models. Our approach involves masking backgrounds, objects, and people differently during training and validation. Experiments with background masking show that models depend on background bias, as their performance on Kinetics400 decreases. However, for Mimetics, which has a weak background bias, masking the background leads to improved performance even if the background is masked during validation. Furthermore, masking both the background and objects in different colors improves performance for SSv2, which has a strong object bias. These results suggest that masking the background or objects during training prevents models from overly depending on static bias and makes them focus more on human actions.
https://arxiv.org/abs/2501.12681
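The masking protocol described above is simple to illustrate: zero out (or recolor) pixels outside or inside a person segmentation mask during training. The mask source (e.g. an off-the-shelf segmenter) and fill values are assumptions, not the paper's pipeline.

```python
# Hedged sketch: background vs. person masking of a video frame with a binary mask.
import numpy as np

def mask_background(frame: np.ndarray, person_mask: np.ndarray,
                    fill_value: int = 0) -> np.ndarray:
    """frame: (H, W, 3) uint8; person_mask: (H, W) bool, True on the person."""
    out = frame.copy()
    out[~person_mask] = fill_value        # hide background cues
    return out

def mask_person(frame: np.ndarray, person_mask: np.ndarray,
                fill_value: int = 0) -> np.ndarray:
    out = frame.copy()
    out[person_mask] = fill_value         # hide the subject instead
    return out
```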
Human Pose Estimation (HPE) from monocular RGB images is crucial for clinical in-bed skeleton-based action recognition. However, it poses unique challenges for HPE models due to the frequent presence of blankets occluding the person, and labeled HPE data for this scenario is scarce. To address this, we introduce BlanketGen2-Fit3D (BG2-Fit3D), an augmentation of the Fit3D dataset that contains 1,217,312 frames with synthetic photo-realistic blankets. To generate it we used BlanketGen2, a new and improved version of our BlanketGen pipeline that simulates synthetic blankets using ground-truth Skinned Multi-Person Linear model (SMPL) meshes and then renders them as transparent images that can be layered on top of the original frames. This dataset was used in combination with the original Fit3D to fine-tune the ViTPose-B HPE model and evaluate the effectiveness of synthetic blanket augmentation. The trained models were further evaluated on a real-world blanket-occluded in-bed HPE dataset (the SLP dataset). Compared with architectures trained on Fit3D alone, the models trained with our synthetic blanket augmentation significantly improved pose estimation performance on BG2-Fit3D, the synthetic blanket-occluded dataset, reaching 0.977 Percentage of Correct Keypoints (PCK) and 0.149 Normalized Mean Error (NME), an absolute 4.4% PCK increase. Furthermore, the test results on SLP demonstrated the utility of synthetic data augmentation, improving performance by an absolute 2.3% PCK on real-world images with poses occluded by real blankets. These results show that synthetic blanket augmentation has the potential to improve in-bed blanket-occluded HPE from RGB images. The dataset as well as the code will be made available to the public.
https://arxiv.org/abs/2501.12318
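For reference, the two reported metrics can be sketched as follows: PCK counts the fraction of predicted keypoints within a threshold of the ground truth, and NME is the mean error normalized by a reference length (e.g. torso or bounding-box size). The threshold and normalization below are generic assumptions; the exact protocol follows each dataset.

```python
# Hedged sketch: PCK and NME for pose estimation evaluation.
import numpy as np

def pck(pred: np.ndarray, gt: np.ndarray, norm_len: np.ndarray, thr: float = 0.2) -> float:
    """pred, gt: (N, K, 2) keypoints; norm_len: (N,) per-sample normalization length."""
    dist = np.linalg.norm(pred - gt, axis=-1) / norm_len[:, None]
    return float((dist <= thr).mean())

def nme(pred: np.ndarray, gt: np.ndarray, norm_len: np.ndarray) -> float:
    dist = np.linalg.norm(pred - gt, axis=-1) / norm_len[:, None]
    return float(dist.mean())
```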
The improved competence of generative models can help build multi-modal virtual assistants that leverage modalities beyond language. By observing humans performing multi-step tasks, one can build assistants that have situational awareness of the actions and tasks being performed, enabling them to tailor assistance to this understanding. In this paper, we develop a Context-aware Instructional Task Assistant with Multi-modal Large Language Models (InsTALL) that leverages an online visual stream (e.g. a user's screen share or video recording) and responds in real time to user queries related to the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal model on task videos and paired textual data, and 2) automatically extracts a task graph from video data and leverages it at training and inference time. We show InsTALL achieves state-of-the-art performance across the proposed sub-tasks considered for multimodal activity understanding -- task recognition (TR), action recognition (AR), next action prediction (AP), and plan prediction (PP) -- and outperforms existing baselines on two novel sub-tasks related to automatic error identification.
https://arxiv.org/abs/2501.12231
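A task graph of the kind mentioned above can be thought of as steps (nodes) with "must follow" edges, from which next-action candidates are read off given the steps observed so far. The tiny sketch below illustrates that data structure; the recipe and edges are made up, and InsTALL extracts its graph automatically from video rather than by hand.

```python
# Hedged sketch: a toy task graph and next-action candidates from completed steps.
from collections import defaultdict

edges = [("boil water", "add pasta"), ("add pasta", "drain pasta"),
         ("chop garlic", "fry garlic"), ("drain pasta", "mix sauce"),
         ("fry garlic", "mix sauce")]
succ, pred = defaultdict(set), defaultdict(set)
for a, b in edges:
    succ[a].add(b)
    pred[b].add(a)

def next_actions(done: set) -> set:
    """Steps whose prerequisites are all complete and that are not yet done."""
    candidates = set()
    for a in done:
        candidates |= succ[a]
    return {c for c in candidates if pred[c] <= done} - done

print(next_actions({"boil water", "add pasta", "chop garlic"}))
# -> {'drain pasta', 'fry garlic'}
```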
Graph convolutional networks (GCNs) have emerged as a powerful tool for skeleton-based action and gesture recognition, thanks to their ability to model spatial and temporal dependencies in skeleton data. However, existing GCN-based methods face critical limitations: (1) they lack effective spatio-temporal topology modeling that captures dynamic variations in skeletal motion, and (2) they struggle to model multiscale structural relationships beyond local joint connectivity. To address these issues, we propose a novel framework called Dynamic Spatial-Temporal Semantic Awareness Graph Convolutional Network (DSTSA-GCN). DSTSA-GCN introduces three key modules: Group Channel-wise Graph Convolution (GC-GC), Group Temporal-wise Graph Convolution (GT-GC), and Multi-Scale Temporal Convolution (MS-TCN). GC-GC and GT-GC operate in parallel to independently model channel-specific and frame-specific correlations, enabling robust topology learning that accounts for temporal variations. Additionally, both modules employ a grouping strategy to adaptively capture multiscale structural relationships. Complementing this, MS-TCN enhances temporal modeling through group-wise temporal convolutions with diverse receptive fields. Extensive experiments demonstrate that DSTSA-GCN significantly improves the topology modeling capabilities of GCNs, achieving state-of-the-art performance on benchmark datasets for gesture and action recognition, including SHREC17 Track, DHG-14/28, NTU-RGB+D, and NTU-RGB+D-120.
https://arxiv.org/abs/2501.12086
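A generic multi-scale temporal convolution block in the spirit of MS-TCN is sketched below: parallel 1D temporal branches with different dilations over (N, C, T, V) skeleton tensors, concatenated along channels. The kernel size, dilations, and channel split are illustrative, not the paper's hyper-parameters.

```python
# Hedged sketch: multi-branch dilated temporal convolutions over skeleton sequences.
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 3, 4), kernel_t: int = 5):
        super().__init__()
        branch_c = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, branch_c, kernel_size=1),
                nn.BatchNorm2d(branch_c),
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_c, branch_c, kernel_size=(kernel_t, 1),
                          padding=(d * (kernel_t - 1) // 2, 0), dilation=(d, 1)),
            )
            for d in dilations)

    def forward(self, x):                    # x: (N, C, T, V)
        return torch.cat([b(x) for b in self.branches], dim=1)

x = torch.randn(2, 64, 32, 25)               # batch, channels, frames, joints
y = MultiScaleTemporalConv(64)(x)            # -> (2, 64, 32, 25)
```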