The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
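The abstract does not spell out the training objective, so here is a minimal sketch of the collaborative-training idea as described: a CLIP-style video-text contrastive term plus a term that aligns video features with SkeletonCLIP skeleton features, so skeletons are only needed during training, not at inference. The loss names, weighting (`alpha`), and interfaces are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ski_training_loss(video_emb, text_emb, skel_emb, temperature=0.07, alpha=0.5):
    """Minimal sketch, assuming row-aligned batched embeddings: a video-text
    contrastive term plus an alignment term that infuses SkeletonCLIP skeleton
    features into the vision-language space."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    s = F.normalize(skel_emb, dim=-1)

    logits = v @ t.T / temperature
    targets = torch.arange(v.shape[0], device=v.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.T, targets))

    align = (1.0 - (v * s).sum(dim=-1)).mean()   # pull video features toward skeleton features
    return contrastive + alpha * align            # skeletons are only needed at training time
```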
https://arxiv.org/abs/2502.03459
Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both biases. We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to a sampled static clip (in which all frames are identical) and aims to make its class probabilities uniform through a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss to regularize the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state of the art and improving combined debiasing performance by over 12% on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 bias evaluation protocol, which offers a shortcut for predicting actions and therefore does not accurately measure a model's debiasing capability. We address this issue by proposing more fine-grained segmentation boundaries for the actor, where our method also outperforms existing approaches. Project Page: this https URL
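For intuition, a rough sketch of two of the terms the abstract names — the entropy-maximization loss on a static clip (all frames identical) and the gradient penalty — is given below; shapes, the frame-sampling scheme, and the exact regularization form are assumptions, and the adversarial cross-entropy term is omitted.

```python
import torch
import torch.nn.functional as F

def static_clip_debias_terms(model, clip):
    """Minimal sketch, assuming clip has shape (B, T, C, H, W). A static clip repeats
    one sampled frame; its class distribution is pushed toward uniform (entropy
    maximization), with a gradient penalty as a regularizer."""
    t = torch.randint(0, clip.shape[1], (1,)).item()
    static_clip = clip[:, t:t + 1].detach().expand_as(clip).contiguous()
    static_clip.requires_grad_(True)

    probs = F.softmax(model(static_clip), dim=-1)               # (B, num_classes)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    ent_max_loss = -entropy                                     # minimizing this maximizes entropy

    grad = torch.autograd.grad(ent_max_loss, static_clip, create_graph=True)[0]
    grad_penalty = grad.pow(2).mean()                           # regularizes the debiasing signal
    return ent_max_loss, grad_penalty
```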
https://arxiv.org/abs/2502.00156
Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and outperforms state-of-the-art methods, such as ActionFormer and WiFiTAD. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation models pre-training, synthetic data generation, and more.
https://arxiv.org/abs/2501.19034
In real-world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Besides, the recent advent of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired works that leverage these models to enhance adaptation and generalization performance or adapt them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up-to-date literature at this https URL.
https://arxiv.org/abs/2501.18592
This paper presents the first-place solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at the International Conference on Pattern Recognition (ICPR) 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes, collected from multi-modal sources. The proposed approach is built upon the Temporal Shift Module (TSM), a technique aimed at efficiently capturing temporal dynamics in video data, incorporating multiple data input types. Our strategy included transfer learning to leverage pre-trained models, followed by meticulous fine-tuning on the challenge's specific dataset to optimize performance for the 20 action classes. We carefully selected a backbone network to balance computational efficiency and recognition accuracy, and further refined the model using an ensemble technique that integrates outputs from different modalities. This ensemble approach proved crucial in boosting the overall performance. Our solution achieved a perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across 20 classes. Our code is available online at this https URL.
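Since the decisive step was the modality ensemble, here is a minimal late-fusion sketch (weighted averaging of per-modality softmax scores); the weights and interface are illustrative, not the competition code.

```python
import numpy as np

def ensemble_modalities(logits_by_modality, weights=None):
    """Minimal late-fusion sketch: average softmax scores from models trained on
    different modalities, then take the argmax as the ensembled prediction."""
    probs = []
    for logits in logits_by_modality:                # each: (N_samples, N_classes)
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs.append(e / e.sum(axis=-1, keepdims=True))
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(axis=-1)
```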
https://arxiv.org/abs/2501.17550
We present RASO, a foundation model designed to Recognize Any Surgical Object, offering robust open-set recognition capabilities across a broad range of surgical procedures and object classes, in both surgical images and videos. RASO leverages a novel weakly-supervised learning framework that generates tag-image-text pairs automatically from large-scale unannotated surgical lecture videos, significantly reducing the need for manual annotations. Our scalable data generation pipeline gathers data from 2,200 surgical procedures and produces 3.6 million tag annotations across 2,066 unique surgical tags. Our experiments show that RASO achieves improvements of 2.9 mAP, 4.5 mAP, 10.6 mAP, and 7.2 mAP on four standard surgical benchmarks respectively in zero-shot settings, and surpasses state-of-the-art models in supervised surgical action recognition tasks. We will open-source our code, model, and dataset to facilitate further research.
https://arxiv.org/abs/2501.15326
Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, and surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. Additionally, while current Transformer-based pose estimation methods have demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, ensuring that the model focuses solely on regions centered on the target person's body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves an 84.8 mean Average Precision (mAP) on the challenging wrist joint, which significantly outperforms the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.
https://arxiv.org/abs/2501.14439
Recent advancements in multi-view action recognition have largely relied on Transformer-based models. While effective and adaptable, these models often require substantial computational resources, especially in scenarios with multiple views and multiple temporal sequences. Addressing this limitation, this paper introduces the MV-GMN model, a state-space model specifically designed to efficiently aggregate multi-modal data (RGB and skeleton), multi-view perspectives, and multi-temporal information for action recognition with reduced computational complexity. The MV-GMN model employs an innovative Multi-View Graph Mamba network comprising a series of MV-GMN blocks. Each block includes a proposed Bidirectional State Space Block and a GCN module. The Bidirectional State Space Block introduces four scanning strategies, including view-prioritized and time-prioritized approaches. The GCN module leverages rule-based and KNN-based methods to construct the graph network, effectively integrating features from different viewpoints and temporal instances. Demonstrating its efficacy, MV-GMN outperforms the state of the art on several datasets, achieving notable accuracies of 97.3% and 96.7% on the NTU RGB+D 120 dataset in cross-subject and cross-view scenarios, respectively. MV-GMN also surpasses Transformer-based baselines while requiring only linear inference complexity, underscoring the model's ability to reduce computational load and enhance the scalability and applicability of multi-view action recognition technologies.
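As a rough illustration of the view-prioritized versus time-prioritized scanning strategies mentioned for the Bidirectional State Space Block, the sketch below flattens the multi-view, multi-time token grid in either order and optionally reverses it for the backward scan; shapes and ordering details are assumptions.

```python
import torch

def scan_order(x: torch.Tensor, view_first=True, reverse=False):
    """x: (B, V_views, T, N_joints, C). Flatten views/time/joints into one token
    sequence, either view-prioritized or time-prioritized; reverse=True gives the
    backward direction of a bidirectional state-space scan."""
    if not view_first:
        x = x.transpose(1, 2)                        # (B, T, V, N, C): time-prioritized
    seq = x.reshape(x.shape[0], -1, x.shape[-1])     # (B, V*T*N, C)
    return seq.flip(1) if reverse else seq
```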
https://arxiv.org/abs/2501.13829
Human Action Recognition (HAR) is a challenging domain in computer vision, involving recognizing complex patterns by analyzing the spatiotemporal dynamics of individuals' movements in videos. These patterns arise in sequential data, such as video frames, which are often essential to accurately distinguish actions that would be ambiguous in a single image. HAR has garnered considerable interest due to its broad applicability, ranging from robotics and surveillance systems to sports motion analysis, healthcare, and the burgeoning field of autonomous vehicles. While several taxonomies have been proposed to categorize HAR approaches in surveys, they often overlook hybrid methodologies and fail to demonstrate how different models incorporate various architectures and modalities. In this comprehensive survey, we present the novel SMART-Vision taxonomy, which illustrates how innovations in deep learning for HAR complement one another, leading to hybrid approaches beyond traditional categories. Our survey provides a clear roadmap from foundational HAR works to current state-of-the-art systems, highlighting emerging research directions and addressing unresolved challenges in dedicated discussion sections for architectures within the HAR domain. We provide details of the research datasets that various approaches use to measure and compare the performance of HAR approaches. We also explore the rapidly emerging field of Open-HAR systems, which challenge HAR models by presenting samples from unknown, novel classes at test time.
https://arxiv.org/abs/2501.13066
In this paper, we address the issue of static bias in zero-shot action recognition. Action recognition models need to represent the action itself, not the appearance. However, some fully-supervised works show that models often rely on static appearances, such as the background and objects, rather than human actions. This issue, known as static bias, has not been investigated in the zero-shot setting. Although CLIP-based zero-shot models are now common, it remains unclear whether they sufficiently focus on human actions, as CLIP primarily captures appearance features related to language. In this paper, we investigate the influence of static bias in zero-shot action recognition with CLIP-based models. Our approach involves masking backgrounds, objects, and people differently during training and validation. Experiments with background masking show that models depend on background cues, as their performance on Kinetics400 decreases. However, for Mimetics, which has a weak background bias, masking the background leads to improved performance even if the background is also masked during validation. Furthermore, masking both the background and objects in different colors improves performance for SSv2, which has a strong object bias. These results suggest that masking the background or objects during training prevents models from overly depending on static bias and makes them focus more on human actions.
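A minimal sketch of the masking protocol described, assuming per-frame person masks are available (e.g., from an off-the-shelf segmentation model); region choices and fill values are illustrative.

```python
import torch

def mask_clip(frames: torch.Tensor, person_mask: torch.Tensor, region="background", fill=0.0):
    """frames: (T, C, H, W); person_mask: (T, 1, H, W) with 1 on the person.
    region="background" hides everything but the person; region="person" hides
    the person instead, to probe which cues the model relies on."""
    keep = person_mask if region == "background" else 1.0 - person_mask
    return frames * keep + fill * (1.0 - keep)
```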
https://arxiv.org/abs/2501.12681
Human Pose Estimation (HPE) from monocular RGB images is crucial for clinical in-bed skeleton-based action recognition; however, it poses unique challenges for HPE models due to the frequent presence of blankets occluding the person, while labeled HPE data in this scenario is scarce. To address this, we introduce BlanketGen2-Fit3D (BG2-Fit3D), an augmentation of the Fit3D dataset that contains 1,217,312 frames with synthetic photo-realistic blankets. To generate it we used BlanketGen2, our new and improved version of the BlanketGen pipeline, which simulates synthetic blankets using ground-truth Skinned Multi-Person Linear model (SMPL) meshes and then renders them as transparent images that can be layered on top of the original frames. This dataset was used in combination with the original Fit3D to fine-tune the ViTPose-B HPE model and evaluate the effectiveness of synthetic blanket augmentation. The trained models were further evaluated on a real-world blanket-occluded in-bed HPE dataset (the SLP dataset). Comparing architectures trained only on Fit3D with those trained with our synthetic blanket augmentation, the latter significantly improved pose estimation on BG2-Fit3D, the synthetic blanket-occluded dataset, reaching 0.977 Percentage of Correct Keypoints (PCK) and 0.149 Normalized Mean Error (NME), an absolute 4.4% PCK increase. Furthermore, the test results on SLP demonstrated the utility of synthetic data augmentation, improving performance by an absolute 2.3% PCK on real-world images with poses occluded by real blankets. These results show that synthetic blanket augmentation has the potential to improve in-bed, blanket-occluded HPE from RGB images. The dataset as well as the code will be made available to the public.
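The pipeline renders blankets as transparent images layered over the original frames; a minimal alpha-compositing sketch of that layering step follows (the actual BlanketGen2 renderer, which drapes blankets over ground-truth SMPL meshes, is far more involved).

```python
import numpy as np

def composite_blanket(frame_rgb, blanket_rgba):
    """frame_rgb: (H, W, 3) uint8 original frame; blanket_rgba: (H, W, 4) uint8
    rendered blanket with an alpha channel. Returns the occluded frame."""
    rgb = blanket_rgba[..., :3].astype(np.float32)
    alpha = blanket_rgba[..., 3:4].astype(np.float32) / 255.0
    out = alpha * rgb + (1.0 - alpha) * frame_rgb.astype(np.float32)
    return out.astype(np.uint8)
```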
https://arxiv.org/abs/2501.12318
The improved competence of generative models can help build multi-modal virtual assistants that leverage modalities beyond language. By observing humans performing multi-step tasks, one can build assistants that have situational awareness of the actions and tasks being performed, enabling them to tailor assistance based on this understanding. In this paper, we develop a Context-aware Instructional Task Assistant with Multi-modal Large Language Models (InsTALL) that leverages an online visual stream (e.g., a user's screen share or video recording) and responds in real time to user queries related to the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal model on task videos and paired textual data, and 2) automatically extracts a task graph from video data and leverages it at training and inference time. We show InsTALL achieves state-of-the-art performance across the proposed sub-tasks considered for multimodal activity understanding -- task recognition (TR), action recognition (AR), next action prediction (AP), and plan prediction (PP) -- and outperforms existing baselines on two novel sub-tasks related to automatic error identification.
https://arxiv.org/abs/2501.12231
Graph convolutional networks (GCNs) have emerged as a powerful tool for skeleton-based action and gesture recognition, thanks to their ability to model spatial and temporal dependencies in skeleton data. However, existing GCN-based methods face critical limitations: (1) they lack effective spatio-temporal topology modeling that captures dynamic variations in skeletal motion, and (2) they struggle to model multiscale structural relationships beyond local joint connectivity. To address these issues, we propose a novel framework called Dynamic Spatial-Temporal Semantic Awareness Graph Convolutional Network (DSTSA-GCN). DSTSA-GCN introduces three key modules: Group Channel-wise Graph Convolution (GC-GC), Group Temporal-wise Graph Convolution (GT-GC), and Multi-Scale Temporal Convolution (MS-TCN). GC-GC and GT-GC operate in parallel to independently model channel-specific and frame-specific correlations, enabling robust topology learning that accounts for temporal variations. Additionally, both modules employ a grouping strategy to adaptively capture multiscale structural relationships. Complementing this, MS-TCN enhances temporal modeling through group-wise temporal convolutions with diverse receptive fields. Extensive experiments demonstrate that DSTSA-GCN significantly improves the topology modeling capabilities of GCNs, achieving state-of-the-art performance on benchmark datasets for gesture and action recognition, including SHREC17 Track, DHG-14/28, NTU-RGB+D, and NTU-RGB+D-120.
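As a heavily simplified illustration of the grouping idea behind GC-GC, the sketch below splits channels into groups, each aggregating joints with its own learnable adjacency; the grouping, initialization, and normalization choices are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class GroupChannelGraphConv(nn.Module):
    """Minimal sketch: per-group learnable joint adjacency over (B, C, T, V) skeleton
    features, followed by a 1x1 projection."""
    def __init__(self, channels, num_joints, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.adj = nn.Parameter(torch.eye(num_joints).repeat(groups, 1, 1))
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                      # x: (B, C, T, V)
        b, c, t, v = x.shape
        xg = x.view(b, self.groups, c // self.groups, t, v)
        out = torch.einsum("bgctv,gvw->bgctw", xg, self.adj)   # per-group joint mixing
        return self.proj(out.reshape(b, c, t, v))
```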
https://arxiv.org/abs/2501.12086
In recent years, action recognition has received much attention and wide application due to its important role in video understanding. Most research on action recognition methods has focused on improving performance via various deep learning methods rather than on the classification of skeleton points. The topological modeling between skeleton points and body parts has seldom been considered. Although some studies have used a data-driven approach to classify the topology of the skeleton points, the kinematic nature of the skeleton points has not been taken into consideration. Therefore, in this paper, we draw on the theory of kinematics to adapt the topological relations of skeleton points and propose a topological relation classification based on body parts and distance from the core of the body. To synthesize these topological relations for action recognition, we propose a novel Hypergraph Fusion Graph Convolutional Network (HFGCN). In particular, the proposed model is able to focus on the human skeleton points and the different body parts simultaneously, and thus construct the topology, which noticeably improves recognition accuracy. We use a hypergraph to represent the categorical relationships of these skeleton points and incorporate the hypergraph into a graph convolutional network to model the higher-order relationships among the skeleton points and enhance the feature representation of the network. In addition, our proposed hypergraph attention module and hypergraph graph convolution module optimize topology modeling in the temporal and channel dimensions, respectively, to further enhance the feature representation of the network. We conducted extensive experiments on three widely used datasets. The results validate that our proposed method achieves the best performance when compared with state-of-the-art skeleton-based methods.
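The method builds on hypergraph convolution to capture higher-order relations among skeleton points; below is a minimal sketch of standard HGNN-style propagation (a simplified, row-normalized variant), not the paper's exact HFGCN modules.

```python
import torch

def hypergraph_conv(x, H, edge_w, theta):
    """x: (V, C) node features; H: (V, E) incidence matrix; edge_w: (E,) hyperedge
    weights; theta: (C, C_out) projection. Propagates features through hyperedges,
    roughly Dv^-1 H W De^-1 H^T X Theta."""
    De = H.sum(dim=0).clamp_min(1e-6)                 # hyperedge degrees
    Dv = (H * edge_w).sum(dim=1).clamp_min(1e-6)      # weighted node degrees
    msg = H @ torch.diag(edge_w / De) @ H.T @ (x @ theta)
    return msg / Dv.unsqueeze(-1)
```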
https://arxiv.org/abs/2501.11007
The current biodiversity loss crisis makes animal monitoring a relevant field of study. In light of this, data collected through monitoring can provide essential insights and information for decision-making aimed at preserving global biodiversity. Despite the importance of such data, there is a notable scarcity of datasets featuring videos of birds, and none of the existing datasets offer detailed annotations of bird behaviors in video format. In response to this gap, our study introduces the first fine-grained video dataset specifically designed for bird behavior detection and species classification. This dataset addresses the need for comprehensive bird video datasets and provides detailed data on bird actions, facilitating the development of deep learning models to recognize them, similar to the advancements made in human action recognition. The proposed dataset comprises 178 videos recorded in Spanish wetlands, capturing 13 different bird species performing 7 distinct behavior classes. In addition, we also present baseline results using state-of-the-art models on two tasks: bird behavior recognition and species classification.
https://arxiv.org/abs/2501.08931
With the availability of egocentric 3D hand-object interaction datasets, there is increasing interest in developing unified models for hand-object pose estimation and action recognition. However, existing methods still struggle to recognise seen actions on unseen objects due to the limitations in representing object shape and movement using 3D bounding boxes. Additionally, the reliance on object templates at test time limits their generalisability to unseen objects. To address these challenges, we propose to leverage superquadrics as an alternative 3D object representation to bounding boxes and demonstrate their effectiveness on both template-free object reconstruction and action recognition tasks. Moreover, as we find that pure appearance-based methods can outperform the unified methods, the potential benefits from 3D geometric information remain unclear. Therefore, we study the compositionality of actions by considering a more challenging task where the training combinations of verbs and nouns do not overlap with the testing split. We extend H2O and FPHA datasets with compositional splits and design a novel collaborative learning framework that can explicitly reason about the geometric relations between hands and the manipulated object. Through extensive quantitative and qualitative evaluations, we demonstrate significant improvements over the state-of-the-arts in (compositional) action recognition.
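For reference, the superquadric representation relies on the standard inside-outside function; a minimal sketch in the canonical (untransformed) object frame is shown below, leaving out the pose and scale recovery.

```python
import numpy as np

def superquadric_inside_outside(points, scale, eps1, eps2):
    """points: (N, 3) in the superquadric's canonical frame; scale: (3,) semi-axes;
    eps1, eps2: shape exponents. Returns F(x): < 1 inside, = 1 on the surface,
    > 1 outside."""
    x, y, z = (points / scale).T
    f_xy = (np.abs(x) ** (2.0 / eps2) + np.abs(y) ** (2.0 / eps2)) ** (eps2 / eps1)
    return f_xy + np.abs(z) ** (2.0 / eps1)
```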
https://arxiv.org/abs/2501.07100
Human skeleton information is important in skeleton-based action recognition, which provides a simple and efficient way to describe human pose. However, existing skeleton-based methods focus more on the skeleton, ignoring the objects interacting with humans, resulting in poor performance in recognizing actions that involve object interactions. We propose a new action recognition framework introducing object nodes to supplement absent interactive object information. We also propose Spatial Temporal Variable Graph Convolutional Networks (ST-VGCN) to effectively model the Variable Graph (VG) containing object nodes. Specifically, in order to validate the role of interactive object information, by leveraging a simple self-training approach, we establish a new dataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, including more than 2 million additional object nodes. At the same time, we design the Variable Graph construction method to accommodate a variable number of nodes in the graph structure. Additionally, we are the first to explore the overfitting issue introduced by incorporating additional object information, and we propose a VG-based data augmentation method to address this issue, called Random Node Attack. Finally, regarding the network structure, we introduce two fusion modules, CAF and WNPool, along with a novel Node Balance Loss, to enhance the comprehensive performance by effectively fusing and balancing skeleton and object node information. Our method surpasses the previous state of the art on multiple skeleton-based action recognition benchmarks. The accuracy of our method on the NTU RGB+D 60 cross-subject split is 96.7%, and on the cross-view split it is 99.2%.
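A minimal sketch of the kind of VG-based augmentation the abstract calls Random Node Attack: object nodes are randomly dropped or perturbed while skeleton joints are left intact. The probabilities, noise model, and interface are assumptions.

```python
import numpy as np

def random_node_attack(joint_feats, object_feats, drop_p=0.3, noise_std=0.05, rng=None):
    """joint_feats: (N_joints, C) skeleton nodes (left untouched); object_feats:
    (N_obj, C) object nodes. Randomly zero out a fraction of object nodes and
    jitter the rest."""
    rng = rng if rng is not None else np.random.default_rng()
    obj = object_feats.copy()
    drop = rng.random(obj.shape[0]) < drop_p
    obj[drop] = 0.0                                              # dropped object nodes
    obj[~drop] += rng.normal(0.0, noise_std, obj[~drop].shape)   # perturbed object nodes
    return joint_feats, obj
```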
https://arxiv.org/abs/2501.05066
The proliferation of video content production has led to vast amounts of data, posing substantial challenges in terms of analysis efficiency and resource utilization. Addressing this issue calls for the development of robust video analysis tools. This paper proposes a novel approach leveraging Generative Artificial Intelligence (GenAI) to facilitate streamlined video analysis. Our tool aims to deliver tailored textual summaries of user-defined queries, offering a focused insight amidst extensive video datasets. Unlike conventional frameworks that offer generic summaries or limited action recognition, our method harnesses the power of GenAI to distil relevant information, enhancing analysis precision and efficiency. Employing YOLO-V8 for object detection and Gemini for comprehensive video and text analysis, our solution achieves heightened contextual accuracy. By combining YOLO with Gemini, our approach furnishes textual summaries extracted from extensive CCTV footage, enabling users to swiftly navigate and verify pertinent events without the need for exhaustive manual review. The quantitative evaluation revealed a similarity of 72.8%, while the qualitative assessment rated an accuracy of 85%, demonstrating the capability of the proposed method.
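A schematic sketch of the described detector-plus-VLM pipeline is shown below; `detect_objects` and `vlm_summarize` are hypothetical callables standing in for YOLO-V8 and Gemini, since the paper's actual interfaces are not given.

```python
def summarize_footage(frames, query, detect_objects, vlm_summarize, stride=30):
    """frames: decoded CCTV frames; query: the user's question. Keep only frames where
    the detector finds something, then ask the vision-language model for a
    query-focused textual summary of those keyframes."""
    keyframes = []
    for i in range(0, len(frames), stride):               # subsample the stream
        detections = detect_objects(frames[i])            # e.g. [(label, box, score), ...]
        if detections:
            keyframes.append((i, frames[i], detections))
    return vlm_summarize(query=query, keyframes=keyframes)
```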
https://arxiv.org/abs/2501.04764
The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored indefinitely in a very small storage space compared to videos, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations yield 80% and 70% accuracy for the temporal and spatial quality and consistency of the pipeline, respectively.
https://arxiv.org/abs/2501.02850
Skeleton-based action recognition has gained significant attention for its ability to efficiently represent spatiotemporal information in a lightweight format. Most existing approaches use graph-based models to process skeleton sequences, where each pose is represented as a skeletal graph structured around human physical connectivity. Among these, the Spatiotemporal Graph Convolutional Network (ST-GCN) has become a widely used framework. Alternatively, hypergraph-based models, such as the Hyperformer, capture higher-order correlations, offering a more expressive representation of complex joint interactions. A recent advancement, termed Taylor Videos, introduces motion-enhanced skeleton sequences by embedding motion concepts, providing a fresh perspective on interpreting human actions in skeleton-based action recognition. In this paper, we conduct a comprehensive evaluation of both traditional skeleton sequences and Taylor-transformed skeletons using ST-GCN and Hyperformer models on the NTU-60 and NTU-120 datasets. We compare skeletal graph and hypergraph representations, analyzing static poses against motion-injected poses. Our findings highlight the strengths and limitations of Taylor-transformed skeletons, demonstrating their potential to enhance motion dynamics while exposing current challenges in fully using their benefits. This study underscores the need for innovative skeletal modelling techniques to effectively handle motion-rich data and advance the field of action recognition.
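A minimal sketch of the motion-injection idea being evaluated: augmenting each skeleton sequence with finite-difference terms (roughly velocity and acceleration), loosely following the Taylor-video notion of embedding motion concepts; the exact transform used in the paper may differ.

```python
import numpy as np

def taylor_skeleton(seq, order=2):
    """seq: (T, J, C) joint coordinates over time. Returns (T, J, C * (order + 1))
    with the original pose plus successive temporal differences appended."""
    terms = [seq]
    d = seq
    for _ in range(order):
        d = np.diff(d, axis=0, prepend=d[:1])    # length-preserving temporal difference
        terms.append(d)
    return np.concatenate(terms, axis=-1)
```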
https://arxiv.org/abs/2501.02593