Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates a vision-language foundation model, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph whose nodes represent frames, objects, and textual annotations, and whose edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible, context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance according to action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the strength of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.
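As a concrete illustration of the edge-importance idea, the sketch below implements a single graph-attention layer over a small multimodal graph in PyTorch. It is not the paper's architecture; the node features, edge list, and dimensions are toy assumptions, and the layer simply scores each edge from its endpoint embeddings and normalizes over each node's incoming edges.

```python
# Minimal sketch (not the authors' code): one graph-attention layer over a
# multimodal graph whose nodes mix frame, object, and text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)  # scores each edge

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features; edge_index: (2, E) source/target node ids
        h = self.proj(x)
        src, dst = edge_index
        # attention logit per edge from the concatenated endpoint features
        logits = F.leaky_relu(self.attn(torch.cat([h[src], h[dst]], dim=-1))).squeeze(-1)
        # normalize logits over the incoming edges of each target node
        alpha = torch.zeros_like(logits)
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = torch.softmax(logits[mask], dim=0)
        # aggregate neighbor messages weighted by the learned edge importance
        out = torch.zeros_like(h)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
        return F.elu(out)

# toy usage: 4 nodes (2 frames, 1 object, 1 text annotation), 4 directed edges
x = torch.randn(4, 32)
edges = torch.tensor([[0, 1, 2, 3], [1, 0, 1, 1]])
print(GraphAttentionLayer(32, 16)(x, edges).shape)  # torch.Size([4, 16])
```

In the full framework, such attention scores would additionally be modulated by action semantics, and the edge set itself would be updated as interactions are learned.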
https://arxiv.org/abs/2505.15192
This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing varied motion patterns. Nevertheless, we observe that some actions observed through the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), and can therefore serve as spatial anchors that compensate for localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment. Assuming that a 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. These encoders are then used to reason over the IMU data and the point cloud across time and space to perform inertial localization. Interestingly, the same encoders can further be used to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art inertial localization and inertial action recognition baselines.
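The hierarchical alignment step can be pictured as a standard symmetric InfoNCE objective between the two modality encoders. The sketch below assumes paired IMU-window and local point-cloud embeddings of the same dimension; the encoders themselves and the temperature are placeholders, not the paper's settings.

```python
# Illustrative sketch only: a symmetric InfoNCE loss that pulls a short IMU
# window embedding toward the local point-cloud embedding of the place where
# the action happened. Encoders and dimensions are assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(imu_emb: torch.Tensor,
                               cloud_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """imu_emb, cloud_emb: (B, D) paired embeddings for the same batch."""
    imu_emb = F.normalize(imu_emb, dim=-1)
    cloud_emb = F.normalize(cloud_emb, dim=-1)
    logits = imu_emb @ cloud_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(imu_emb.size(0))             # positives on the diagonal
    # symmetric cross-entropy: IMU -> cloud and cloud -> IMU
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage with random 128-d embeddings for a batch of 8 IMU/point-cloud pairs
loss = contrastive_alignment_loss(torch.randn(8, 128), torch.randn(8, 128))
print(float(loss))
```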
https://arxiv.org/abs/2505.14346
Spatial-temporal graph convolutional networks (ST-GCNs) showcase impressive performance in skeleton-based human action recognition (HAR). However, despite the development of numerous models, their recognition performance does not differ significantly once the input settings are aligned. Based on this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. Additionally, a novel sparse ST-GCN generator is proposed, which trains a sparse architecture from a randomly initialized dense network while maintaining performance comparable to the dense counterpart. Moreover, we generate multi-level sparsity ST-GCNs by integrating sparse structures at various sparsity levels and demonstrate that the assembled model yields a significant enhancement in HAR performance. Thorough experiments on four datasets (NTU RGB+D 60, NTU RGB+D 120, Kinetics-400, and FineGYM) demonstrate that the proposed sparse ST-GCNs achieve performance comparable to their dense counterparts. Even with 95% fewer parameters, the sparse ST-GCNs exhibit a degradation of less than 1% in top-1 accuracy. Meanwhile, the multi-level sparsity ST-GCNs, which require only 66% of the parameters of the dense ST-GCNs, improve top-1 accuracy by more than 1%. The code is available at this https URL.
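The lottery-ticket-style check behind the over-parameterization claim can be sketched as magnitude pruning plus weight rewinding. The snippet below uses a stand-in linear layer rather than an actual ST-GCN block, and the 95% sparsity figure is reused from the reported results purely for illustration.

```python
# Minimal sketch: prune the smallest-magnitude weights of a trained layer, then
# rewind the surviving weights to their initial values before retraining.
import torch
import torch.nn as nn

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the largest-magnitude (1 - sparsity) weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

layer = nn.Linear(256, 256)                    # stand-in for an ST-GCN layer
init_weight = layer.weight.detach().clone()    # saved at initialization

# ... train the dense model here ...

mask = magnitude_prune_mask(layer.weight.detach(), sparsity=0.95)
with torch.no_grad():
    layer.weight.copy_(init_weight * mask)     # rewind + apply sparse mask
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
# during retraining, re-apply `mask` after each optimizer step to keep sparsity
```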
https://arxiv.org/abs/2505.10679
Computer-assisted interventions can improve intra-operative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks, surgical action recognition and intra-operative event prediction, demonstrating that incorporating synthetic videos from our approach substantially enhances model performance. We open-source our implementation at this https URL.
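A minimal version of the rejection-sampling step might look as follows: synthetic clips are kept only if a pretrained recognizer assigns the intended under-represented class with high confidence. The `generate_clip` and `classifier` callables and the 0.8 threshold are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of rejection sampling for synthetic surgical clips.
import torch

@torch.no_grad()
def select_synthetic_clips(generate_clip, classifier, target_class: int,
                           num_needed: int, confidence_threshold: float = 0.8,
                           max_attempts: int = 1000):
    accepted = []
    for _ in range(max_attempts):
        clip = generate_clip(target_class)               # (T, C, H, W) synthetic video
        probs = torch.softmax(classifier(clip.unsqueeze(0)), dim=-1)[0]
        if probs[target_class] >= confidence_threshold:  # reject low-confidence samples
            accepted.append(clip)
        if len(accepted) >= num_needed:
            break
    return accepted
```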
https://arxiv.org/abs/2505.09858
Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage key motion priors, optical flow, and semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO). We show that our model allows for aggressive masking without compromising performance on the downstream task of action recognition while also keeping pre-training memory efficient. Extensive experiments with the proposed approach on four benchmarks (Something-Something v2, Kinetics-400, UCF101, and HMDB51) demonstrate the effectiveness, transferability, generalization, and efficiency of our work compared to other state-of-the-art methods.
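To make the motion-centric selection idea concrete, the sketch below ranks tokens of a frame grid by the path length of point tracks passing through them and keeps only the most dynamic ones visible. This is a hand-crafted stand-in for the learned TATS sampler (which is optimized jointly with the MAE via PPO); the track format, grid size, and keep ratio are assumptions.

```python
# Rough sketch of motion-centric token selection from point trajectories.
import torch

def motion_centric_keep_ids(tracks: torch.Tensor, grid_hw: tuple,
                            keep_ratio: float = 0.1) -> torch.Tensor:
    """tracks: (N, T, 2) coordinates of N tracked points over T frames,
    already normalized to [0, 1]."""
    H, W = grid_hw
    motion = (tracks[:, 1:] - tracks[:, :-1]).norm(dim=-1).sum(dim=-1)  # (N,) path length
    # accumulate per-cell motion using each track's mean position
    cell = (tracks.mean(dim=1) * torch.tensor([W - 1, H - 1])).long()   # (N, 2) x, y cell
    score = torch.zeros(H * W)
    score.index_add_(0, cell[:, 1] * W + cell[:, 0], motion)
    keep = max(1, int(keep_ratio * H * W))
    return score.topk(keep).indices          # token ids to leave visible

# toy usage: 50 random tracks over 16 frames on a 14x14 token grid
ids = motion_centric_keep_ids(torch.rand(50, 16, 2), (14, 14))
print(ids.shape)  # torch.Size([19])
```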
https://arxiv.org/abs/2505.08561
Wi-Fi sensing has emerged as a significant technology in wireless sensing and Integrated Sensing and Communication (ISAC), offering benefits such as low cost, high penetration, and enhanced privacy. Currently, it is widely utilized in various applications, including action recognition, human localization, and crowd counting. However, Wi-Fi sensing also faces challenges, such as low robustness and difficulties in data collection. Recently, there has been an increasing focus on multi-modal Wi-Fi sensing, where other modalities can act as teachers, providing ground truth or robust features for Wi-Fi sensing models to learn from, or can be directly fused with Wi-Fi for enhanced sensing capabilities. Although these methods have demonstrated promising results and substantial value in practical applications, there is a lack of comprehensive surveys reviewing them. To address this gap, this paper reviews the multi-modal Wi-Fi sensing literature from the past 24 months and highlights the current limitations, challenges, and future directions in this field.
https://arxiv.org/abs/2505.06682
Large-scale pre-trained models have achieved remarkable success in language and image tasks, leading an increasing number of studies to explore the application of pre-trained image models, such as CLIP, to few-shot action recognition (FSAR). However, current methods generally suffer from several problems: 1) direct fine-tuning often undermines the generalization capability of the pre-trained model; 2) task-specific information is insufficiently explored on the visual side; 3) semantic order information is typically overlooked during text modeling; 4) existing cross-modal alignment techniques ignore the temporal coupling of multimodal information. To address these issues, we propose Task-Adapter++, a parameter-efficient dual adaptation method for both the image and text encoders. Specifically, to make full use of the variations across different few-shot learning tasks, we design a task-specific adaptation for the image encoder so that the most discriminative information is well captured during feature extraction. Furthermore, we leverage large language models (LLMs) to generate detailed sequential sub-action descriptions for each action class, and introduce semantic order adapters into the text encoder to effectively model the sequential relationships between these sub-actions. Finally, we develop a fine-grained cross-modal alignment strategy that actively maps visual features to the same temporal stage as the corresponding semantic descriptions. Extensive experiments demonstrate the effectiveness and superiority of the proposed method, which consistently achieves state-of-the-art performance on five benchmarks. The code is open-sourced at this https URL.
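The "adapter" ingredient common to both encoders can be illustrated with a generic residual bottleneck module attached to a frozen backbone, as below. This is not the released Task-Adapter++ code; the bottleneck width, the stand-in backbone, and the zero-initialized up-projection are illustrative choices.

```python
# A generic bottleneck adapter: only the adapter parameters are trained.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)      # start as an identity-like residual
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

# usage: freeze the backbone, train adapters (and any task head) only
backbone = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter(512)
features = adapter(backbone(torch.randn(4, 77, 512)))   # e.g. token features
print(features.shape)
```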
https://arxiv.org/abs/2505.06002
Person re-identification (ReID) technology is considered to perform relatively well under controlled, ground-level conditions, but it breaks down when deployed in challenging real-world settings. This is largely due to extreme data variability factors such as resolution, viewpoint changes, scale variations, occlusions, and appearance shifts caused by clothing changes or session drift. Moreover, publicly available datasets do not realistically incorporate variability of these kinds and magnitudes, which limits the progress of this technology. This paper introduces DetReIDX, a large-scale aerial-ground person dataset explicitly designed as a stress test for ReID under real-world conditions. DetReIDX is a multi-session set that includes over 13 million bounding boxes from 509 identities, collected across seven university campuses on three continents, with drone altitudes between 5.8 and 120 meters. More importantly, as a key novelty, DetReIDX subjects were recorded in at least two sessions on different days, with changes in clothing, daylight, and location, making it suitable for actually evaluating long-term person ReID. In addition, the data were annotated with 16 soft biometric attributes and multi-task labels for detection, tracking, ReID, and action recognition. To provide empirical evidence of DetReIDX's usefulness, we considered the specific tasks of human detection and ReID, where SOTA methods degrade catastrophically (by up to 80% in detection accuracy and over 70% in Rank-1 ReID) when exposed to DetReIDX's conditions. The dataset, annotations, and official evaluation protocols are publicly available at this https URL
https://arxiv.org/abs/2505.04793
A current limitation of generative video models is that they produce plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric that better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used not only to compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also to evaluate the motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric that is markedly more sensitive to temporal distortions in synthetic data, and that predicts human evaluations of temporal consistency and realism in videos generated by open-source models better than a wide range of alternatives. We also show that a point track representation lets us spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and a link to the code can be found on the project page: this http URL.
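The distribution-comparison part of such a metric typically reduces to a Fréchet distance between Gaussians fitted to the two feature sets, as in FVD. The sketch below applies that standard formula to stand-in motion features; the paper's actual point-track auto-encoder features and its single-video scoring are not reproduced here.

```python
# Worked sketch: Fréchet distance between two sets of motion features.
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """feats_*: (N, D) motion features for generated / real videos."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):            # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# toy usage with random 16-d features for 200 generated and 200 real clips
print(frechet_distance(np.random.randn(200, 16), np.random.randn(200, 16)))
```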
https://arxiv.org/abs/2505.00209
In action recognition tasks, feature diversity is essential for enhancing model generalization and performance. Existing methods typically promote feature diversity by expanding the training data in the sample space, which often leads to inefficiencies and semantic inconsistencies. To overcome these problems, we propose a novel Coarse-fine text co-guidance Diffusion model (CoCoDiff). CoCoDiff generates diverse yet semantically consistent features in the latent space by leveraging diffusion and multi-granularity textual guidance. Specifically, our approach feeds spatio-temporal features extracted from skeleton sequences into a latent diffusion model to generate diverse action representations. Meanwhile, we introduce a coarse-fine text co-guided strategy that leverages textual information from large language models (LLMs) to ensure semantic consistency between the generated features and the original inputs. Notably, CoCoDiff operates as a plug-and-play auxiliary module during training and incurs no additional inference cost. Extensive experiments demonstrate that CoCoDiff achieves state-of-the-art performance on skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120, and Kinetics-Skeleton.
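The auxiliary objective can be pictured as a standard DDPM-style noise-prediction loss applied to latent skeleton features, with a text embedding as conditioning. The denoiser, noise schedule, and feature dimensions below are assumptions for illustration, not the released CoCoDiff model.

```python
# Minimal sketch: text-conditioned noise prediction on latent skeleton features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedDenoiser(nn.Module):
    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + text_dim + 1, 512),
                                 nn.SiLU(), nn.Linear(512, feat_dim))

    def forward(self, noisy_feat, text_emb, t):
        return self.net(torch.cat([noisy_feat, text_emb, t], dim=-1))

def diffusion_aux_loss(denoiser, feat, text_emb, alphas_cumprod):
    t = torch.randint(0, len(alphas_cumprod), (feat.size(0),))
    a_bar = alphas_cumprod[t].unsqueeze(-1)                 # (B, 1)
    noise = torch.randn_like(feat)
    noisy = a_bar.sqrt() * feat + (1 - a_bar).sqrt() * noise
    pred = denoiser(noisy, text_emb, t.float().unsqueeze(-1) / len(alphas_cumprod))
    return F.mse_loss(pred, noise)          # epsilon-prediction objective

# toy usage: 256-d skeleton features, 512-d text embeddings, 1000-step schedule
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
loss = diffusion_aux_loss(TextConditionedDenoiser(256, 512),
                          torch.randn(8, 256), torch.randn(8, 512), alphas_cumprod)
print(float(loss))
```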
https://arxiv.org/abs/2504.21266
Action recognition from unmanned aerial vehicles (UAVs) poses unique challenges due to significant view variations along the vertical spatial axis. Unlike traditional ground-based settings, UAVs capture actions from a wide range of altitudes, resulting in considerable appearance discrepancies. We introduce a multi-view formulation tailored to varying UAV altitudes and empirically observe a partial order among views, in which recognition accuracy consistently decreases as altitude increases. This motivates a novel approach that explicitly models the hierarchical structure of UAV views to improve recognition performance across altitudes. To this end, we propose the Partial Order Guided Multi-View Network (POG-MVNet), designed to address drastic view variations by effectively leveraging view-dependent information across different altitude levels. The framework comprises three key components: a View Partition (VP) module, which uses the head-to-body ratio to group views by altitude; an Order-aware Feature Decoupling (OFD) module, which disentangles action-relevant and view-specific features under partial order guidance; and an Action Partial Order Guide (APOG), which leverages the partial order to transfer informative knowledge from easier views to support learning in more challenging ones. We conduct experiments on the Drone-Action, MOD20, and UAV datasets, demonstrating that POG-MVNet significantly outperforms competing methods. For example, POG-MVNet achieves a 4.7% improvement on the Drone-Action dataset and a 3.5% improvement on the UAV dataset compared to the state-of-the-art methods ASAT and FAR. The code for POG-MVNet will be made available soon.
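A toy version of the View Partition step is sketched below: detections are bucketed by the apparent head-to-body ratio. The assumption that steeper, higher-altitude views foreshorten the body (so the ratio grows with altitude) and the thresholds themselves are ours for illustration; the paper's grouping rule may differ.

```python
# Hedged sketch of altitude grouping from the head-to-body ratio.
def partition_by_view(head_height_px: float, body_height_px: float) -> str:
    ratio = head_height_px / max(body_height_px, 1e-6)
    if ratio < 0.17:
        return "low_altitude"      # near-horizontal view, close to normal proportions
    if ratio < 0.30:
        return "mid_altitude"
    return "high_altitude"         # strongly foreshortened body, hardest views

print(partition_by_view(head_height_px=24.0, body_height_px=160.0))  # low_altitude
```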
https://arxiv.org/abs/2504.20530
Figure skating, known as the "Art on Ice," is among the most artistic sports and is challenging to understand due to its blend of technical elements (such as jumps and spins) and overall artistic expression. Existing figure skating datasets mainly focus on single tasks, such as action recognition or scoring, and lack comprehensive annotations for both technical and artistic evaluation. Current sports research is largely centered on ball games, with limited relevance to artistic sports like figure skating. To address this, we introduce FSAnno, a large-scale dataset advancing artistic sports understanding through figure skating. FSAnno includes an open-access training and test dataset, alongside a benchmark dataset, FSBench, for fair model evaluation. FSBench consists of FSBench-Text, with multiple-choice questions and explanations, and FSBench-Motion, which contains multimodal data and question-and-answer (QA) pairs, supporting tasks from technical analysis to performance commentary. Initial tests on FSBench reveal significant limitations in existing models' understanding of artistic sports. We hope FSBench will become a key tool for evaluating and enhancing model comprehension of figure skating.
https://arxiv.org/abs/2504.19514
A convolutional neural network (CNN) slides a kernel over the whole image to produce an output map. This kernel scheme reduces the number of parameters with respect to a fully connected neural network (NN). While CNNs have proven effective for recognizing handwritten characters, traffic signs, and similar inputs, their deep variants have recently proven effective in similar as well as more challenging applications such as object, scene, and action recognition. Deep CNNs add more layers and kernels to the classical CNN, increasing the number of parameters and partly eroding the main advantage of CNNs, namely their small parameter count. In this paper, a 3D pyramidal neural network called 3DPyraNet and a discriminative approach for spatio-temporal feature learning based on it, called 3DPyraNet-F, are proposed. 3DPyraNet introduces a new weighting scheme that learns features from both the spatial and temporal dimensions by analyzing multiple adjacent frames while keeping a biologically plausible structure. It preserves the spatial topology of the input image and has fewer parameters and lower computational and memory costs than both fully connected NNs and recent deep CNNs. 3DPyraNet-F extracts the feature maps of the highest layer of the learned network, fuses them into a single vector, and feeds this vector to a linear SVM classifier, which enhances the recognition of human actions and dynamic scenes from videos. Encouraging results are reported with 3DPyraNet in real-world environments, especially in the presence of camera-induced motion. Further, 3DPyraNet-F clearly outperforms the state of the art on three benchmark datasets and shows comparable results on the fourth.
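The 3DPyraNet-F recipe of fusing the top-layer feature maps and classifying them with a linear SVM can be sketched as follows. The feature maps here are random stand-ins for the learned activations, and the shapes are illustrative.

```python
# Sketch: flatten/fuse top-layer feature maps and train a linear SVM on them.
import numpy as np
from sklearn.svm import LinearSVC

def fuse_top_layer(feature_maps: np.ndarray) -> np.ndarray:
    """feature_maps: (num_clips, C, T, H, W) activations from the highest layer."""
    return feature_maps.reshape(feature_maps.shape[0], -1)   # one vector per clip

rng = np.random.default_rng(0)
train_maps = rng.normal(size=(40, 8, 4, 6, 6))   # stand-in for learned activations
train_y = rng.integers(0, 3, size=40)            # 3 action classes (toy labels)
clf = LinearSVC().fit(fuse_top_layer(train_maps), train_y)
print(clf.predict(fuse_top_layer(rng.normal(size=(5, 8, 4, 6, 6)))))
```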
https://arxiv.org/abs/2504.18977
As extended reality (XR) redefines how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes. The goal of our research is to give users and developers the capability to personalize their experience by continually adding new action classes to their device models. Importantly, a user should be able to add new classes in a low-shot and efficient manner, and this process should not require storing or replaying any of the user's sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition. Towards this end, we propose POET: Prompt-Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities, they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly lightweight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks. We contribute two new benchmarks for our new problem setting in human action recognition: (i) the NTU RGB+D dataset for activity recognition, and (ii) the SHREC-2017 dataset for hand gesture recognition. We find that POET consistently outperforms strong baselines across these comprehensive benchmarks. Source code is available at this https URL.
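A minimal reading of prompt-offset tuning is sketched below: a frozen lightweight skeleton backbone receives inputs shifted by a small learnable spatio-temporal offset tensor, and only the offsets and a new classification head are trained when classes are added. The stand-in backbone, tensor shapes, and placement of the offset are assumptions rather than the released POET design.

```python
# Illustrative sketch: learnable prompt offsets on top of a frozen backbone.
import torch
import torch.nn as nn

class PromptOffsetTuner(nn.Module):
    def __init__(self, backbone: nn.Module, num_frames: int, num_joints: int,
                 channels: int, num_new_classes: int, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # efficiency/privacy: backbone stays frozen
            p.requires_grad = False
        # one learnable additive offset per frame, joint, and channel
        self.prompt_offset = nn.Parameter(torch.zeros(1, num_frames, num_joints, channels))
        self.new_head = nn.Linear(feat_dim, num_new_classes)

    def forward(self, skeleton: torch.Tensor) -> torch.Tensor:
        # skeleton: (B, T, J, C) joint coordinates over time
        return self.new_head(self.backbone(skeleton + self.prompt_offset))

# toy usage with a stand-in backbone that maps each clip to a 128-d feature
backbone = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(16 * 25 * 3, 128), nn.ReLU())
tuner = PromptOffsetTuner(backbone, num_frames=16, num_joints=25, channels=3,
                          num_new_classes=5, feat_dim=128)
print(tuner(torch.randn(2, 16, 25, 3)).shape)   # torch.Size([2, 5])
```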
https://arxiv.org/abs/2504.18059
Human pose estimation and action recognition have received attention due to their critical roles in healthcare monitoring, rehabilitation, and assistive technologies. In this study, we propose a novel architecture named Transformer-based Encoder-Decoder Network (TED Net) designed to estimate human skeleton poses from WiFi Channel State Information (CSI). TED Net integrates convolutional encoders with transformer-based attention mechanisms to capture spatiotemporal features from CSI signals. The estimated skeleton poses are used as input to a customized Directed Graph Neural Network (DGNN) for action recognition. We validate our model on two datasets: a publicly available multi-modal dataset for assessing general pose estimation, and a newly collected dataset focused on fall-related scenarios involving 20 participants. Experimental results demonstrate that TED Net outperforms existing approaches in pose estimation, and that the DGNN achieves reliable action classification using CSI-based skeletons, with performance comparable to RGB-based systems. Notably, TED Net maintains robust performance across both fall and non-fall cases. These findings highlight the potential of CSI-driven human skeleton estimation for effective action recognition, particularly in home environments such as elderly fall detection. In such settings, WiFi signals are often readily available, offering a privacy-preserving alternative to vision-based methods, which may raise concerns about continuous camera monitoring.
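A stand-in for the CSI-to-pose mapping is sketched below: a 1D convolutional encoder over the subcarrier dimension followed by transformer self-attention over time, decoded into 2D keypoint coordinates. The layer counts, subcarrier/keypoint numbers, and pooling strategy are assumptions, not the TED Net architecture.

```python
# Minimal conv + transformer sketch for CSI -> skeleton keypoints.
import torch
import torch.nn as nn

class CSIPoseNet(nn.Module):
    def __init__(self, num_subcarriers: int = 114, num_keypoints: int = 17,
                 d_model: int = 128):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.encoder = nn.Sequential(
            nn.Conv1d(num_subcarriers, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.attention = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_keypoints * 2)   # (x, y) per keypoint

    def forward(self, csi: torch.Tensor) -> torch.Tensor:
        # csi: (B, num_subcarriers, T) channel state information over time
        tokens = self.encoder(csi).transpose(1, 2)          # (B, T, d_model)
        tokens = self.attention(tokens)
        pose = self.head(tokens.mean(dim=1))                # pool over time
        return pose.view(-1, self.num_keypoints, 2)

print(CSIPoseNet()(torch.randn(2, 114, 50)).shape)          # torch.Size([2, 17, 2])
```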
https://arxiv.org/abs/2504.16655
Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contact operation and environmental adaptability; however, existing systems often fail to leverage the textual information inherent in datasets. To address this, we propose WiTalk, an innovative text-enhanced wireless sensing framework that seamlessly integrates semantic knowledge through three hierarchical prompt strategies (label-only, brief description, and detailed action description) without requiring architectural modifications or incurring additional data costs. We rigorously validate this framework on three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization (TAL). Experimental results demonstrate significant performance improvements: on XRF55, accuracy for WiFi, RFID, and mmWave increases by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, the average performance of WiFiTAD improves by 4.98%; and on XRFV2, the mean average precision gains across various methods range from 4.02% to 13.68%. Our code is available at this https URL.
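The three prompt levels can be illustrated with simple templates like the ones below; the exact wording used in WiTalk is not specified here, so these strings and the `encode_text` placeholder are assumptions.

```python
# Sketch of the three hierarchical prompt levels for one action class.
def build_prompts(label: str, brief: str, detail: str) -> dict:
    return {
        "label_only": label,
        "brief_description": f"A person is {brief}.",
        "detailed_action": f"A person is {brief}. {detail}",
    }

prompts = build_prompts(
    label="push",
    brief="pushing an object forward with both hands",
    detail="The arms start bent at the chest and extend outward at shoulder height.",
)
for level, text in prompts.items():
    print(level, "->", text)
# features = {level: encode_text(text) for level, text in prompts.items()}
```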
https://arxiv.org/abs/2504.14621
The rapid development of video surveillance systems for object detection, tracking, activity recognition, and anomaly detection has revolutionized our day-to-day lives while raising alarms about privacy. It is difficult to strike a balance between visual privacy and action recognition performance in most computer vision models. Is it possible to safeguard privacy without sacrificing performance? This poses a formidable challenge, as even minor privacy enhancements can lead to substantial performance degradation. To address this challenge, we propose a privacy-preserving image anonymization technique that optimizes the anonymizer using penalties from the utility branch, ensuring improved action recognition performance while minimally affecting privacy leakage. This approach addresses the trade-off between minimizing privacy leakage and maintaining high action recognition performance. The proposed approach is primarily designed to align with the regulatory standards of the EU AI Act and the GDPR, ensuring the protection of personally identifiable information while maintaining action performance. To the best of our knowledge, we are the first to introduce a feature-based penalty scheme that exclusively controls the action features, allowing freedom to anonymize private attributes. Extensive experiments were conducted to validate the effectiveness of the proposed method. The results demonstrate that applying a penalty to the anonymizer from the utility branch enhances action performance while maintaining nearly consistent privacy leakage across different penalty settings.
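The utility-branch penalty can be written as an extra term in the anonymizer's loss, as sketched below. The anonymizer and action model are placeholders, and the privacy objective is left abstract, so this shows only the shape of the trade-off being optimized, not the paper's exact formulation.

```python
# Hedged sketch: anonymizer loss = privacy objective + utility-branch penalty.
import torch
import torch.nn.functional as F

def anonymizer_loss(anonymizer, action_model, privacy_loss_fn,
                    video: torch.Tensor, action_label: torch.Tensor,
                    utility_penalty_weight: float = 1.0) -> torch.Tensor:
    anonymized = anonymizer(video)
    # utility-branch penalty: the action classifier must still work on anonymized frames
    utility_penalty = F.cross_entropy(action_model(anonymized), action_label)
    # privacy objective, e.g. confusing a private-attribute classifier (left abstract)
    privacy_term = privacy_loss_fn(anonymized, video)
    return privacy_term + utility_penalty_weight * utility_penalty
```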
https://arxiv.org/abs/2504.14301
Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.
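A toy end-to-end version of the concept-bottleneck pipeline is sketched below: pose sequences are clustered to discover concepts without labels, each clip is re-expressed as concept activations, and a simple classifier operates on those activations alone. The window/flattening choice, cluster count, and logistic-regression head are assumptions, not PCBEAR's implementation.

```python
# Illustrative pipeline: cluster pose sequences into concepts, then classify
# actions from concept activations only (interpretable bottleneck).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
clips = rng.normal(size=(60, 16, 17, 2))            # 60 clips, 16 frames, 17 joints (x, y)
labels = rng.integers(0, 4, size=60)                # 4 action classes (toy data)

windows = clips.reshape(60, -1)                     # flatten each clip's pose sequence
concepts = KMeans(n_clusters=10, n_init=10, random_state=0).fit(windows)

# concept bottleneck: represent each clip only by (negative) distances to concepts
concept_scores = -concepts.transform(windows)       # (60, 10)
classifier = LogisticRegression(max_iter=1000).fit(concept_scores, labels)
print(classifier.score(concept_scores, labels))     # concept-level, inspectable model
```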
https://arxiv.org/abs/2504.13140
While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degradation. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample-pair construction strategy based on the two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones demonstrate that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting with only one tenth of the parameters and far fewer FLOPs. The code and data are available at: this https URL
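One plausible reading of the pair-construction strategy is sketched below: each labeled sample is paired with a partner that shares its action but not its performer, and another that shares its performer but not its action. The pairing rules and data layout are assumptions; the released pipeline may construct and aggregate pairs differently.

```python
# Sketch of sample-pair construction from the two key attributes.
import random
from collections import defaultdict

def build_sample_pairs(samples):
    """samples: list of dicts with keys 'id', 'action', 'performer'."""
    by_action, by_performer = defaultdict(list), defaultdict(list)
    for s in samples:
        by_action[s["action"]].append(s)
        by_performer[s["performer"]].append(s)
    pairs = []
    for s in samples:
        same_action = [o for o in by_action[s["action"]] if o["performer"] != s["performer"]]
        same_performer = [o for o in by_performer[s["performer"]] if o["action"] != s["action"]]
        if same_action and same_performer:
            # (anchor, action-commonality partner, performer-variability partner)
            pairs.append((s, random.choice(same_action), random.choice(same_performer)))
    return pairs

toy = [{"id": i, "action": i % 3, "performer": i % 4} for i in range(12)]
print(len(build_sample_pairs(toy)))
```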
https://arxiv.org/abs/2504.11749
In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representations. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods that rely on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents the absolute and relative movement of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion that can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.
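A rough reconstruction of the world-local flow representation is given below: for every tracked body point, the absolute frame-to-frame displacement (world) and the displacement relative to the body centroid (local) are stacked into one matrix per frame. The keypoint format and the centroid-based local reference are assumptions based on the abstract.

```python
# Sketch: world (absolute) and local (body-relative) flows from tracked keypoints.
import numpy as np

def world_local_flows(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: (T, J, 2) tracked body points; returns (T - 1, J, 4) flows."""
    world = keypoints[1:] - keypoints[:-1]                     # absolute motion per point
    centroid = keypoints.mean(axis=1, keepdims=True)           # (T, 1, 2) body center
    local = (keypoints[1:] - centroid[1:]) - (keypoints[:-1] - centroid[:-1])
    return np.concatenate([world, local], axis=-1)             # (T - 1, J, 4)

flows = world_local_flows(np.random.rand(16, 17, 2))
print(flows.shape)   # (15, 17, 4)
```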
https://arxiv.org/abs/2504.10676