Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.
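To make the concept-bottleneck idea concrete, here is a minimal sketch in Python (shapes, cluster counts, and the linear head are illustrative assumptions, not the authors' implementation): pose descriptors are clustered into static and dynamic concepts, and a linear classifier predicts actions from concept activations alone.

```python
# Minimal pose-concept-bottleneck sketch: cluster pose descriptors into
# "concepts", then predict actions from concept activations only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_videos, n_frames, n_joints = 200, 16, 17
poses = rng.normal(size=(n_videos, n_frames, n_joints, 2))   # stand-in 2D keypoints
labels = rng.integers(0, 5, size=n_videos)                   # stand-in action labels

static_desc = poses.reshape(n_videos * n_frames, -1)          # per-frame pose -> static concepts
dynamic_desc = np.diff(poses, axis=1).reshape(n_videos, -1)   # frame-to-frame motion -> dynamic concepts

static_km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(static_desc)
dynamic_km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(dynamic_desc)

# Concept activations: similarity (negative distance) to each discovered concept.
static_act = -static_km.transform(static_desc).reshape(n_videos, n_frames, -1).mean(axis=1)
dynamic_act = -dynamic_km.transform(dynamic_desc)
concepts = np.concatenate([static_act, dynamic_act], axis=1)  # the interpretable bottleneck

head = LogisticRegression(max_iter=1000).fit(concepts, labels)  # linear concept -> action head
print(head.predict(concepts[:3]))
```

Because the classifier only sees concept activations, each prediction can be traced back to the nearest pose clusters, which is what enables the test-time interventions mentioned above.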
https://arxiv.org/abs/2504.13140
While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degradation. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample pair construction strategy on these two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones, demonstrating that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting, with only 1/10 of the parameters and far fewer FLOPs. The code and data are available at: this https URL
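A rough sketch of the attribute-based pairing idea (the pairing rule and the convex-mix aggregation are illustrative assumptions; the paper's feature aggregation module is learned): each sample is paired with another that shares the action but comes from a different performer, and the paired features are mixed.

```python
import random
import torch

def build_pairs(samples):
    """samples: list of dicts with 'feat' (Tensor), 'action', 'performer'."""
    pairs = []
    for s in samples:
        partners = [t for t in samples
                    if t['action'] == s['action'] and t['performer'] != s['performer']]
        if partners:
            pairs.append((s, random.choice(partners)))
    return pairs

def aggregate(pair, alpha=0.5):
    """Toy feature aggregation: convex mix of the two paired features."""
    a, b = pair
    return alpha * a['feat'] + (1 - alpha) * b['feat']

samples = [{'feat': torch.randn(64), 'action': i % 3, 'performer': i % 4} for i in range(12)]
mixed = torch.stack([aggregate(p) for p in build_pairs(samples)])
print(mixed.shape)   # extra training signal distilled from pair relations
```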
https://arxiv.org/abs/2504.11749
In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.
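The abstract does not spell out the world-local flow formulation, but a hedged approximation from keypoints might look like this, where the "world" flow is each point's absolute displacement and the "local" flow is its displacement relative to an assumed root joint:

```python
import numpy as np

def world_local_flows(kpts, root_idx=0):
    """kpts: (T, J, 2) array of keypoints for one person."""
    world = np.diff(kpts, axis=0)                         # (T-1, J, 2) absolute motion
    rel = kpts - kpts[:, root_idx:root_idx + 1, :]        # body-centred coordinates
    local = np.diff(rel, axis=0)                          # (T-1, J, 2) motion w.r.t. the root
    return np.concatenate([world, local], axis=-1)        # (T-1, J, 4) matrix-form descriptor

kpts = np.random.rand(16, 17, 2)      # 16 frames, 17 joints
flows = world_local_flows(kpts)
print(flows.shape)                    # (15, 17, 4)
```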
https://arxiv.org/abs/2504.10676
Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations independently for each video by designing various inter-frame temporal modeling strategies. However, they neglect explicit relation modeling between videos and tasks, thus failing to capture shared temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. In addition to conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and learning intra- and inter-class temporal correlations among support features; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from a knowledge bank that stores diverse temporal patterns from historical tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.
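A minimal sketch of the bank-retrieval step in the spirit of IKT (bank size, feature dimension, and the attention form are assumptions): current-task temporal features attend over stored patterns and aggregate the retrieved knowledge back into themselves.

```python
import torch
import torch.nn.functional as F

bank = torch.randn(256, 128)          # stored temporal patterns from past tasks
query = torch.randn(8, 128)           # per-video temporal features of the current task

attn = F.softmax(query @ bank.t() / 128 ** 0.5, dim=-1)    # relevance of each stored pattern
retrieved = attn @ bank                                     # (8, 128) aggregated knowledge
enriched = query + retrieved                                # fuse back into task features
print(enriched.shape)
```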
https://arxiv.org/abs/2504.10079
Action recognition is an essential task in egocentric vision due to its wide range of applications across many fields. While deep learning methods have been proposed to address this task, most rely on a single modality, typically video. However, including additional modalities may improve the robustness of the approaches to common issues in egocentric videos, such as blurriness and occlusions. Recent efforts in multimodal egocentric action recognition often assume the availability of all modalities, leading to failures or performance drops when any modality is missing. To address this, we introduce an efficient multimodal knowledge distillation approach for egocentric action recognition that is robust to missing modalities (KARMMA) while still benefiting when multiple modalities are available. Our method focuses on resource-efficient development by leveraging pre-trained models as unimodal feature extractors in our teacher model, which distills knowledge into a much smaller and faster student model. Experiments on the Epic-Kitchens and Something-Something datasets demonstrate that our student model effectively handles missing modalities while reducing its accuracy drop in this scenario.
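One way to picture the missing-modality robustness is distillation with modality dropout; the sketch below uses stand-in linear heads and an assumed 50% audio-drop rate, not the paper's actual architecture: the teacher always sees both modalities, while the student sometimes sees only one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(2 * 256, 10)      # stand-in for a frozen multimodal teacher head
student = nn.Linear(2 * 256, 10)      # the real student is much smaller and faster

video, audio = torch.randn(4, 256), torch.randn(4, 256)
with torch.no_grad():
    t_logits = teacher(torch.cat([video, audio], dim=-1))

# Randomly zero out the audio modality for the student (missing-modality simulation).
drop_audio = torch.rand(4, 1) < 0.5
s_in = torch.cat([video, torch.where(drop_audio, torch.zeros_like(audio), audio)], dim=-1)
s_logits = student(s_in)

kd_loss = F.kl_div(F.log_softmax(s_logits / 2.0, dim=-1),
                   F.softmax(t_logits / 2.0, dim=-1), reduction='batchmean')
print(kd_loss.item())
```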
https://arxiv.org/abs/2504.08578
Micro-Action Recognition (MAR) aims to classify subtle human actions in video. However, annotating MAR datasets is particularly challenging due to the subtlety of actions. To this end, we introduce the setting of Semi-Supervised MAR (SSMAR), where only a portion of the samples are labeled. We first apply traditional Semi-Supervised Learning (SSL) methods to SSMAR and find that these methods tend to overfit on inaccurate pseudo-labels, leading to error accumulation and degraded performance. This issue primarily arises from the common practice of directly using the predictions of the classifier as pseudo-labels to train the model. To solve this issue, we propose a novel framework, called Asynchronous Pseudo Labeling and Training (APLT), which explicitly separates the pseudo-labeling process from model training. Specifically, we introduce a semi-supervised clustering method during the offline pseudo-labeling phase to generate more accurate pseudo-labels. Moreover, a self-adaptive thresholding strategy is proposed to dynamically filter noisy labels of different classes. We then build a memory-based prototype classifier based on the filtered pseudo-labels, which is fixed and used to guide the subsequent model training phase. By alternating the two pseudo-labeling and model training phases in an asynchronous manner, the model can not only be learned with more accurate pseudo-labels but also avoid the overfitting issue. Experiments on three MAR datasets show that our APLT largely outperforms state-of-the-art SSL methods. For instance, APLT improves accuracy by 14.5% over FixMatch on the MA-12 dataset when using only 50% labeled data. Code will be publicly available.
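The filtered-prototype idea can be sketched as follows (the class-wise threshold and the nearest-prototype rule are illustrative assumptions based on the abstract): confident pseudo-labels are kept per class, averaged into fixed prototypes, and the prototypes then label data for the next training phase.

```python
import torch
import torch.nn.functional as F

feats = torch.randn(100, 64)                          # features of unlabeled clips
probs = torch.softmax(torch.randn(100, 12), dim=-1)   # classifier predictions
conf, pseudo = probs.max(dim=-1)

prototypes = []
for c in range(12):
    mask = pseudo == c
    if not mask.any():
        prototypes.append(feats.mean(dim=0))          # fallback for empty classes
        continue
    thr = conf[mask].mean()                           # class-wise adaptive threshold (illustrative)
    keep = mask & (conf >= thr)
    chosen = keep if keep.any() else mask
    prototypes.append(feats[chosen].mean(dim=0))
prototypes = torch.stack(prototypes)                  # fixed, memory-based prototype classifier

logits = F.normalize(feats, dim=-1) @ F.normalize(prototypes, dim=-1).t()
print(logits.argmax(dim=-1)[:10])                     # labels that guide the next training phase
```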
https://arxiv.org/abs/2504.07785
Caregiving of older adults is an urgent global challenge, with many older adults preferring to age in place rather than enter residential care. However, providing adequate home-based assistance remains difficult, particularly in geographically vast regions. Teleoperated robots offer a promising solution, but conventional motion-mapping teleoperation imposes unnatural movement constraints on operators, leading to muscle fatigue and reduced usability. This paper presents a novel teleoperation framework that leverages action recognition to enable intuitive remote robot control. Using our simplified Spatio-Temporal Graph Convolutional Network (S-ST-GCN), the system recognizes human actions and executes corresponding preset robot trajectories, eliminating the need for direct motion synchronization. A finite-state machine (FSM) is integrated to enhance reliability by filtering out misclassified actions. Our experiments demonstrate that the proposed framework enables effortless operator movement while ensuring accurate robot execution. This proof-of-concept study highlights the potential of teleoperation with action recognition for enabling caregivers to remotely assist older adults during activities of daily living (ADLs). Future work will focus on improving the S-ST-GCN's recognition accuracy and generalization, integrating advanced motion planning techniques to further enhance robotic autonomy in older adult care, and conducting a user study to evaluate the system's telepresence and ease of control.
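The abstract does not give the FSM's states, but a minimal stand-in that filters spurious predictions by requiring several consecutive agreeing classifications before a preset robot trajectory is triggered might look like this:

```python
class ActionFSM:
    def __init__(self, required_votes=5):
        self.required = required_votes
        self.candidate = None
        self.votes = 0

    def update(self, predicted_action):
        if predicted_action == self.candidate:
            self.votes += 1
        else:
            self.candidate, self.votes = predicted_action, 1
        if self.votes >= self.required:
            self.votes = 0
            return self.candidate        # commit: execute the preset trajectory
        return None                      # stay idle; likely a spurious prediction

fsm = ActionFSM(required_votes=3)
stream = ["wave", "wave", "point", "wave", "wave", "wave"]
print([fsm.update(p) for p in stream])   # [None, None, None, None, None, 'wave']
```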
https://arxiv.org/abs/2504.07001
Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
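A hedged sketch of the two manipulations (the segment representation and the co-occurrence statistic are simplifications of the paper's setup): Action Masking blanks frames of the most frequent actions, and Sequence Shuffling permutes the order of action segments.

```python
import random
from collections import Counter

def action_masking(segments, top_k=1):
    """segments: list of (action_label, frames). Masks frames of the most frequent actions."""
    common = {a for a, _ in Counter(a for a, _ in segments).most_common(top_k)}
    return [(a, [None] * len(f) if a in common else f) for a, f in segments]

def sequence_shuffling(segments, seed=0):
    shuffled = segments[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

video = [("crack egg", list(range(30))), ("whisk", list(range(20))), ("crack egg", list(range(10)))]
print([a for a, f in action_masking(video)])
print([a for a, f in sequence_shuffling(video)])
```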
https://arxiv.org/abs/2504.06580
Few-Shot Action Recognition (FSAR) aims to train a model with only a few labeled video instances. A key challenge in FSAR is handling divergent narrative trajectories for precise video matching. While frame- and tuple-level alignment approaches have been promising, they heavily rely on pre-defined and length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds. In this work, we introduce a novel TEmporal Alignment-free Matching (TEAM) approach, which eliminates the need for temporal units in action representation and brute-force alignment during matching. Specifically, TEAM represents each video with a fixed set of pattern tokens that capture globally discriminative clues within the video instance regardless of action length or speed, ensuring its flexibility. Furthermore, TEAM is inherently efficient, using token-wise comparisons to measure similarity between videos, unlike existing methods that rely on pairwise comparisons for temporal alignment. Additionally, we propose an adaptation process that identifies and removes common information across classes, establishing clear boundaries even between novel categories. Extensive experiments demonstrate the effectiveness of TEAM. Code is available at this http URL.
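A compact sketch of alignment-free matching with pattern tokens (token count, dimensions, and the pooling rule are assumptions): learnable tokens cross-attend over frame features, and two videos are compared token by token rather than by aligning frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatternTokens(nn.Module):
    def __init__(self, n_tokens=8, dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames):                      # frames: (B, T, dim)
        q = self.tokens.unsqueeze(0).expand(frames.size(0), -1, -1)
        out, _ = self.attn(q, frames, frames)       # tokens gather globally discriminative clues
        return F.normalize(out, dim=-1)             # (B, n_tokens, dim)

def token_similarity(a, b):
    """Token-wise (not pairwise-frame) similarity between two videos."""
    return (a * b).sum(-1).mean(-1)                 # cosine per token, averaged

enc = PatternTokens()
v1, v2 = torch.randn(2, 12, 128), torch.randn(2, 20, 128)   # different lengths are fine
print(token_similarity(enc(v1), enc(v2)))
```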
https://arxiv.org/abs/2504.05956
Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, limiting our understanding of their generalization in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs outperform for fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.
https://arxiv.org/abs/2504.05706
Energy-efficient image acquisition on the edge is crucial for enabling remote sensing applications where the sensor node has weak compute capabilities and must transmit data to a remote server/cloud for processing. To reduce the edge energy consumption, this paper proposes a sensor-algorithm co-designed system called SnapPix, which compresses raw pixels in the analog domain inside the sensor. We use coded exposure (CE) as the in-sensor compression strategy as it offers the flexibility to sample, i.e., selectively expose pixels, both spatially and temporally. SnapPix makes three contributions. First, we propose a task-agnostic strategy to learn the sampling/exposure pattern based on the classic theory of efficient coding. Second, we co-design the downstream vision model with the exposure pattern to address the pixel-level non-uniformity unique to CE-compressed images. Finally, we propose lightweight augmentations to the image sensor hardware to support our in-sensor CE compression. Evaluating on action recognition and video reconstruction, SnapPix outperforms state-of-the-art video-based methods at the same speed while reducing the energy by up to 15.4x. We have open-sourced the code at: this https URL.
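A toy emulation of coded-exposure compression (the real exposure pattern is learned and realized in analog sensor hardware; here it is a random binary mask in NumPy): each pixel integrates light only on the frames its code selects, collapsing a clip into a single coded image.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64
clip = rng.random((T, H, W)).astype(np.float32)           # raw frames
code = (rng.random((T, H, W)) < 0.5).astype(np.float32)   # per-pixel exposure pattern
                                                          # (learned in SnapPix, random here)
coded_image = (clip * code).sum(axis=0) / np.maximum(code.sum(axis=0), 1.0)
print(coded_image.shape)   # (64, 64): only one frame's worth of data leaves the sensor
```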
https://arxiv.org/abs/2504.04535
Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area environmental conditions, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method and introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments. The MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the method also integrates an external human detection module to enhance spatial feature learning. Experiments on MultiSensor-Home and MM-Office datasets demonstrate the superiority of MultiTSF over state-of-the-art methods. The quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition.
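The inter-view fusion can be pictured as a standard Transformer encoder over per-view feature tokens; the sketch below uses illustrative sizes and omits the audio branch and the human detection module.

```python
import torch
import torch.nn as nn

n_views, dim, n_classes = 4, 256, 10
view_feats = torch.randn(2, n_views, dim)            # (batch, views, features)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
classifier = nn.Linear(dim, n_classes)

fused = encoder(view_feats)                          # views attend to each other
logits = classifier(fused.mean(dim=1))               # pool over views
print(logits.shape)                                  # (2, 10)
```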
https://arxiv.org/abs/2504.02287
Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. However, existing methods often fall short of addressing real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method. The proposed method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships and capture temporal dependencies across multiple views. Additionally, we introduce a Human Detection Module to generate pseudo-ground-truth labels, enabling the model to prioritize frames containing human activity and enhance spatial feature learning. Comprehensive experiments conducted on our in-house MultiSensor-Home dataset and the existing MM-Office dataset demonstrate that MultiTSF outperforms state-of-the-art methods in both video sequence-level and frame-level action recognition settings.
https://arxiv.org/abs/2504.02279
Lifelogging involves continuously capturing personal data through wearable cameras, providing an egocentric view of daily activities. Lifelog retrieval aims to search and retrieve relevant moments from this data, yet existing methods largely overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding. In this work, we introduce LSC-ADL, an ADL-annotated lifelog dataset derived from the LSC dataset, incorporating Activities of Daily Living (ADLs) as a structured semantic layer. Using a semi-automatic approach featuring the HDBSCAN algorithm for intra-class clustering and human-in-the-loop verification, we generate accurate ADL annotations to enhance retrieval explainability. By integrating action recognition into lifelog retrieval, LSC-ADL bridges a critical gap in existing research, offering a more context-aware representation of daily life. We believe this dataset will advance research in lifelog retrieval, activity recognition, and egocentric vision, ultimately improving the accuracy and interpretability of retrieved content. The ADL annotations can be downloaded at this https URL.
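A sketch of the semi-automatic annotation step (the frame-embedding features and clustering parameters are assumptions; scikit-learn >= 1.3 provides HDBSCAN): frames already assigned to one ADL class are clustered so that a human only needs to verify one exemplar per cluster rather than every frame.

```python
import numpy as np
from sklearn.cluster import HDBSCAN   # requires scikit-learn >= 1.3

frame_embeddings = np.random.rand(500, 512)             # embeddings of frames for one ADL class
clusterer = HDBSCAN(min_cluster_size=10)
cluster_ids = clusterer.fit_predict(frame_embeddings)   # -1 marks noise to re-check manually
print(np.unique(cluster_ids))

for cid in sorted(set(cluster_ids) - {-1}):
    exemplar = np.flatnonzero(cluster_ids == cid)[0]
    print(f"cluster {cid}: verify frame {exemplar} with a human in the loop")
```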
https://arxiv.org/abs/2504.02060
Video understanding has shown remarkable improvements in recent years, largely dependent on the availability of large-scale labeled datasets. Recent advancements in visual-language models, especially those based on contrastive pretraining, have shown remarkable generalization in zero-shot tasks, helping to overcome this dependence on labeled datasets. Adaptations of such models for videos typically involve modifying the architecture of vision-language models to cater to video data. However, this is not trivial, since such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture. This preserves its generalization abilities. TP-CLIP efficiently integrates into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and greater computational efficiency. In particular, we use just 1/3 the GFLOPs and 1/28 the number of tuneable parameters of the recent state of the art and still outperform it by up to 15.8% depending on the task and dataset.
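Conceptually, temporal visual prompting adds a small set of trainable parameters while the backbone stays frozen; in this sketch a frozen linear layer stands in for the CLIP visual encoder, and the prompt placement is an assumption.

```python
import torch
import torch.nn as nn

class TemporalPrompting(nn.Module):
    def __init__(self, n_frames=8, dim=512):
        super().__init__()
        self.frozen_encoder = nn.Linear(dim, dim)          # stand-in for the frozen CLIP backbone
        for p in self.frozen_encoder.parameters():
            p.requires_grad = False
        self.temporal_prompts = nn.Parameter(torch.zeros(n_frames, dim))  # the only new weights

    def forward(self, frame_tokens):                       # (B, T, dim) per-frame features
        prompted = frame_tokens + self.temporal_prompts    # inject temporal cues per frame
        return self.frozen_encoder(prompted).mean(dim=1)   # video embedding for matching

model = TemporalPrompting()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)                                  # only the prompts are tuned
print(model(torch.randn(2, 8, 512)).shape)
```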
https://arxiv.org/abs/2504.01890
Knowledge Distillation (KD) compresses neural networks by learning a small network (student) via transferring knowledge from a pre-trained large network (teacher). Many endeavours have been devoted to the image domain, while few works focus on video analysis, which requires training much larger models that can hardly be deployed on resource-limited devices. Moreover, traditional methods neglect two important problems: 1) since a capacity gap exists between the teacher and the student, some knowledge w.r.t. difficult-to-transfer samples cannot be correctly transferred, or may even harm the final performance of the student; and 2) as training progresses, difficult-to-transfer samples may become easier to learn, and vice versa. To alleviate these two problems, we propose a Sample-level Adaptive Knowledge Distillation (SAKD) framework for action recognition. In particular, it mainly consists of a sample distillation difficulty evaluation module and a sample adaptive distillation module. The former applies temporal interruption to frames, i.e., randomly dropping or shuffling frames during training, which increases the learning difficulty of samples during distillation so as to better discriminate their distillation difficulty. The latter module adaptively adjusts the distillation ratio at the sample level, such that the KD loss dominates training on easy-to-transfer samples while the vanilla loss dominates on difficult-to-transfer samples. More importantly, we only select samples with both low distillation difficulty and high diversity to train the student model, reducing computational cost. Experimental results on two video benchmarks and one image benchmark demonstrate the superiority of the proposed method by striking a good balance between performance and efficiency.
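A hedged sketch of the two ingredients (the difficulty proxy and the weighting function are illustrative, not the paper's exact formulas): temporal interruption makes clips harder so distillation difficulty can be probed, and a per-sample ratio shifts weight between the KD loss and the vanilla loss.

```python
import torch
import torch.nn.functional as F

def temporal_interruption(clip, drop_p=0.3):
    """Randomly zero out frames to raise learning difficulty during distillation."""
    keep = (torch.rand(clip.size(0), clip.size(1), 1, 1, 1) > drop_p).float()
    return clip * keep

clip = torch.randn(4, 16, 3, 32, 32)              # (batch, frames, C, H, W) toy clips
probe = temporal_interruption(clip)               # harder clips; the paper scores difficulty
                                                  # from predictions on such interrupted inputs

student_logits = torch.randn(4, 10, requires_grad=True)   # stand-ins for network outputs
teacher_logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))

kd = F.kl_div(F.log_softmax(student_logits, -1),
              F.softmax(teacher_logits, -1), reduction='none').sum(-1)
ce = F.cross_entropy(student_logits, targets, reduction='none')

# Sample-level ratio: low teacher-student disagreement = easy to transfer = KD dominates.
ratio = torch.sigmoid(-kd.detach())
loss = (ratio * kd + (1 - ratio) * ce).mean()
print(loss.item())
```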
https://arxiv.org/abs/2504.00606
We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics, EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, consistently showing balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. With the favorable performance of CAVA across these datasets, we demonstrate the effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
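A simplified bottleneck cross-attention block between two expert streams (token counts, dimensions, and the exact information path are assumptions): a small set of bottleneck tokens reads from one expert and is then read by the other, so the streams exchange information cheaply.

```python
import torch
import torch.nn as nn

class BottleneckCrossAttention(nn.Module):
    def __init__(self, dim=256, n_bottleneck=4, heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(n_bottleneck, dim))
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, src_tokens, dst_tokens):
        b = self.bottleneck.unsqueeze(0).expand(src_tokens.size(0), -1, -1)
        b, _ = self.read(b, src_tokens, src_tokens)      # compress the source expert
        out, _ = self.write(dst_tokens, b, b)            # inject into the destination expert
        return dst_tokens + out                          # residual exchange

bca = BottleneckCrossAttention()
spatial = torch.randn(2, 196, 256)     # spatial expert tokens (e.g. patches)
temporal = torch.randn(2, 16, 256)     # temporal expert tokens (e.g. frames)
print(bca(spatial, temporal).shape)    # temporal expert enriched with spatial cues
```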
https://arxiv.org/abs/2503.23447
Human action recognition in low-light environments is crucial for various real-world applications. However, existing approaches overlook the full utilization of brightness information throughout the training phase, leading to suboptimal performance. To address this limitation, we propose OwlSight, a biomimetic-inspired framework that couples whole-stage illumination enhancement with action classification for accurate human action recognition in dark videos. Specifically, OwlSight incorporates a Time-Consistency Module (TCM) to capture shallow spatiotemporal features while maintaining temporal coherence, which are then processed by a Luminance Adaptation Module (LAM) to dynamically adjust the brightness based on the input luminance distribution. Furthermore, a Reflect Augmentation Module (RAM) is presented to maximize illumination utilization and simultaneously enhance action recognition via two interactive paths. Additionally, we build Dark-101, a large-scale dataset comprising 18,310 dark videos across 101 action categories, significantly surpassing existing datasets (e.g., ARID1.5 and Dark-48) in scale and diversity. Extensive experiments demonstrate that the proposed OwlSight achieves state-of-the-art performance across four low-light action recognition benchmarks. Notably, it outperforms the previous best approaches by 5.36% on ARID1.5 and 1.72% on Dark-101, highlighting its effectiveness in challenging dark environments.
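A toy luminance adaptation in the spirit of LAM (the real module is learned; the gamma-from-mean-luminance rule here is purely an illustrative assumption): darker clips receive a stronger brightening curve.

```python
import torch

def adapt_luminance(frames, target_mean=0.45):
    """frames: (T, 3, H, W) in [0, 1]. Gamma chosen from the clip's own brightness."""
    mean_lum = frames.mean().clamp(1e-3, 1.0)
    gamma = torch.log(torch.tensor(target_mean)) / torch.log(mean_lum)
    return frames.clamp(min=1e-3) ** gamma

dark_clip = torch.rand(8, 3, 64, 64) * 0.2      # simulate a low-light clip
bright_clip = adapt_luminance(dark_clip)
print(dark_clip.mean().item(), bright_clip.mean().item())
```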
https://arxiv.org/abs/2503.23266
The growing ageing population and their preference to maintain independence by living in their own homes require proactive strategies to ensure safety and support. Ambient Assisted Living (AAL) technologies have emerged to facilitate ageing in place by offering continuous monitoring and assistance within the home. Within AAL technologies, action recognition plays a crucial role in interpreting human activities and detecting incidents like falls, mobility decline, or unusual behaviours that may signal worsening health conditions. However, action recognition in practical AAL applications presents challenges, including occlusions, noisy data, and the need for real-time performance. While advancements have been made in accuracy, robustness to noise, and computation efficiency, achieving a balance among them all remains a challenge. To address this challenge, this paper introduces the Robust and Efficient Temporal Convolution network (RE-TCN), which comprises three main elements: Adaptive Temporal Weighting (ATW), Depthwise Separable Convolutions (DSC), and data augmentation techniques. These elements aim to enhance the model's accuracy, robustness against noise and occlusion, and computational efficiency within real-world AAL contexts. RE-TCN outperforms existing models in terms of accuracy, noise and occlusion robustness, and has been validated on four benchmark datasets: NTU RGB+D 60, Northwestern-UCLA, SHREC'17, and DHG-14/28. The code is publicly available at: this https URL
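A sketch of the two named ingredients (channel sizes and the gating form are assumptions): a depthwise separable temporal convolution keeps computation low, and a learned per-frame gate down-weights noisy or occluded frames.

```python
import torch
import torch.nn as nn

class DSTemporalBlock(nn.Module):
    def __init__(self, channels=64, kernel=9):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel, padding=kernel // 2,
                                   groups=channels)          # per-channel temporal filter
        self.pointwise = nn.Conv1d(channels, channels, 1)    # cheap channel mixing
        self.frame_gate = nn.Conv1d(channels, 1, 1)          # adaptive temporal weighting

    def forward(self, x):                 # x: (B, C, T) pooled skeleton features
        y = self.pointwise(self.depthwise(x))
        w = torch.sigmoid(self.frame_gate(y))                # (B, 1, T) frame importance
        return y * w

block = DSTemporalBlock()
print(block(torch.randn(2, 64, 50)).shape)   # (2, 64, 50)
```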
https://arxiv.org/abs/2503.23214
Force estimation in human-object interactions is crucial for various fields like ergonomics, physical therapy, and sports science. Traditional methods depend on specialized equipment such as force plates and sensors, which makes accurate assessments both expensive and restricted to laboratory settings. In this paper, we introduce ForcePose, a novel deep learning framework that estimates applied forces by combining human pose estimation with object detection. Our approach leverages MediaPipe for skeletal tracking and SSD MobileNet for object recognition to create a unified representation of human-object interaction. We've developed a specialized neural network that processes both spatial and temporal features to predict force magnitude and direction without needing any physical sensors. After training on our dataset of 850 annotated videos with corresponding force measurements, our model achieves a mean absolute error of 5.83 N in force magnitude and 7.4 degrees in force direction. When compared to existing computer vision approaches, our method performs 27.5% better while still offering real-time performance on standard computing hardware. ForcePose opens up new possibilities for force analysis in diverse real-world scenarios where traditional measurement tools are impractical or intrusive. This paper discusses our methodology, the dataset creation process, evaluation metrics, and potential applications across rehabilitation, ergonomics assessment, and athletic performance analysis.
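A minimal sketch of a sensor-free force regressor of this kind (the GRU, feature sizes, and output heads are assumptions; the pose dimension follows MediaPipe's 33 landmarks): per-frame pose and object-detection features are fused over time to predict force magnitude and direction.

```python
import torch
import torch.nn as nn

class ForceRegressor(nn.Module):
    def __init__(self, pose_dim=33 * 3, obj_dim=16, hidden=128):
        super().__init__()
        self.temporal = nn.GRU(pose_dim + obj_dim, hidden, batch_first=True)
        self.magnitude = nn.Linear(hidden, 1)     # Newtons
        self.direction = nn.Linear(hidden, 3)     # unit vector

    def forward(self, pose_seq, obj_seq):         # (B, T, pose_dim), (B, T, obj_dim)
        h, _ = self.temporal(torch.cat([pose_seq, obj_seq], dim=-1))
        last = h[:, -1]
        return self.magnitude(last), nn.functional.normalize(self.direction(last), dim=-1)

model = ForceRegressor()
mag, direction = model(torch.randn(2, 30, 99), torch.randn(2, 30, 16))
print(mag.shape, direction.shape)    # (2, 1) (2, 3)
```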
https://arxiv.org/abs/2503.22363