3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline: first detecting 2D keypoints in each view, then associating these detections across views to triangulate the 3D pose. Existing methods rely solely on pairwise associations to model this correspondence problem, treating global consistency across views (i.e., cycle consistency) as a soft constraint. Yet reconciling these constraints across multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy that substantially reduces the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
https://arxiv.org/abs/2601.09698
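The geometric pruning idea can be made concrete with a small sketch: enumerate cross-view candidate tuples (one detection per view), discard any tuple whose pairwise epipolar error exceeds a threshold, and resolve the survivors into a partition. COMPOSE solves the resulting integer linear program exactly; the greedy selection below is a simplified stand-in, and all function and parameter names are hypothetical.

```python
import numpy as np
from itertools import product

def epipolar_error(x1, x2, F):
    """Symmetric epipolar distance (pixels) between 2D points x1, x2
    under fundamental matrix F relating the two views."""
    p1, p2 = np.append(x1, 1.0), np.append(x2, 1.0)
    l2 = F @ p1          # epipolar line of x1 in view 2
    l1 = F.T @ p2        # epipolar line of x2 in view 1
    d2 = abs(p2 @ l2) / np.hypot(l2[0], l2[1])
    d1 = abs(p1 @ l1) / np.hypot(l1[0], l1[1])
    return 0.5 * (d1 + d2)

def prune_and_partition(dets, F_mats, thresh=8.0):
    """dets[v]: list of 2D detections in view v (assumes each person is
    visible in every view). F_mats[(i, j)]: fundamental matrix, i < j.
    Returns cross-view tuples (one detection index per view)."""
    views = range(len(dets))
    candidates = []
    for combo in product(*(range(len(d)) for d in dets)):
        errs = [epipolar_error(dets[i][combo[i]], dets[j][combo[j]], F_mats[(i, j)])
                for i in views for j in views if i < j]
        if max(errs) < thresh:             # geometric pruning
            candidates.append((np.mean(errs), combo))
    candidates.sort()                       # greedy stand-in for the ILP
    used = [set() for _ in views]
    partition = []
    for _, combo in candidates:
        if all(combo[v] not in used[v] for v in views):
            partition.append(combo)
            for v in views:
                used[v].add(combo[v])
    return partition
```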
In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. However, most contrastive paradigms are inherently discriminative and often struggle to capture the variability and uncertainty intrinsic to human motion. To address this issue, we propose a variational contrastive learning framework that integrates probabilistic latent modeling with contrastive self-supervised learning. This formulation enables the learning of structured and semantically meaningful representations that generalize across datasets and supervision levels. Extensive experiments on three widely used skeleton-based action recognition benchmarks show that our proposed method consistently outperforms existing approaches, particularly in low-label regimes. Moreover, qualitative analyses show that, compared to other methods, the features learned by our method better reflect motion and sample characteristics and place more focus on important skeleton joints.
https://arxiv.org/abs/2601.07666
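A minimal sketch of how a variational contrastive objective can be assembled, assuming a Gaussian posterior per sample, the reparameterization trick, and an in-batch InfoNCE term; the paper's exact formulation and weighting may differ.

```python
import torch
import torch.nn.functional as F

def variational_contrastive_loss(mu_q, logvar_q, mu_k, temperature=0.1, kl_weight=1e-3):
    """mu_q, logvar_q: [N, D] Gaussian posterior params for query skeletons.
    mu_k: [N, D] embeddings of the positive (augmented) views.
    Samples z ~ N(mu_q, var_q) via reparameterization, applies InfoNCE
    against in-batch negatives, and adds a KL(q || N(0, I)) regularizer."""
    std = torch.exp(0.5 * logvar_q)
    z = mu_q + std * torch.randn_like(std)             # reparameterization trick
    z = F.normalize(z, dim=1)
    k = F.normalize(mu_k, dim=1)
    logits = z @ k.t() / temperature                   # [N, N] similarities
    labels = torch.arange(z.size(0), device=z.device)  # positives on the diagonal
    nce = F.cross_entropy(logits, labels)
    kl = -0.5 * torch.mean(1 + logvar_q - mu_q.pow(2) - logvar_q.exp())
    return nce + kl_weight * kl
```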
From Vision-Language-Action (VLA) systems to robotics, existing egocentric datasets primarily focus on action recognition tasks, while largely overlooking the inherent role of motion analysis in sports and other fast-movement scenarios. To bridge this gap, we propose a real-time motion focus recognition method that estimates the subject's locomotion intention from any egocentric video. Our approach leverages a foundation model for camera pose estimation and introduces system-level optimizations to enable efficient and scalable inference. Evaluated on an egocentric action dataset we collected, our method achieves real-time performance with manageable memory consumption through a sliding batch inference strategy. This work makes motion-centric analysis practical for edge deployment and offers a complementary perspective to existing egocentric studies on sports and fast-movement activities.
https://arxiv.org/abs/2601.07154
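A sketch of the sliding-batch idea, assuming the model consumes fixed-length frame windows; the window and stride values are illustrative, not the paper's settings. Only one window of frames is ever resident in memory.

```python
import numpy as np

def sliding_batch_inference(frames, model_fn, window=16, stride=8):
    """Run a per-window model over a long egocentric video with bounded memory.
    frames: iterable of HxWx3 arrays; model_fn: maps [window, H, W, 3] -> prediction."""
    buf = []
    for t, frame in enumerate(frames):
        buf.append(frame)
        if len(buf) == window:
            yield t, model_fn(np.stack(buf))
            del buf[:stride]       # slide: drop the oldest `stride` frames
```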
Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering "where" and "who". However, they often function as semantically blind observers, capable of tracing geometric paths yet unable to grasp the "what" and "why" behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose LLMTrack, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples strong localization from deep understanding, using Grounding DINO as the eyes and the LLaVA-OneVision multimodal large model as the brain. We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level context, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy (Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA) to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.
https://arxiv.org/abs/2601.06550
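As a rough illustration of the fusion step, the toy module below cross-attends per-trajectory instance tokens over video-level context tokens and pools them into a summary vector for the LLM; the dimensions and structure are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    """Toy stand-in for the fusion module: cross-attends per-track instance
    tokens over video-level context tokens, then pools over time."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, instance_tokens, context_tokens):
        # instance_tokens: [B, T, D] per-trajectory features
        # context_tokens:  [B, S, D] video-level features
        fused, _ = self.attn(instance_tokens, context_tokens, context_tokens)
        return self.proj(fused.mean(dim=1))   # [B, D] summary for the LLM
```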
Understanding student behavior in the classroom is essential to improving both pedagogical quality and student engagement. Existing methods for predicting student engagement typically require substantial annotated data to model the diversity of student behaviors, yet privacy concerns often restrict researchers to their own proprietary datasets. Moreover, the classroom context, reflected in peers' actions, is typically ignored. To address these limitations, we propose a novel three-stage framework for video-based student engagement measurement. First, we explore few-shot adaptation of a vision-language model (VLM) for student action recognition, fine-tuning it to distinguish among action categories with only a few training samples. Second, to handle continuous and unpredictable student actions, we use a sliding temporal window to divide each student's 2-minute video into non-overlapping segments. Each segment is assigned an action category by the fine-tuned VLM, yielding a sequence of action predictions. Finally, we leverage a large language model to classify this entire action sequence, together with the classroom context, as belonging to an engaged or disengaged student. The experimental results demonstrate the effectiveness of the proposed approach in identifying student engagement.
https://arxiv.org/abs/2601.06394
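Stages two and three reduce to a simple loop, sketched below with a stubbed classify_segment standing in for the fine-tuned VLM; the segment length and prompt wording are illustrative assumptions.

```python
def segment_and_prompt(video_frames, fps, classify_segment, segment_sec=10):
    """Split a clip into non-overlapping segments, label each with a
    fine-tuned VLM (stubbed as `classify_segment`), and build the LLM prompt."""
    seg_len = int(segment_sec * fps)
    actions = [classify_segment(video_frames[i:i + seg_len])
               for i in range(0, len(video_frames) - seg_len + 1, seg_len)]
    return ("The student performed the following actions in order: "
            + ", ".join(actions)
            + ". Given the classroom context, is this student engaged or disengaged?")
```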
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
https://arxiv.org/abs/2601.04778
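A hedged sketch of how textual and visual preferences can share one DPO-style objective: each modality contributes a standard DPO term over (chosen, rejected) log-probabilities relative to a frozen reference model, and the two terms are summed. The 1:1 weighting and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO: log-sigmoid of the scaled reward margin between the
    chosen (w) and rejected (l) responses, relative to a frozen reference."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

def mixdpo_loss(text_pairs, visual_pairs, beta=0.1, visual_weight=1.0):
    """Each *_pairs is a tuple (logp_w, logp_l, ref_logp_w, ref_logp_l).
    Textual pairs contrast captions; visual pairs contrast the original
    video against its counterfactual. The equal weighting is an assumption."""
    return dpo_loss(*text_pairs, beta) + visual_weight * dpo_loss(*visual_pairs, beta)
```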
We present a novel approach for egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.
https://arxiv.org/abs/2601.03667
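A sketch of the token-assembly step, with track_fn standing in for CoTracker: random query points are tracked through the clip, and each point's normalized trajectory becomes one input token for the Transformer. Shapes and normalization are assumptions.

```python
import numpy as np

def build_track_tokens(track_fn, first_frame, num_points=64, num_frames=32):
    """track_fn stands in for CoTracker: maps (frame, query_points[N, 2]) to
    trajectories [T, N, 2] in pixel coordinates. Returns one token per point:
    its (x, y) path flattened and normalized to [0, 1] by image size."""
    h, w = first_frame.shape[:2]
    queries = np.random.rand(num_points, 2) * [w, h]   # random initial points
    tracks = track_fn(first_frame, queries)            # [T, N, 2]
    tracks = tracks[:num_frames] / [w, h]
    return tracks.transpose(1, 0, 2).reshape(num_points, -1)  # [N, T*2] tokens
```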
Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting the subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expert learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
https://arxiv.org/abs/2601.00369
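The Noisy-OR fusion rule itself is compact: the fused score for a class is the probability that at least one stream supports it, optionally damped by per-stream reliabilities. A minimal sketch follows (in the paper the reliabilities are learned, not supplied by hand).

```python
import numpy as np

def noisy_or(stream_probs, reliabilities=None):
    """Noisy-OR fusion of per-stream class probabilities.
    stream_probs: [S, C] probabilities from S streams over C classes.
    reliabilities: optional [S] weights in [0, 1] damping unreliable streams.
    Returns [C] fused scores: the chance at least one stream supports a class."""
    p = np.asarray(stream_probs, dtype=float)
    if reliabilities is not None:
        p = p * np.asarray(reliabilities, dtype=float)[:, None]
    return 1.0 - np.prod(1.0 - p, axis=0)

# e.g. a body stream at 0.6 and a hand stream at 0.5 fuse to 1 - 0.4*0.5 = 0.8
```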
4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
https://arxiv.org/abs/2601.00092
Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets can be found at this https URL.
https://arxiv.org/abs/2512.25067
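FineTec derives joint accelerations from Lagrangian dynamics; as a simplified kinematic stand-in, a central finite difference over joint positions shows the kind of acceleration signal that is fused with positions for the GCN head.

```python
import numpy as np

def joint_accelerations(positions, dt=1.0 / 30.0):
    """Central-difference acceleration from joint positions.
    positions: [T, J, 3] skeleton sequence sampled at frame interval dt.
    Returns [T-2, J, 3]. (FineTec estimates accelerations via Lagrangian
    dynamics; this kinematic estimate is a simplified stand-in.)"""
    return (positions[2:] - 2.0 * positions[1:-1] + positions[:-2]) / dt**2
```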
The development of effective surgical training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, enhanced with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system: it achieves 87.7% frame-level accuracy in action segmentation, rising to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. These findings highlight the system's potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.
https://arxiv.org/abs/2512.24411
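The jump from 87.7% to 93.62% comes from post-processing the frame-level labels. The abstract does not spell out the procedure, but majority-vote smoothing over a sliding window is a typical choice and illustrates the effect.

```python
from collections import Counter

def smooth_labels(frame_labels, window=15):
    """Majority-vote smoothing of per-frame action labels, a common
    post-processing step for segmentation outputs (illustrative only;
    the paper's exact procedure is not specified in the abstract)."""
    half = window // 2
    out = []
    for i in range(len(frame_labels)):
        lo, hi = max(0, i - half), min(len(frame_labels), i + half + 1)
        out.append(Counter(frame_labels[lo:hi]).most_common(1)[0][0])
    return out
```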
3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issue of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source domain or any of the previous target domains. Lifelong DA for 3D HPE involves overcoming challenges in adapting to current-domain poses while preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance adaptation to the current domain and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.
https://arxiv.org/abs/2512.23860
While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolution networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, both PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective multimodal fusion settings: separate and unified modeling.
https://arxiv.org/abs/2512.21916
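The human-centric sampling step can be sketched as bilinear interpolation of a frame's patch-embedding map at 2D joint coordinates, giving one RGB token per graph node; the function below is an illustrative assumption, not PAN's exact operator.

```python
import numpy as np

def sample_joint_tokens(feature_map, joints_2d):
    """feature_map: [H, W, D] patch embeddings for one frame.
    joints_2d: [J, 2] (x, y) joint coords in feature-map units.
    Returns [J, D] node features via bilinear interpolation."""
    h, w, _ = feature_map.shape
    x = np.clip(joints_2d[:, 0], 0, w - 1.001)
    y = np.clip(joints_2d[:, 1], 0, h - 1.001)
    x0, y0 = x.astype(int), y.astype(int)
    fx, fy = (x - x0)[:, None], (y - y0)[:, None]
    f = feature_map
    return ((1 - fx) * (1 - fy) * f[y0, x0] + fx * (1 - fy) * f[y0, x0 + 1]
            + (1 - fx) * fy * f[y0 + 1, x0] + fx * fy * f[y0 + 1, x0 + 1])
```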
This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability. Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance. These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems' ability to process complex, multimodal inputs across diverse applications.
https://arxiv.org/abs/2512.20501
Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gains observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
https://arxiv.org/abs/2512.20451
Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
https://arxiv.org/abs/2512.20409
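A sketch of a weighted contrastive term in the spirit of the adaptive negative handling described above: each negative pair carries a weight, so suspected false negatives can be suppressed (weight near 0) and hard negatives emphasized (weight above 1). DETACH's actual loss also folds in spatial-temporal weighting; this is a simplified stand-in.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(video_emb, sensor_emb, neg_weights, temperature=0.07):
    """InfoNCE between exocentric-video and ambient-sensor embeddings [N, D],
    with per-negative weights [N, N] (diagonal ignored)."""
    v = F.normalize(video_emb, dim=1)
    s = F.normalize(sensor_emb, dim=1)
    sim = torch.exp(v @ s.t() / temperature)           # [N, N] scaled similarities
    pos = sim.diagonal()                               # matched video-sensor pairs
    w = neg_weights * (1 - torch.eye(len(v), device=v.device))
    denom = pos + (w * sim).sum(dim=1)
    return -torch.log(pos / denom).mean()
```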
Human Activity Recognition (HAR) plays a vital role in healthcare, surveillance, and smart environments, where reliable action recognition supports timely decision-making and automation. Although deep learning-based HAR systems are widely adopted, the impact of Activation Functions (AFs) and Model Optimizers (MOs) on performance has not been sufficiently analyzed, particularly regarding how their combinations influence model behavior in practical scenarios. Most existing studies focus on architecture design, while the interaction between AF and MO choices remains relatively unexplored. In this work, we investigate the effect of three commonly used activation functions (ReLU, Sigmoid, and Tanh) combined with four optimization algorithms (SGD, Adam, RMSprop, and Adagrad) using two recurrent deep learning architectures, namely BiLSTM and ConvLSTM. Experiments are conducted on six medically relevant activity classes selected from the HMDB51 and UCF101 datasets, considering their suitability for healthcare-oriented HAR applications. Our experimental results show that ConvLSTM consistently outperforms BiLSTM across both datasets. ConvLSTM, combined with Adam or RMSprop, achieves an accuracy of up to 99.00%, demonstrating strong spatio-temporal learning capabilities and stable performance. While BiLSTM performs reasonably well on UCF101, with accuracy approaching 98.00%, its performance drops to approximately 60.00% on HMDB51, indicating limited robustness across datasets and weaker sensitivity to AF and MO variations. This study provides practical insights for optimizing HAR systems, particularly for real-world healthcare environments where fast and precise activity detection is critical.
https://arxiv.org/abs/2512.20104
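The experimental grid is easy to picture in code: three activation functions crossed with four optimizers, instantiated per backbone. The sketch below uses a linear probe in place of the BiLSTM/ConvLSTM backbones for brevity; the learning rate is an assumption.

```python
import torch
import torch.nn as nn

ACTIVATIONS = {"relu": nn.ReLU, "sigmoid": nn.Sigmoid, "tanh": nn.Tanh}
OPTIMIZERS = {"sgd": torch.optim.SGD, "adam": torch.optim.Adam,
              "rmsprop": torch.optim.RMSprop, "adagrad": torch.optim.Adagrad}

def build_trial(act_name, opt_name, in_dim=128, n_classes=6, lr=1e-3):
    """One (AF, MO) cell of the grid. The paper pairs these with BiLSTM and
    ConvLSTM backbones; a small classifier stands in here for brevity."""
    model = nn.Sequential(nn.Linear(in_dim, 64),
                          ACTIVATIONS[act_name](),
                          nn.Linear(64, n_classes))
    optimizer = OPTIMIZERS[opt_name](model.parameters(), lr=lr)
    return model, optimizer

# 3 activations x 4 optimizers = 12 configurations per backbone
trials = [(a, o) for a in ACTIVATIONS for o in OPTIMIZERS]
```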
Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view inputs, while the dual-pathway SlowFast model experienced a 7.2 percent drop in accuracy due to representational conflicts. These findings suggest that simply adding visual context is not sufficient and may lead to interference unless the architecture is specifically designed to support multi-view integration. This study presents one of the first systematic comparisons of single- and dual-view distraction detection models using naturalistic driving data and underscores the importance of fusion-aware design for future multimodal driver monitoring systems.
https://arxiv.org/abs/2512.20025
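A minimal sketch of the stacked dual-view input configuration, assuming the two synchronized views are concatenated channel-wise so the first convolution sees both jointly; the paper's exact stacking scheme may differ.

```python
import numpy as np

def stack_dual_view(driver_clip, road_clip):
    """driver_clip, road_clip: [T, H, W, 3] synchronized frames.
    Returns [T, H, W, 6]: channel-wise stacking, so the first conv layer
    processes both views together."""
    assert driver_clip.shape == road_clip.shape, "views must be synchronized"
    return np.concatenate([driver_clip, road_clip], axis=-1)
```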
Graph Convolutional Networks (GCNs) demonstrate strong capability in modeling skeletal topology for action recognition, yet their dense floating-point computations incur high energy costs. Spiking Neural Networks (SNNs), characterized by event-driven and sparse activation, offer energy efficiency but remain limited in capturing coupled temporal-frequency and topological dependencies of human motion. To bridge this gap, this article proposes Signal-SGN++, a topology-aware spiking graph framework that integrates structural adaptivity with time-frequency spiking dynamics. The network employs a backbone composed of 1D Spiking Graph Convolution (1D-SGC) and Frequency Spiking Convolution (FSC) for joint spatiotemporal and spectral feature extraction. Within this backbone, a Topology-Shift Self-Attention (TSSA) mechanism is embedded to adaptively route attention across learned skeletal topologies, enhancing graph-level sensitivity without increasing computational complexity. Moreover, an auxiliary Multi-Scale Wavelet Transform Fusion (MWTF) branch decomposes spiking features into multi-resolution temporal-frequency representations, wherein a Topology-Aware Time-Frequency Fusion (TATF) unit incorporates structural priors to preserve topology-consistent spectral fusion. Comprehensive experiments on large-scale benchmarks validate that Signal-SGN++ achieves superior accuracy-efficiency trade-offs, outperforming existing SNN-based methods and achieving competitive results against state-of-the-art GCNs under substantially reduced energy consumption.
https://arxiv.org/abs/2512.22214
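The event-driven dynamics underlying the spiking layers can be illustrated with a plain leaky integrate-and-fire update, which emits binary spikes and resets on firing; Signal-SGN++'s actual neuron model and time constants may differ.

```python
import numpy as np

def lif_forward(inputs, tau=2.0, v_thresh=1.0):
    """Leaky integrate-and-fire over time.
    inputs: [T, N] input currents; returns binary spikes [T, N].
    The membrane potential leaks toward the input with rate 1/tau,
    fires when it reaches v_thresh, then resets to zero."""
    v = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs, dtype=float)
    for t, x in enumerate(inputs):
        v = v + (x - v) / tau                 # leaky integration
        spikes[t] = (v >= v_thresh).astype(float)
        v = np.where(spikes[t] > 0, 0.0, v)   # hard reset after a spike
    return spikes
```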
The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
https://arxiv.org/abs/2512.16504
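A sketch of the snippet-discrimination pretext task, assuming two augmented views of the same skeleton sequence: each view is split into non-overlapping snippets, the snippets are embedded, and the snippet at the same index in the other view is the positive while all others serve as negatives. The encoder and snippet length are placeholders.

```python
import torch
import torch.nn.functional as F

def snippet_discrimination_loss(seq_a, seq_b, encoder, snippet_len=16, temperature=0.1):
    """seq_a, seq_b: [T, J, C] two augmented views of one skeleton sequence.
    Splits each into non-overlapping snippets, encodes them, and contrasts
    matching snippet indices across the two views."""
    def embed(seq):
        snippets = seq.unfold(0, snippet_len, snippet_len)  # [S, J, C, L]
        z = encoder(snippets.flatten(1))                    # [S, D]
        return F.normalize(z, dim=1)
    za, zb = embed(seq_a), embed(seq_b)
    logits = za @ zb.t() / temperature                      # [S, S] similarities
    labels = torch.arange(za.size(0), device=za.device)     # positives on diagonal
    return F.cross_entropy(logits, labels)
```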