We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, existing models fail to realize the benefit of compositional recognition on unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
https://arxiv.org/abs/2601.16211
Despite significant progress in human action recognition, generalizing to diverse viewpoints remains a challenge. Most existing datasets are captured from ground-level perspectives, and models trained on them often struggle to transfer to drastically different domains such as aerial views. This paper examines how curriculum-based training strategies can improve generalization to unseen real aerial-view data without using any real aerial data during training. We explore curriculum learning for cross-view action recognition using two out-of-domain sources: synthetic aerial-view data and real ground-view data. We evaluate the order of training (fine-tuning on synthetic aerial data vs. real ground data) and compare two curriculum strategies that both transition from synthetic to real data but differ in how they make that transition. The first uses a two-stage curriculum with direct fine-tuning, while the second applies a progressive curriculum that expands the dataset in multiple stages before fine-tuning. We evaluate both methods on the REMAG dataset using SlowFast (CNN-based) and MViTv2 (Transformer-based) architectures. Results show that combining the two out-of-domain datasets clearly outperforms training on a single domain, whether real ground-view or synthetic aerial-view. Both curriculum strategies match the top-1 accuracy of simple dataset combination while offering efficiency gains. With the two-step fine-tuning method, SlowFast achieves up to a 37% reduction in iterations and MViTv2 up to a 30% reduction compared to simple combination. The multi-step progressive approach further reduces iterations, by up to 9% for SlowFast and 30% for MViTv2, relative to the two-step method. These findings demonstrate that curriculum-based training can maintain comparable performance (top-1 accuracy within a 3% range) while improving training efficiency in cross-view action recognition.
https://arxiv.org/abs/2601.14101
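The two curriculum schedules compared in the abstract above (two-stage direct fine-tuning vs. multi-step progressive expansion) can be sketched as stage-wise training-set builders. The clip names, stage count, and mixing fractions below are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical sketch of the two curriculum schedules: a two-stage curriculum
# with direct fine-tuning, and a progressive curriculum that expands the
# training set over several stages before fine-tuning.

def two_stage_curriculum(synthetic, real_ground):
    """Stage 1: train on synthetic aerial clips; stage 2: fine-tune on real ground clips."""
    return [list(synthetic), list(real_ground)]

def progressive_curriculum(synthetic, real_ground, steps=3):
    """Grow the training set stage by stage, admitting an increasing
    fraction of real ground clips before the final fine-tuning."""
    stages = []
    for i in range(1, steps + 1):
        n_real = int(len(real_ground) * i / steps)
        stages.append(list(synthetic) + list(real_ground[:n_real]))
    return stages

synthetic = [f"syn_clip_{i}" for i in range(4)]
real = [f"real_clip_{i}" for i in range(3)]
print(len(two_stage_curriculum(synthetic, real)))                 # 2
print([len(s) for s in progressive_curriculum(synthetic, real)])  # [5, 6, 7]
```

The efficiency gains reported above come from such staging: later stages start from an already-adapted model, so fewer iterations are spent on the final real-data fine-tuning.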
Motion representation plays an important role in video understanding and has many applications, including action recognition, robotics, and autonomous guidance. Recently, transformer networks have proven effective in many applications through their self-attention mechanisms. In this study, we introduce a new two-stream transformer video classifier that extracts spatio-temporal information from content and from optical flow representing movement information. The proposed model identifies self-attention features across the joint optical-flow and temporal-frame domain and represents their relationships within the transformer encoder mechanism. Experimental results show that our proposed methodology provides excellent classification results on three well-known video datasets of human activities.
https://arxiv.org/abs/2601.14086
Unsupervised video class incremental learning (uVCIL) represents an important learning paradigm for learning video information without forgetting, and without considering any data labels. Prior approaches have focused on supervised class-incremental learning, relying on knowledge of labels and task boundaries, which is costly, requires human annotation, or is simply not a realistic option. In this paper, we propose a simple yet effective approach to address uVCIL. We first consider a deep feature extractor network, providing a set of representative video features during each task without assuming any class or task information. We then progressively build a series of deep clusters from the extracted features. During successive task learning, the model updated from the previous task is used as an initial state in order to transfer knowledge to the current learning task. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-Something V2, by ignoring the labels from the supervised setting. Our approach significantly outperforms other baselines on all datasets.
https://arxiv.org/abs/2601.14069
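The cluster-then-transfer loop described above can be illustrated with a warm-started k-means sketch, where each new task initializes its clusters from the previous task's state. The feature dimensions, cluster count, and synthetic features below are assumptions for illustration; the paper's actual deep clustering is more involved:

```python
import numpy as np

def kmeans(features, k, init=None, iters=20, seed=0):
    """Plain k-means; `init` lets a new task warm-start from the
    centroids of the previous task (the knowledge-transfer step)."""
    rng = np.random.default_rng(seed)
    centers = (features[rng.choice(len(features), k, replace=False)].copy()
               if init is None else init.copy())
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(1)
task1 = rng.normal(0.0, 1.0, (100, 8))   # extracted features from the first task
centers1, _ = kmeans(task1, k=5)
task2 = rng.normal(0.5, 1.0, (100, 8))   # features arriving with the next task
centers2, labels2 = kmeans(task2, k=5, init=centers1)
print(centers2.shape, labels2.shape)     # (5, 8) (100,)
```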
State-space models (SSMs) have become a cornerstone for unraveling brain dynamics, revealing how latent neural states evolve over time and give rise to observed signals. By combining the flexibility of deep learning with the principled dynamical structure of SSMs, recent studies have achieved powerful fits to functional neuroimaging data. However, most existing approaches still view the brain as a set of loosely connected regions or impose oversimplified network priors, falling short of a truly holistic and self-organized dynamical system perspective. Brain functional connectivity (FC) at each time point naturally forms a symmetric positive definite (SPD) matrix, which resides on a curved Riemannian manifold rather than in Euclidean space. Capturing the trajectories of these SPD matrices is key to understanding how coordinated networks support cognition and behavior. To this end, we introduce GeoDynamics, a geometric state-space neural network that tracks latent brain-state trajectories directly on the high-dimensional SPD manifold. GeoDynamics embeds each connectivity matrix into a manifold-aware recurrent framework, learning smooth and geometry-respecting transitions that reveal task-driven state changes and early markers of Alzheimer's disease, Parkinson's disease, and autism. Beyond neuroscience, we validate GeoDynamics on human action recognition benchmarks (UTKinect, Florence, HDM05), demonstrating its scalability and robustness in modeling complex spatiotemporal dynamics across diverse domains.
https://arxiv.org/abs/2601.13570
Skeleton-based action recognition leverages human pose keypoints to categorize human actions, offering superior generalization and interoperability compared to regular end-to-end action recognition. Existing solutions use RGB cameras to annotate skeletal keypoints, but their performance declines in dark environments and raises privacy concerns, limiting their use in smart homes and hospitals. This paper explores non-invasive wireless sensors, i.e., LiDAR and mmWave, as a feasible alternative that mitigates these challenges. Two problems are addressed: (1) insufficient data in the wireless sensor modality to train an accurate skeleton estimation model, and (2) skeletal keypoints derived from wireless sensors are noisier than those from RGB, causing great difficulties for subsequent action recognition models. Our work, SkeFi, bridges these gaps through a novel cross-modal knowledge transfer method that draws on the data-rich RGB modality. We propose an enhanced Temporal Correlation Adaptive Graph Convolution (TC-AGC) with frame-interactive enhancement to overcome the noise from missing or inconsecutive frames. Additionally, our research underscores the effectiveness of enhancing multiscale temporal modeling through dual temporal convolution. By integrating TC-AGC with temporal modeling for cross-modal transfer, our framework can extract accurate poses and actions from noisy wireless sensors. Experiments demonstrate that SkeFi achieves state-of-the-art performance on mmWave and LiDAR. The code is available at this https URL.
https://arxiv.org/abs/2601.12432
In driver activity monitoring, movements are mostly limited to the upper body, which makes many actions look similar. To tell these actions apart, humans often rely on the objects the driver is using, such as holding a phone versus gripping the steering wheel. However, most existing driver-monitoring datasets lack accurate object-location annotations or do not link objects to their associated actions, leaving a critical gap for reliable action recognition. To address this, we introduce the Driver Action with Object Synergy (DAOS) dataset, comprising 9,787 video clips annotated with 36 fine-grained driver actions and 15 object classes, totaling more than 2.5 million corresponding object instances. DAOS offers multi-modal, multi-view data (RGB, IR, and depth) from front, face, left, and right perspectives. Although DAOS captures a wide range of cabin objects, only a few are directly relevant to each action for prediction, so focusing on task-specific human-object relations is essential. To tackle this challenge, we propose the Action-Object-Relation Network (AOR-Net). AOR-Net comprehends complex driver actions through multi-level reasoning and a chain-of-action prompting mechanism that models the logical relationships among actions, objects, and their relations. Additionally, the Mixture of Thoughts module is introduced to dynamically select essential knowledge at each stage, enhancing robustness in object-rich and object-scarce conditions. Extensive experiments demonstrate that our model outperforms other state-of-the-art methods on various datasets.
https://arxiv.org/abs/2601.11990
Human action recognition has become an important research focus in computer vision due to its wide range of applications. 3D ResNet-based CNN models, particularly MC3, R3D, and R(2+1)D, use different convolutional filters to extract spatiotemporal features. This paper investigates the impact of reducing the knowledge captured from temporal data while increasing frame resolution. To set up this experiment, we first created designs similar to the three originals, but with a dropout layer added before the final classifier. We then developed ten new variants of each of these three designs. The variants incorporate special attention blocks within their architecture, such as the convolutional block attention module (CBAM) and temporal convolutional networks (TCN), in addition to multi-headed and channel attention mechanisms. The purpose is to observe how much each of these blocks influences the performance of the temporally restricted models. Testing all models on UCF101 yields an accuracy of 88.98% for the variant with multi-headed attention added to the modified R(2+1)D. This paper demonstrates the significance of the missing temporal features for the performance of the newly created higher-resolution models. The variants behaved differently in class-level accuracy, despite contributing similarly to overall performance.
https://arxiv.org/abs/2601.10854
As robots become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVa-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model's outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores across both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 for action and 0.414 for emotion, while LLaVa-1.5 showed the lowest overall performance, with F1-scores of 0.466 for action and 0.461 for emotion. Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability.
https://arxiv.org/abs/2601.10835
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
https://arxiv.org/abs/2601.10592
Anticipating the intentions of Vulnerable Road Users (VRUs) is a critical challenge for safe autonomous driving (AD) and mobile robotics. While current research predominantly focuses on pedestrian crossing behaviors from a vehicle's perspective, interactions within dense shared spaces remain underexplored. To bridge this gap, we introduce FUSE-Bike, the first fully open perception platform of its kind. Equipped with two LiDARs, a camera, and GNSS, it facilitates high-fidelity, close-range data capture directly from a cyclist's viewpoint. Leveraging this platform, we present BikeActions, a novel multi-modal dataset comprising 852 annotated samples across 5 distinct action classes, specifically tailored to improve VRU behavior modeling. We establish a rigorous benchmark by evaluating state-of-the-art graph convolution and transformer-based models on our publicly released data splits, establishing the first performance baselines for this challenging task. We release the full dataset together with data curation tools, the open hardware design, and the benchmark code to foster future research in VRU action understanding under this https URL.
https://arxiv.org/abs/2601.10521
3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
https://arxiv.org/abs/2601.09698
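The cycle-consistency constraint discussed in the abstract above can be illustrated with a toy check over three views: pairwise matches a→b, b→c, c→a must close every cycle, and a single spurious association breaks it. The dictionary-based match representation is a simplification assumed here, not the paper's formulation:

```python
def cycle_consistent(match_ab, match_bc, match_ca):
    """Check that pairwise matches across three views close every cycle:
    detection a -> b -> c must map back to a."""
    for a, b in match_ab.items():
        c = match_bc.get(b)
        if c is None or match_ca.get(c) != a:
            return False
    return True

# Person 0 in view A matches 1 in view B and 2 in view C, and back to 0: OK.
print(cycle_consistent({0: 1}, {1: 2}, {2: 0}))  # True
# One spurious association breaks the cycle and would propagate errors.
print(cycle_consistent({0: 1}, {1: 2}, {2: 9}))  # False
```

Treating this as a soft constraint over many views is what becomes brittle; formulating correspondence as hypergraph partitioning enforces such consistency jointly rather than pair by pair.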
Face video anonymization aims to preserve privacy while still allowing videos to be analyzed in a number of downstream computer vision tasks such as expression recognition, people tracking, and action recognition. We propose here a novel unified framework, referred to as AnonNET, streamlined to de-identify facial videos while preserving the age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces with a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate the de-identified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the VoxCeleb2, CelebV-HQ, and HDTF datasets, which include diverse facial dynamics, demonstrate the effectiveness of AnonNET in obfuscating identity while retaining visual realism and temporal consistency. The code of AnonNET will be publicly released.
https://arxiv.org/abs/2601.11635
In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. However, most contrastive paradigms are inherently discriminative and often struggle to capture the variability and uncertainty intrinsic to human motion. To address this issue, we propose a variational contrastive learning framework that integrates probabilistic latent modeling with contrastive self-supervised learning. This formulation enables the learning of structured and semantically meaningful representations that generalize across different datasets and supervision levels. Extensive experiments on three widely used skeleton-based action recognition benchmarks show that our proposed method consistently outperforms existing approaches, particularly in low-label regimes. Moreover, qualitative analyses show that, compared to other methods, the features learned by our method are more relevant to the motion and sample characteristics, with more focus on important skeleton joints.
https://arxiv.org/abs/2601.07666
From Vision-Language-Action (VLA) systems to robotics, existing egocentric datasets primarily focus on action recognition tasks, while largely overlooking the inherent role of motion analysis in sports and other fast-movement scenarios. To bridge this gap, we propose a real-time motion focus recognition method that estimates the subject's locomotion intention from any egocentric video. Our approach leverages a foundation model for camera pose estimation and introduces system-level optimizations to enable efficient and scalable inference. Evaluated on a collected egocentric action dataset, our method achieves real-time performance with manageable memory consumption through a sliding batch inference strategy. This work makes motion-centric analysis practical for edge deployment and offers a complementary perspective to existing egocentric studies on sports and fast-movement activities.
https://arxiv.org/abs/2601.07154
Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering "where" and "who". However, they often function as autistic observers, capable of tracing geometric paths but blind to the semantic "what" and "why" behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose LLMTrack, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples strong localization from deep understanding, utilizing Grounding DINO as the eyes and the LLaVA-OneVision multimodal large model as the brain. We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level contexts, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy, Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA, to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.
https://arxiv.org/abs/2601.06550
Understanding student behavior in the classroom is essential to improve both pedagogical quality and student engagement. Existing methods for predicting student engagement typically require substantial annotated data to model the diversity of student behaviors, yet privacy concerns often restrict researchers to their own proprietary datasets. Moreover, the classroom context, reflected in peers' actions, is typically ignored. To address the aforementioned limitations, we propose a novel three-stage framework for video-based student engagement measurement. First, we explore few-shot adaptation of a vision-language model (VLM) for student action recognition, fine-tuning it to distinguish among action categories with only a few training samples. Second, to handle continuous and unpredictable student actions, we use a sliding temporal window to divide each student's 2-minute video into non-overlapping segments. Each segment is assigned an action category by the fine-tuned VLM, generating a sequence of action predictions. Finally, we leverage a large language model to classify this entire sequence of actions, together with the classroom context, as belonging to an engaged or disengaged student. Experimental results demonstrate the effectiveness of the proposed approach in identifying student engagement.
https://arxiv.org/abs/2601.06394
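The sliding-window stage described above is straightforward to sketch. The window length, frame rate, action labels, and the majority-vote stand-in for the LLM classification stage are all hypothetical simplifications:

```python
def segment_windows(n_frames, fps=30, window_sec=10):
    """Split a clip into non-overlapping temporal windows of frame indices."""
    step = fps * window_sec
    return [(start, min(start + step, n_frames)) for start in range(0, n_frames, step)]

def classify_engagement(actions, disengaged=("sleeping", "using_phone")):
    """Toy stand-in for the LLM stage: majority vote over per-window actions."""
    hits = sum(a in disengaged for a in actions)
    return "disengaged" if hits > len(actions) / 2 else "engaged"

windows = segment_windows(n_frames=2 * 60 * 30)   # a 2-minute clip at 30 fps
print(len(windows))                               # 12 ten-second windows
per_window_actions = ["listening", "writing", "using_phone", "listening"]
print(classify_engagement(per_window_actions))    # engaged
```

In the actual framework, each window is labeled by the fine-tuned VLM and the whole label sequence, plus classroom context, is passed to an LLM rather than to a simple vote.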
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
https://arxiv.org/abs/2601.04778
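MixDPO builds on Direct Preference Optimization. A minimal per-pair DPO loss, in the standard published form rather than the paper's exact MixDPO objective, and with made-up log-probabilities, looks like this:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))
    for a preferred (w) vs. dispreferred (l) response."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))  # = -log sigmoid(margin)

# The policy already favors the preferred (counterfactually correct) caption,
# so the loss is below log 2 (the zero-margin value).
print(round(dpo_loss(-1.0, -3.0, -2.0, -2.0), 3))  # 0.598
```

With counterfactual videos as the visual hard negatives, the same preference structure applies to video pairs as well as text pairs, which is what the unified objective exploits.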
We present a novel approach for egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.
https://arxiv.org/abs/2601.03667
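The random-point initialization and trajectory-to-feature step described above might be sketched as follows; the displacement-based featurization and the fake drifting trajectories are assumed simplifications standing in for CoTracker's actual output, not the paper's exact input encoding:

```python
import numpy as np

def init_random_points(h, w, n_points, seed=0):
    """Sample random (x, y) image points to hand to a point tracker."""
    rng = np.random.default_rng(seed)
    return np.stack([rng.uniform(0, w, n_points), rng.uniform(0, h, n_points)], axis=1)

def tracks_to_features(tracks):
    """Turn (T, N, 2) point trajectories into per-frame motion tokens:
    each point's displacement from its position in the first frame."""
    return (tracks - tracks[:1]).reshape(tracks.shape[0], -1)

points = init_random_points(h=224, w=224, n_points=16)       # (16, 2)
# Fake trajectories standing in for tracker output: 1 px drift per frame.
tracks = points[None, :, :] + np.arange(8)[:, None, None]
features = tracks_to_features(tracks)
print(features.shape)  # (8, 32): one 32-dim motion token per frame
```

Tokens like these can be concatenated with frame embeddings as input to a Transformer-based recognizer.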
Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
https://arxiv.org/abs/2601.00369
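The Noisy-OR fusion named in component (2) above follows the classical formula P(y) = 1 - prod_i (1 - p_i): the action is detected unless every stream misses it. A minimal sketch, with made-up per-stream probabilities:

```python
def noisy_or(stream_probs):
    """Noisy-OR fusion of per-stream detection probabilities:
    P(y) = 1 - prod_i (1 - p_i)."""
    miss = 1.0
    for p in stream_probs:
        miss *= (1.0 - p)
    return 1.0 - miss

# A weakly confident body stream and a strongly confident hand stream
# fuse into a higher combined probability than either stream alone.
print(round(noisy_or([0.3, 0.8]), 2))  # 0.86
```

One appeal of this rule for reliability-aware learning is that a single confident stream dominates the fusion, so neither stream needs explicit confidence supervision to be trusted when it is sure.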