Abstract
Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues, such as joint actuation forces, that are fundamental in biomechanics. This gap motivates our study: whether and when physically inferred forces enhance motion understanding. By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on three major tasks: gait recognition, action recognition, and fine-grained video captioning. Across eight benchmarks, incorporating forces yields consistent performance gains. For example, on CASIA-B, Rank-1 gait recognition accuracy improves from 89.52% to 90.39% (+0.87), with larger gains under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN gains +2.00% on Penn Action, while high-exertion classes such as punching/slapping improve by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rises from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
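The quoted deltas mix percentages and raw scores, so it can help to separate absolute from relative gains. The following is a minimal sketch that recomputes both from the before/after figures in the abstract; the `reported` dictionary and the `gains` helper are illustrative names, not part of the paper, and the sketch assumes the reported deltas are plain differences between the two scores.

```python
# Baseline and force-augmented scores as quoted in the abstract.
# The dictionary layout and labels are assumptions made for illustration.
reported = {
    "CASIA-B Rank-1 (%)": (89.52, 90.39),
    "Gait3D (%, metric not named in the abstract)": (46.0, 47.3),
    "Qwen2.5-VL ROUGE-L": (0.310, 0.339),
}


def gains(baseline, with_forces):
    """Return (absolute gain, relative gain in percent of the baseline)."""
    absolute = with_forces - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, (baseline, with_forces) in reported.items():
    absolute, relative = gains(baseline, with_forces)
    print(f"{name}: {absolute:+.3f} absolute, {relative:+.2f}% relative")
```

Read this way, the CASIA-B and Gait3D figures are differences in percentage points, while the ROUGE-L change is a small absolute difference that is a larger fraction of its baseline.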
URL
https://arxiv.org/abs/2512.20451