In this paper, we present a texture-independent approach to estimate and track 3D joint positions of multiple pigeons. For this purpose, we build upon the existing 3D-MuPPET framework, which estimates and tracks the 3D poses of up to 10 pigeons using a multi-view camera setup. We extend this framework with a segmentation method that generates silhouettes of the individuals, which are then used to estimate 2D keypoints. Following 3D-MuPPET, these 2D keypoints are triangulated to infer 3D poses, and identities are matched in the first frame and tracked in 2D across subsequent frames. Our proposed texture-independent approach achieves accuracy comparable to the original texture-dependent 3D-MuPPET framework. Additionally, we explore our approach's applicability to other bird species: we infer the 2D joint positions of four bird species without additional fine-tuning of the model trained on pigeons and obtain promising preliminary results. We therefore believe our approach provides a solid foundation and can inspire the development of more robust and accurate texture-independent pose estimation frameworks.
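Below is a minimal sketch of the triangulation step described above: per-view 2D keypoints from calibrated cameras are lifted to a 3D joint position via linear (DLT) triangulation. The camera matrices and coordinates are illustrative placeholders, not values from the 3D-MuPPET setup.

```python
# Illustrative sketch only: cameras and points below are synthetic placeholders.
import numpy as np

def triangulate_point(projections, points_2d):
    """projections: list of 3x4 camera matrices; points_2d: list of (u, v) pixels."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous 3D point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous least-squares solution via SVD (last right singular vector).
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Example with two synthetic views of the point (0.1, 0.2, 2.0).
X_true = np.array([0.1, 0.2, 2.0, 1.0])
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                   # reference camera
P2 = np.hstack([np.eye(3), np.array([[-0.3], [0.0], [0.0]])])   # translated camera
uv = lambda P: (P @ X_true)[:2] / (P @ X_true)[2]
print(triangulate_point([P1, P2], [uv(P1), uv(P2)]))             # ~ [0.1, 0.2, 2.0]
```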
https://arxiv.org/abs/2505.16633
We present GMatch, a learning-free feature matcher designed for robust 6DoF object pose estimation, addressing common local ambiguities in sparse feature matching. Unlike traditional methods that rely solely on descriptor similarity, GMatch performs a guided, incremental search, enforcing SE(3)-invariant geometric consistency throughout the matching process. It leverages a provably complete set of geometric features that uniquely determine 3D keypoint configurations, ensuring globally consistent correspondences without the need for training or GPU support. When combined with classical descriptors such as SIFT, GMatch-SIFT forms a general-purpose pose estimation pipeline that offers strong interpretability and generalization across diverse objects and scenes. Experiments on the HOPE dataset show that GMatch outperforms both traditional and learning-based matchers, with GMatch-SIFT achieving or surpassing the performance of instance-level pose networks. On the YCB-Video dataset, GMatch-SIFT demonstrates high accuracy and low variance on texture-rich objects. These results not only validate the effectiveness of GMatch-SIFT for object pose estimation but also highlight the broader applicability of GMatch as a general-purpose feature matcher. Code will be released upon acceptance.
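The sketch below illustrates the general idea of guided, incremental matching under an SE(3)-invariant consistency check: candidate correspondences are accepted greedily only if their pairwise 3D distances agree with the already-accepted set. It is a simplified illustration, not the exact geometric feature set or search strategy of GMatch.

```python
# Illustrative sketch only: a simplified pairwise-distance consistency check,
# not GMatch's provably complete geometric feature set.
import numpy as np

def guided_match(desc_a, pts_a, desc_b, pts_b, dist_tol=0.01):
    """Greedily grow a correspondence set that keeps pairwise 3D distances consistent."""
    sim = desc_a @ desc_b.T                       # descriptor similarity (assumed normalized)
    candidates = np.dstack(np.unravel_index(np.argsort(-sim, axis=None), sim.shape))[0]
    matches = []
    for i, j in candidates:
        if any(i == m[0] or j == m[1] for m in matches):
            continue                              # enforce one-to-one matching
        # Accept only if distances to all previously accepted keypoints agree
        # (pairwise distances are invariant under any SE(3) transform).
        ok = all(abs(np.linalg.norm(pts_a[i] - pts_a[mi]) -
                     np.linalg.norm(pts_b[j] - pts_b[mj])) < dist_tol
                 for mi, mj in matches)
        if ok:
            matches.append((i, j))
    return matches

# Tiny example: identical point sets with random unit descriptors.
rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 3))
desc = rng.normal(size=(5, 16))
desc /= np.linalg.norm(desc, axis=1, keepdims=True)
print(guided_match(desc, pts, desc, pts))          # recovers the pairs (i, i)
```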
https://arxiv.org/abs/2505.16144
Robot manipulation learning from human demonstrations offers a rapid means to acquire skills but often lacks generalization across diverse scenes and object placements. This limitation hinders real-world applications, particularly in complex tasks requiring dexterous manipulation. The Vision-Language-Action (VLA) paradigm leverages large-scale data to enhance generalization; however, due to data scarcity, VLA performance remains limited. In this work, we introduce Object-Focus Actor (OFA), a novel, data-efficient approach for generalized dexterous manipulation. OFA exploits the consistent end trajectories observed in dexterous manipulation tasks, allowing for efficient policy training. Our method employs a hierarchical pipeline: object perception and pose estimation, pre-manipulation pose arrival, and OFA policy execution. This process ensures that manipulation remains focused and efficient, even across varied backgrounds and positional layouts. Comprehensive real-world experiments across seven tasks demonstrate that OFA significantly outperforms baseline methods in both positional and background generalization tests. Notably, OFA achieves robust performance with only 10 demonstrations, highlighting its data efficiency.
https://arxiv.org/abs/2505.15098
We introduce a unified approach to forecast the dynamics of human keypoints along with the motion trajectory based on a short sequence of input poses. While many studies address either full-body pose prediction or motion trajectory prediction, only a few attempt to merge them. We propose a motion transformation technique to simultaneously predict full-body pose and trajectory keypoints in a global coordinate frame. We utilize an off-the-shelf 3D human pose estimation module, a graph attention network to encode the skeleton structure, and a compact, non-autoregressive transformer suitable for real-time motion prediction in human-robot interaction and human-aware navigation. We introduce a human navigation dataset, ``DARKO'', with a specific focus on navigational activities that are relevant for human-aware mobile robot navigation. We perform extensive evaluation on Human3.6M, CMU-Mocap, and our DARKO dataset. In comparison to prior work, we show that our approach is compact, real-time, and accurate in predicting human navigation motion across all datasets. Result animations, our dataset, and code will be available at this https URL
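The sketch below shows what a compact, non-autoregressive transformer forecaster can look like: the observed keypoint sequence is encoded once, and all future frames are decoded in a single pass from learned query tokens. Dimensions and module layout are assumptions, and the graph attention skeleton encoder used in the paper is omitted.

```python
# Illustrative sketch only: shapes and layers are assumptions, and the paper's
# graph-attention skeleton encoder is omitted.
import torch
import torch.nn as nn

class NonAutoregressiveForecaster(nn.Module):
    def __init__(self, n_joints=17, d_model=128, t_in=10, t_out=20):
        super().__init__()
        self.embed = nn.Linear(n_joints * 3, d_model)
        self.pos = nn.Parameter(torch.zeros(t_in, d_model))          # learned positions
        self.future_queries = nn.Parameter(torch.zeros(t_out, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, n_joints * 3)

    def forward(self, past):                       # past: (B, t_in, n_joints*3)
        memory = self.encoder(self.embed(past) + self.pos)
        queries = self.future_queries.expand(past.shape[0], -1, -1)
        # All future frames are predicted in one pass, with no autoregressive loop.
        return self.head(self.decoder(queries, memory))              # (B, t_out, n_joints*3)

model = NonAutoregressiveForecaster()
future = model(torch.randn(2, 10, 17 * 3))
print(future.shape)                                # torch.Size([2, 20, 51])
```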
https://arxiv.org/abs/2505.14866
Currently, almost all state-of-the-art novel view synthesis and reconstruction models rely on calibrated cameras or additional geometric priors for training. These prerequisites significantly limit their applicability to massive uncalibrated data. To alleviate this requirement and unlock the potential for self-supervised training on large-scale uncalibrated videos, we propose a novel two-stage strategy to train a view synthesis model from only raw video frames or multi-view images, without providing camera parameters or other priors. In the first stage, we learn to reconstruct the scene implicitly in a latent space without relying on any explicit 3D representation. Specifically, we predict per-frame latent camera and scene context features, and employ a view synthesis model as a proxy for explicit rendering. This pretraining stage substantially reduces the optimization complexity and encourages the network to learn the underlying 3D consistency in a self-supervised manner. However, the learned latent camera and implicit scene representation still exhibit a large gap from the real 3D world. To reduce this gap, we introduce a second training stage that explicitly predicts 3D Gaussian primitives. We additionally apply an explicit Gaussian Splatting rendering loss and a depth projection loss to align the learned latent representations with physically grounded 3D geometry. In this way, Stage 1 provides a strong initialization and Stage 2 enforces 3D consistency - the two stages are complementary and mutually beneficial. Extensive experiments demonstrate the effectiveness of our approach, achieving high-quality novel view synthesis and accurate camera pose estimation compared to methods that employ supervision with calibration, pose, or depth information. The code is available at this https URL.
https://arxiv.org/abs/2505.13440
Broader access to high-quality movement analysis could greatly benefit movement science and rehabilitation, such as allowing more detailed characterization of movement impairments and responses to interventions, or even enabling early detection of new neurological conditions or fall risk. While emerging technologies are making it easier to capture kinematics with biomechanical models, or how joint angles change over time, inferring the underlying physics that gives rise to these movements, including ground reaction forces, joint torques, or even muscle activations, is still challenging. Here we explore whether imitation learning, applied to a biomechanical model using a large dataset of movements from able-bodied and impaired individuals, can learn to compute these inverse dynamics. Although imitation learning in human pose estimation has seen great interest in recent years, our work differs in several ways: we focus on using an accurate biomechanical model instead of models adopted for computer vision, we test it on a dataset that contains participants with impaired movements, we report detailed tracking metrics relevant for the clinical measurement of movement, including joint angles and ground contact events, and finally we apply imitation learning to a muscle-driven neuromusculoskeletal model. We show that our imitation learning policy, KinTwin, can accurately replicate the kinematics of a wide range of movements, including those with assistive devices or therapist assistance, and that it can infer clinically meaningful differences in joint torques and muscle activations. Our work demonstrates the potential of imitation learning to enable high-quality movement analysis in clinical practice.
https://arxiv.org/abs/2505.13436
Detecting an athlete's position on a route and identifying hold usage are crucial in various climbing-related applications. However, to our knowledge, no climbing dataset with detailed hold usage annotations exists. To address this issue, we introduce a dataset of 22 annotated climbing videos, providing ground-truth labels for hold locations, usage order, and time of use. Furthermore, we explore the application of keypoint-based 2D pose-estimation models for detecting hold usage in sport climbing. We determine usage by analyzing the keypoints of certain joints and their overlap with climbing holds. We evaluate multiple state-of-the-art models and analyze their accuracy on our dataset, identifying and highlighting climbing-specific challenges. Our dataset and results highlight key challenges in climbing-specific pose estimation and establish a foundation for future research toward AI-assisted systems for sport climbing.
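The sketch below illustrates the keypoint-hold overlap idea: a hold counts as used once a hand or foot keypoint stays inside its bounding box for a few consecutive frames. The joint names, data layout, and frame threshold are assumptions, not the paper's exact criterion.

```python
# Illustrative sketch only: thresholds, joint names and data layout are assumptions.
import numpy as np

def point_in_box(p, box):
    x, y = p
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def detect_hold_usage(keypoints, holds, min_frames=5):
    """keypoints: dict joint -> (T, 2) pixel positions; holds: list of (x0, y0, x1, y1) boxes."""
    usage = {}                                     # hold index -> first frame of confirmed use
    streak = np.zeros(len(holds), dtype=int)
    T = next(iter(keypoints.values())).shape[0]
    for t in range(T):
        for h, box in enumerate(holds):
            touched = any(point_in_box(kp[t], box) for kp in keypoints.values())
            streak[h] = streak[h] + 1 if touched else 0   # consecutive-frame counter
            if streak[h] == min_frames and h not in usage:
                usage[h] = t - min_frames + 1
    return usage

# Toy example: the left hand rests on hold 0 from frame 3 onward.
T = 20
left_hand = np.tile([50.0, 120.0], (T, 1)); left_hand[:3] = [0.0, 0.0]
holds = [(40, 110, 60, 130), (200, 300, 220, 320)]
print(detect_hold_usage({"left_hand": left_hand}, holds))   # {0: 3}
```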
https://arxiv.org/abs/2505.12854
The dynamic movement of the human body presents a fundamental challenge for human pose estimation and body segmentation. State-of-the-art approaches primarily rely on combining keypoint heatmaps with segmentation masks but often struggle in scenarios involving overlapping joints or rapidly changing poses during instance-level segmentation. To address these limitations, we propose Keypoints as Dynamic Centroid (KDC), a new centroid-based representation for unified human pose estimation and instance-level segmentation. KDC adopts a bottom-up paradigm to generate keypoint heatmaps for both easily distinguishable and complex keypoints and improves keypoint detection and confidence scores by introducing KeyCentroids using a keypoint disk. It leverages high-confidence keypoints as dynamic centroids in the embedding space to generate MaskCentroids, allowing for swift clustering of pixels to specific human instances during rapid body movements in live environments. Our experimental evaluations on the CrowdPose, OCHuman, and COCO benchmarks demonstrate KDC's effectiveness and generalizability in challenging scenarios in terms of both accuracy and runtime performance. The implementation is available at: this https URL.
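The sketch below illustrates the clustering step behind MaskCentroids: each foreground pixel embedding is assigned to the nearest high-confidence keypoint centroid, yielding per-instance masks. The embedding size, confidence threshold, and toy data are placeholders; the actual KDC heads and representations are richer than this.

```python
# Illustrative sketch only: a nearest-centroid assignment in embedding space,
# not the full KDC representation.
import numpy as np

def assign_pixels_to_instances(pixel_emb, centroid_emb, centroid_conf,
                               centroid_instance, conf_thresh=0.5):
    """pixel_emb: (N, D); centroid_emb: (K, D); centroid_conf: (K,);
    centroid_instance: (K,) instance id of each keypoint centroid."""
    keep = centroid_conf >= conf_thresh            # only high-confidence keypoints act as centroids
    centers, owners = centroid_emb[keep], centroid_instance[keep]
    # Distance from every pixel to every kept centroid, then nearest-centroid assignment.
    d = np.linalg.norm(pixel_emb[:, None, :] - centers[None, :, :], axis=-1)
    return owners[np.argmin(d, axis=1)]            # (N,) instance id per pixel

# Toy example: two instances whose pixel embeddings cluster around 0 and around 5.
rng = np.random.default_rng(1)
pixels = np.concatenate([rng.normal(0, 0.1, (4, 8)), rng.normal(5, 0.1, (4, 8))])
centroids = np.stack([np.zeros(8), np.full(8, 5.0), np.full(8, 2.5)])
labels = assign_pixels_to_instances(pixels, centroids,
                                    centroid_conf=np.array([0.9, 0.8, 0.2]),
                                    centroid_instance=np.array([0, 1, 1]))
print(labels)                                      # [0 0 0 0 1 1 1 1]
```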
https://arxiv.org/abs/2505.12130
Expressive human pose and shape (EHPS) estimation is vital for digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention to potential security vulnerabilities. Current adversarial attacks on EHPS models often require white-box access (e.g., model details or gradients) or generate visually conspicuous perturbations, limiting their practicality and their ability to expose real-world security threats. To address these limitations, we propose a novel Unnoticeable Black-Box Attack (UBA) against EHPS models. UBA leverages the latent-space representations of natural images to generate an optimal adversarial noise pattern and iteratively refine its attack potency along an optimized direction in digital space. Crucially, this process relies solely on querying the model's output, requiring no internal knowledge of the EHPS architecture, while guiding the noise optimization toward greater stealth and effectiveness. Extensive experiments and visual analyses demonstrate the superiority of UBA. Notably, UBA increases the pose estimation errors of EHPS models by 17.27%-58.21% on average, revealing critical vulnerabilities. These findings underscore the urgent need to address and mitigate security risks associated with digital human generation systems.
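The sketch below illustrates the query-only principle: a perturbation is refined by accepting random steps that increase the victim model's (queried) pose error while staying inside a norm budget. The latent-space noise construction of UBA is not reproduced here, and the victim model is a toy stand-in.

```python
# Illustrative sketch only: a generic query-based random-search attack loop,
# not UBA's latent-space noise generation. The "pose estimator" is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))                      # toy "pose estimator": linear map image -> 10 params

def victim_pose(image):
    return W @ image                               # stand-in for querying the black-box model

def pose_error(image, clean_pose):
    return np.linalg.norm(victim_pose(image) - clean_pose)

def black_box_attack(image, eps=0.05, step=0.01, iters=200):
    clean_pose = victim_pose(image)
    delta = np.zeros_like(image)
    best = 0.0
    for _ in range(iters):
        candidate = np.clip(delta + step * rng.choice([-1.0, 1.0], size=image.shape),
                            -eps, eps)             # stay inside the perturbation budget
        err = pose_error(image + candidate, clean_pose)   # one query per candidate
        if err > best:                             # keep only steps that increase the pose error
            best, delta = err, candidate
    return delta, best

img = rng.normal(size=64)
delta, err = black_box_attack(img)
print(round(err, 3), np.abs(delta).max() <= 0.05)
```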
https://arxiv.org/abs/2505.12009
For the elderly population, falls pose a serious and increasing risk of injury and loss of independence. To address this challenge, we present ElderFallGuard: A Computer Vision Based IoT Solution for Elderly Fall Detection and Notification, a cutting-edge, non-invasive system intended for real-time fall detection and quick caregiver alerts. Our approach leverages the power of computer vision, utilizing MediaPipe for accurate human pose estimation from standard video streams. We developed a custom dataset comprising 7200 samples across 12 distinct human poses to train and evaluate various machine learning classifiers, with Random Forest ultimately selected for its superior performance. ElderFallGuard employs a specific detection logic, identifying a fall when a designated prone pose ("Pose6") is held for over 3 seconds coupled with a significant drop in motion detected for more than 2 seconds. Upon confirmation, the system instantly dispatches an alert, including a snapshot of the event, to a designated Telegram group via a custom bot, incorporating cooldown logic to prevent notification overload. Rigorous testing on our dataset demonstrated exceptional results, achieving 100% accuracy, precision, recall, and F1-score. ElderFallGuard offers a promising, vision-based IoT solution to enhance elderly safety and provide peace of mind for caregivers through intelligent, timely alerts.
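The sketch below encodes the stated detection logic: a fall is flagged when the prone class ("Pose6") has been held for more than 3 seconds and motion has stayed low for more than 2 seconds, after which an alert fires subject to a cooldown. The frame rate, motion threshold, and the placeholder used in place of the Telegram bot call are assumptions.

```python
# Illustrative sketch only: frame rate and motion threshold are assumptions, and
# the Telegram bot call is replaced with a placeholder print.
import time

class FallDetector:
    def __init__(self, fps=10, hold_s=3.0, low_motion_s=2.0, cooldown_s=60.0,
                 motion_thresh=0.02):
        self.fps = fps
        self.hold_frames = int(hold_s * fps)
        self.low_motion_frames = int(low_motion_s * fps)
        self.cooldown_s = cooldown_s
        self.motion_thresh = motion_thresh
        self.pose_streak = 0
        self.low_motion_streak = 0
        self.last_alert = -1e9

    def send_alert(self):
        print("ALERT: possible fall detected")     # placeholder for the Telegram bot dispatch

    def update(self, pose_label, motion_magnitude, now=None):
        now = time.time() if now is None else now
        self.pose_streak = self.pose_streak + 1 if pose_label == "Pose6" else 0
        self.low_motion_streak = (self.low_motion_streak + 1
                                  if motion_magnitude < self.motion_thresh else 0)
        fall = (self.pose_streak > self.hold_frames
                and self.low_motion_streak > self.low_motion_frames)
        if fall and now - self.last_alert > self.cooldown_s:   # cooldown prevents alert overload
            self.last_alert = now
            self.send_alert()
        return fall

detector = FallDetector()
for frame in range(60):                            # 6 s of a prone, motionless subject -> one alert
    detector.update("Pose6", motion_magnitude=0.0, now=frame / detector.fps)
```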
https://arxiv.org/abs/2505.11845
Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored for surgical instruments in RMIS, leaving a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models such as FoundationPose and SAM-6D. We advance these models by incorporating vision-based depth estimation using the RAFT-Stereo method, yielding robust depth in reflective and textureless environments. Additionally, we enhance SAM-6D by replacing its instance segmentation module, the Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.
https://arxiv.org/abs/2505.11439
Mobile robots are reaching unprecedented speeds, with platforms like the Unitree B2 and the Fraunhofer O3dyn achieving maximum speeds between 5 and 10 m/s. However, effectively utilizing such speeds remains a challenge due to the limitations of RGB cameras, which suffer from motion blur and fail to provide real-time responsiveness. Event cameras, with their asynchronous operation and low-latency sensing, offer a promising alternative for high-speed robotic perception. In this work, we introduce MTevent, a dataset designed for 6D pose estimation and moving object detection in highly dynamic environments with large detection distances. Our setup consists of a stereo event camera and an RGB camera, capturing 75 scenes, each 16 seconds long on average, featuring 16 unique objects under challenging conditions such as extreme viewing angles, varying lighting, and occlusions. MTevent is the first dataset to combine high-speed motion, long-range perception, and real-world object interactions, making it a valuable resource for advancing event-based vision in robotics. To establish a baseline, we evaluate the task of 6D pose estimation using NVIDIA's FoundationPose on RGB images, achieving an Average Recall of 0.22 with ground-truth masks, highlighting the limitations of RGB-based approaches in such dynamic settings. With MTevent, we provide a novel resource to improve perception models and foster further research in high-speed robotic vision. The dataset is available for download at this https URL
https://arxiv.org/abs/2505.11282
Reliable three-dimensional human pose estimation is becoming increasingly important for real-world applications, yet much prior work has focused solely on performance within a single dataset. In practice, however, systems must adapt to diverse viewpoints, environments, and camera setups -- conditions that often differ significantly from those encountered during training. To address these challenges, we present a standardized testing environment in which each method is evaluated on a variety of datasets, ensuring consistent and fair cross-dataset comparisons and allowing methods to be analyzed on previously unseen data. To this end, we propose PoseBench3D, a unified framework designed to systematically re-evaluate prior and future models across four of the most widely used datasets for human pose estimation -- with the framework able to support novel and future datasets as the field progresses. Through a unified interface, our framework provides datasets in a pre-configured yet easily modifiable format, ensuring compatibility with diverse model architectures. We re-evaluated 18 methods, either trained by us or gathered from the existing literature, and report results using both the Mean Per Joint Position Error (MPJPE) and Procrustes-Aligned Mean Per Joint Position Error (PA-MPJPE) metrics, yielding more than 100 novel cross-dataset evaluation results. Additionally, we analyze performance differences resulting from various pre-processing techniques and dataset preparation parameters -- offering further insight into model generalization capabilities.
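For reference, a minimal sketch of the two reported metrics follows: PA-MPJPE aligns the prediction to the ground truth with a similarity (Procrustes) transform before measuring the per-joint error. These are the standard formulations, not the PoseBench3D implementation itself.

```python
# Illustrative sketch only: standard MPJPE / PA-MPJPE formulations.
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error. pred, gt: (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after optimal similarity (scale, rotation, translation) alignment."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Orthogonal Procrustes with scale: rotation from the SVD of the cross-covariance.
    U, S, Vt = np.linalg.svd(p.T @ g)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    s = np.trace(np.diag(S) @ D) / (p ** 2).sum()
    aligned = s * p @ R.T + mu_g
    return mpjpe(aligned, gt)

# Sanity check: a rotated, scaled, shifted copy has ~zero PA-MPJPE but nonzero MPJPE.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
pred = 1.2 * gt @ Rz.T + np.array([0.5, -0.1, 0.2])
print(round(mpjpe(pred, gt), 4), round(pa_mpjpe(pred, gt), 6))
```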
https://arxiv.org/abs/2505.10888
Estimating the 6D pose of unseen objects from monocular RGB images remains a challenging problem, especially due to the lack of prior object-specific knowledge. To tackle this issue, we propose RefPose, an innovative approach to object pose estimation that leverages a reference image and geometric correspondence as guidance. RefPose first predicts an initial pose by using object templates to render the reference image and establish the geometric correspondence needed for the refinement stage. During the refinement stage, RefPose estimates the geometric correspondence of the query based on the generated references and iteratively refines the pose through a render-and-compare approach. To enhance this estimation, we introduce a correlation volume-guided attention mechanism that effectively captures correlations between the query and reference images. Unlike traditional methods that depend on pre-defined object models, RefPose dynamically adapts to new object shapes by leveraging a reference image and geometric correspondence. This results in robust performance across previously unseen objects. Extensive evaluation on the BOP benchmark datasets shows that RefPose achieves state-of-the-art results while maintaining a competitive runtime.
https://arxiv.org/abs/2505.10841
Sparse wearable inertial measurement units (IMUs) have gained popularity for estimating 3D human motion. However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online fusing-all state estimation framework for 3D human shape and pose estimation, supported by six integrated, body-worn ultra-wideband (UWB) distance sensors with IMUs. UWB sensors measure inter-node distances to infer spatial relationships, aiding in resolving pose ambiguities and body shape variations when combined with anthropometric data. Unfortunately, IMUs are prone to drift, and UWB sensors are affected by body occlusions. Consequently, we develop a tightly coupled Unscented Kalman Filter (UKF) framework that fuses uncertainties from sensor data and estimated human motion based on individual body shape. The UKF iteratively refines IMU and UWB measurements by aligning them with uncertain human motion constraints in real time, producing optimal estimates for each. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of UMotion in stabilizing sensor data and its improvement over the state of the art in pose accuracy.
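The sketch below shows only the UWB side of such a fusion: the measurement model predicts pairwise distances between the estimated positions of the body-worn nodes, and the residual against the measured ranges is what a filter like the UKF would consume. The UKF machinery itself is omitted, and the node positions and noise levels are toy values.

```python
# Illustrative sketch only: the UWB inter-node distance measurement model,
# without the UKF itself. Node positions and noise levels are toy values.
import numpy as np
from itertools import combinations

def predict_uwb_ranges(node_positions):
    """node_positions: (N, 3) estimated 3D positions of the body-worn UWB/IMU nodes."""
    pairs = list(combinations(range(len(node_positions)), 2))
    return pairs, np.array([np.linalg.norm(node_positions[i] - node_positions[j])
                            for i, j in pairs])

def range_residual(node_positions, measured_ranges):
    """Measured minus predicted inter-node distances (the innovation a filter would use)."""
    _, predicted = predict_uwb_ranges(node_positions)
    return measured_ranges - predicted

# Six nodes roughly at the wrists, ankles, head and pelvis of a standing person.
rng = np.random.default_rng(0)
true_nodes = np.array([[ 0.3, 0.0, 1.0], [-0.3, 0.0, 1.0],   # wrists
                       [ 0.1, 0.0, 0.1], [-0.1, 0.0, 0.1],   # ankles
                       [ 0.0, 0.0, 1.7], [ 0.0, 0.0, 1.0]])  # head, pelvis
_, measured = predict_uwb_ranges(true_nodes)
measured += rng.normal(0, 0.02, size=measured.shape)          # ranging noise
estimate = true_nodes + rng.normal(0, 0.05, size=true_nodes.shape)
print(np.abs(range_residual(estimate, measured)).mean())      # small residual drives the update
```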
https://arxiv.org/abs/2505.09393
Precise initialization plays a critical role in the performance of localization algorithms, especially in the context of robotics, autonomous driving, and computer vision. Poor localization accuracy is often a consequence of inaccurate initial poses, particularly noticeable in GNSS-denied environments where GPS signals are primarily relied upon for initialization. Recent advances in leveraging deep neural networks for pose regression have led to significant improvements in both accuracy and robustness, especially in estimating complex spatial relationships and orientations. In this paper, we introduce APR-Transformer, a model architecture inspired by state-of-the-art methods, which predicts absolute pose (3D position and 3D orientation) using either image or LiDAR data. We demonstrate that our proposed method achieves state-of-the-art performance on established benchmark datasets such as the Radar Oxford Robot-Car and DeepLoc datasets. Furthermore, we extend our experiments to include our custom complex APR-BeIntelli dataset. Additionally, we validate the reliability of our approach in GNSS-denied environments by deploying the model in real time on an autonomous test vehicle, showcasing the practical feasibility and effectiveness of our approach. The source code is available at: this https URL.
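The sketch below shows a generic absolute pose regression head: backbone features are mapped to a 3D translation and a unit quaternion and trained with a weighted position-plus-orientation loss. The feature size, loss weighting, and toy backbone features are assumptions, not the APR-Transformer architecture itself.

```python
# Illustrative sketch only: a generic absolute pose regression head, not the
# APR-Transformer backbone. Feature size and loss weighting are assumptions.
import torch
import torch.nn as nn

class PoseRegressionHead(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.trans = nn.Linear(feat_dim, 3)        # x, y, z
        self.rot = nn.Linear(feat_dim, 4)          # quaternion, normalized below

    def forward(self, features):
        t = self.trans(features)
        q = nn.functional.normalize(self.rot(features), dim=-1)
        return t, q

def pose_loss(t_pred, q_pred, t_gt, q_gt, beta=10.0):
    # Orientation error is measured up to sign, since q and -q encode the same rotation.
    pos = (t_pred - t_gt).norm(dim=-1).mean()
    rot = (1.0 - (q_pred * q_gt).sum(dim=-1).abs()).mean()
    return pos + beta * rot

head = PoseRegressionHead()
features = torch.randn(8, 512)                     # stand-in for backbone (image/LiDAR) features
t, q = head(features)
loss = pose_loss(t, q, torch.zeros(8, 3), torch.tensor([[1.0, 0, 0, 0]]).repeat(8, 1))
loss.backward()
print(t.shape, q.shape, float(loss))
```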
https://arxiv.org/abs/2505.09356
Autonomy in Minimally Invasive Robotic Surgery (MIRS) has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, implementing accurate autonomous control can be difficult due to poor end-effector proprioception, a limitation stemming from these robots' cable-driven mechanisms. Although the robot may have joint encoders for end-effector pose calculation, various non-idealities make the entire kinematics chain inaccurate. Modern vision-based pose estimation methods lack real-time capability or can be hard to train and generalize. In this work, we demonstrate a real-time capable, vision transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering in simulation. We demonstrate the potential of this method to correct for noisy pose estimates in simulation, with the longer-term goal of verifying the sim-to-real transferability of our approach.
https://arxiv.org/abs/2505.08875
Bed-based pressure-sensitive mats (PSMs) offer a non-intrusive way of monitoring patients during sleep. We focus on four-way sleep position classification using data collected from a PSM placed under a mattress in a sleep clinic. Sleep positions can affect sleep quality and the prevalence of sleep disorders, such as apnea. Measurements were performed on patients with suspected sleep disorders referred for assessment at a sleep clinic. Training deep learning models can be challenging in clinical settings due to the need for large amounts of labeled data. To overcome the shortage of labeled training data, we utilize transfer learning to adapt pre-trained deep learning models to accurately estimate sleep positions from a low-resolution PSM dataset collected in a polysomnography sleep lab. Our approach leverages Vision Transformer models pre-trained on ImageNet using masked autoencoding (ViTMAE) and a pre-trained model for human pose estimation (ViTPose). These approaches outperform previous work on PSM-based sleep pose classification using deep learning (TCN), as well as traditional machine learning models (SVM, XGBoost, Random Forest) that use engineered features. We evaluate the performance of sleep position classification on 112 nights of patient recordings and validate it on a higher-resolution 13-patient dataset. Despite the challenges of differentiating between sleep positions from low-resolution PSM data, our approach shows promise for real-world deployment in clinical settings.
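The sketch below illustrates the transfer-learning setup: a pretrained ViT backbone has its classification head swapped for a four-way sleep-position head and only the new head is trained. It uses torchvision's vit_b_16 as a stand-in (the paper starts from ViTMAE/ViTPose checkpoints) and assumes the low-resolution pressure maps are upsampled and replicated to 3-channel 224x224 inputs.

```python
# Illustrative sketch only: torchvision's vit_b_16 stands in for the ViTMAE /
# ViTPose checkpoints, and the pressure-map preprocessing is assumed.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

model = vit_b_16(weights=None)                     # load pretrained weights in practice
model.heads.head = nn.Linear(model.heads.head.in_features, 4)   # supine/prone/left/right

# Freeze the backbone and train only the new head on the small PSM dataset.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")

optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One toy training step on random "pressure images".
x = torch.randn(4, 3, 224, 224)
y = torch.randint(0, 4, (4,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```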
https://arxiv.org/abs/2505.08111
Musculoskeletal disorders (MSDs) are a leading cause of injury and productivity loss in the manufacturing industry, incurring substantial economic costs. Ergonomic assessments can mitigate these risks by identifying workplace adjustments that improve posture and reduce strain. Camera-based systems offer a non-intrusive, cost-effective method for continuous ergonomic tracking, but they also raise significant privacy concerns. To address this, we propose a privacy-aware ergonomic assessment framework utilizing machine learning techniques. Our approach employs adversarial training to develop a lightweight neural network that obfuscates video data, preserving only the essential information needed for human pose estimation. This obfuscation ensures compatibility with standard pose estimation algorithms, maintaining high accuracy while protecting privacy. The obfuscated video data is transmitted to a central server, where state-of-the-art keypoint detection algorithms extract body landmarks. Using multi-view integration, 3D keypoints are reconstructed and evaluated with the Rapid Entire Body Assessment (REBA) method. Our system provides a secure, effective solution for ergonomic monitoring in industrial environments, addressing both privacy and workplace safety concerns.
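The sketch below captures the adversarial objective in schematic form: the obfuscator is updated so that a pose estimator still succeeds on obfuscated frames while an adversarial reconstructor fails to recover the original frame (the reconstructor's own opposing update step is omitted). All three networks are tiny stand-ins and the loss weighting is an assumption, not the paper's architecture.

```python
# Illustrative sketch only: tiny stand-in networks and an assumed loss weighting;
# the adversary's opposing update step is omitted.
import torch
import torch.nn as nn

obfuscator = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(16, 3, 3, padding=1))
pose_estimator = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                               nn.Flatten(), nn.LazyLinear(17 * 2))   # 17 2D keypoints
reconstructor = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(16, 3, 3, padding=1))         # the adversary

opt = torch.optim.Adam(obfuscator.parameters(), lr=1e-4)
frames = torch.rand(2, 3, 64, 64)
gt_keypoints = torch.rand(2, 17 * 2)

# One obfuscator update: keep pose information, destroy what the adversary needs.
obfuscated = obfuscator(frames)
pose_loss = nn.functional.mse_loss(pose_estimator(obfuscated), gt_keypoints)
privacy_loss = nn.functional.mse_loss(reconstructor(obfuscated), frames)
loss = pose_loss - 0.5 * privacy_loss              # minimize pose error, maximize reconstruction error
opt.zero_grad()
loss.backward()
opt.step()
print(float(pose_loss), float(privacy_loss))
```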
https://arxiv.org/abs/2505.07306
The accuracy and efficiency of human body pose estimation depend on the quality of the data to be processed and on the particularities of these data. To demonstrate how dance videos can challenge pose estimation techniques, we first propose a new 3D human body pose estimation pipeline that combines up-to-date techniques and methods not yet used in dance analysis. Second, we perform tests and extensive experiments on dance video archives, and use visual analytics tools to evaluate the impact of several data parameters on human body pose. Our results are publicly available for research at this https URL
https://arxiv.org/abs/2505.07249