Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring the safety of these agents often requires localizing their pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., to guarantee safety even in the worst-case scenario, and external services may additionally not be trustworthy. We address this problem by presenting certified 3D pose estimation solely from a camera image and a known target geometry. This is realized by formally bounding the pose, leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world settings.
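The formal bounding described above builds on neural network verification; a minimal, illustrative sketch of one standard primitive in that field, interval bound propagation through an affine layer, is shown below (the specific layer and perturbation are assumptions, not the paper's method):

```python
def affine_interval(lo, hi, W, b):
    """Propagate element-wise input bounds [lo, hi] through the affine
    layer y = W @ x + b, returning sound element-wise output bounds."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        lo_acc = hi_acc = bias
        for w, l, h in zip(row, lo, hi):
            if w >= 0:  # positive weight: lower bound uses lower input
                lo_acc, hi_acc = lo_acc + w * l, hi_acc + w * h
            else:       # negative weight: bounds swap
                lo_acc, hi_acc = lo_acc + w * h, hi_acc + w * l
        out_lo.append(lo_acc)
        out_hi.append(hi_acc)
    return out_lo, out_hi

# Inputs known only up to +/-0.1 perturbation around (0.5, 1.0):
lo_out, hi_out = affine_interval([0.4, 0.9], [0.6, 1.1], W=[[1.0, -2.0]], b=[0.5])
# lo_out ≈ [-1.3], hi_out ≈ [-0.7]
```

Composing such sound per-layer bounds is what lets a verifier enclose the network output, and hence the pose, even in the worst case.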
https://arxiv.org/abs/2602.10032
This work presents a finite-time stable pose estimator (FTS-PE) for rigid bodies undergoing rotational and translational motion in three dimensions, using measurements from onboard sensors that provide position vectors to inertially-fixed points and body velocities. The FTS-PE is a full-state observer for the pose (position and orientation) and velocities and is obtained through a Lyapunov analysis that shows its stability in finite time and its robustness to bounded measurement noise. Further, this observer is designed directly on the state space, the tangent bundle of the Lie group of rigid body motions, SE(3), without using local coordinates or (dual) quaternion representations. Therefore, it can estimate arbitrary rigid body motions without encountering singularities or the unwinding phenomenon and be readily applied to autonomous vehicles. A version of this observer that does not need translational velocity measurements and uses only point clouds and angular velocity measurements from rate gyros is also obtained. It is discretized using the framework of geometric mechanics for numerical and experimental implementations. The numerical simulations compare the FTS-PE with a dual-quaternion extended Kalman filter and our previously developed variational pose estimator (VPE). The experimental results are obtained using point cloud images and rate gyro measurements from a Zed 2i stereo depth camera sensor. These results validate the stability and robustness of the FTS-PE.
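For readers unfamiliar with the SE(3) machinery the observer lives on, a small self-contained sketch of the exponential map from so(3) to SO(3) (Rodrigues' formula), one of the basic singularity-free tools such designs rely on:

```python
import math

def hat(w):
    """Map an angular-velocity vector to its skew-symmetric matrix."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def so3_exp(w):
    """Rodrigues' formula: exp(hat(w)) as a 3x3 rotation matrix."""
    theta = math.sqrt(sum(c * c for c in w))
    I = [[float(i == j) for j in range(3)] for i in range(3)]
    if theta < 1e-12:
        return I
    K = hat(w)
    a, b = math.sin(theta) / theta, (1.0 - math.cos(theta)) / theta**2
    KK = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    return [[I[i][j] + a * K[i][j] + b * KK[i][j] for j in range(3)]
            for i in range(3)]

# Quarter-turn about the z-axis maps the x-axis onto the y-axis:
R = so3_exp([0.0, 0.0, math.pi / 2])
```

This is background illustration only; the paper's finite-time observer and its geometric discretization are not reproduced here.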
https://arxiv.org/abs/2602.09414
Human pose estimation is fundamental to intelligent perception in the Internet of Things (IoT), enabling applications ranging from smart healthcare to human-computer interaction. While WiFi-based methods have gained traction, they often struggle with continuous motion and high computational overhead. This work presents WiFlow, a novel framework for continuous human pose estimation using WiFi signals. Unlike vision-based approaches such as two-dimensional deep residual networks that treat Channel State Information (CSI) as images, WiFlow employs an encoder-decoder architecture. The encoder captures spatio-temporal features of CSI using temporal and asymmetric convolutions, preserving the original sequential structure of signals. It then refines keypoint features of the tracked human bodies and captures their structural dependencies via axial attention. The decoder subsequently maps the encoded high-dimensional features into keypoint coordinates. Trained on a self-collected dataset of 360,000 synchronized CSI-pose samples from 5 subjects performing continuous sequences of 8 daily activities, WiFlow achieves a Percentage of Correct Keypoints (PCK) of 97.00% at a threshold of 20% (PCK@20) and 99.48% at PCK@50, with a mean per-joint position error of 0.008m. With only 4.82M parameters, WiFlow significantly reduces model complexity and computational cost, establishing a new performance baseline for practical WiFi-based human pose estimation. Our code and datasets are available at this https URL.
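The PCK metric reported above counts a predicted joint as correct when it lies within a threshold fraction of a reference scale of the ground truth. A minimal sketch, with the normalization scale left as an assumption since the abstract does not spell it out:

```python
import math

def pck(pred, gt, scale, thresh):
    """Percentage of Correct Keypoints: fraction of predicted joints
    within `thresh * scale` of the corresponding ground-truth joint."""
    correct = 0
    for (px, py), (gx, gy) in zip(pred, gt):
        if math.hypot(px - gx, py - gy) <= thresh * scale:
            correct += 1
    return correct / len(gt)

pred = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
gt   = [(0.1, 0.0), (1.0, 1.2), (0.0, 0.0)]
score = pck(pred, gt, scale=1.0, thresh=0.5)  # 2 of 3 joints correct
```

PCK@20 and PCK@50 correspond to `thresh=0.2` and `thresh=0.5` of whatever reference scale the benchmark uses.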
https://arxiv.org/abs/2602.08661
Gestures are a key component of non-verbal communication in traffic, often helping pedestrian-to-driver interactions when formal traffic rules may be insufficient. This problem becomes more apparent when autonomous vehicles (AVs) struggle to interpret such gestures. In this study, we present a gesture classification framework using 2D pose estimation applied to real-world video sequences from the WIVW dataset. We categorise gestures into four primary classes (Stop, Go, Thank & Greet, and No Gesture) and extract 76 static and dynamic features from normalised keypoints. Our analysis demonstrates that hand position and movement velocity are especially discriminative in distinguishing between gesture classes, achieving a classification accuracy score of 87%. These findings not only improve the perceptual capabilities of AV systems but also contribute to the broader understanding of pedestrian behaviour in traffic contexts.
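The static and dynamic features the study extracts can be sketched roughly as below; the torso-based normalization and the particular features shown are illustrative assumptions, not the paper's exact 76-feature set:

```python
import math

def normalize(frame, ref_idx=0, scale_idx=1):
    """Translate 2D keypoints so joint `ref_idx` is the origin and scale
    by the distance to joint `scale_idx` (e.g. a torso length)."""
    rx, ry = frame[ref_idx]
    sx, sy = frame[scale_idx]
    s = math.hypot(sx - rx, sy - ry) or 1.0
    return [((x - rx) / s, (y - ry) / s) for x, y in frame]

def velocity_features(frames, fps=30.0):
    """Dynamic features: per-joint speed between consecutive frames."""
    feats = []
    for prev, cur in zip(frames, frames[1:]):
        feats.append([math.hypot(cx - px, cy - py) * fps
                      for (px, py), (cx, cy) in zip(prev, cur)])
    return feats

f0 = normalize([(0, 0), (0, 2), (1, 1)])   # three joints, first frame
f1 = normalize([(0, 0), (0, 2), (1, 2)])   # hand (joint 2) moves up
v = velocity_features([f0, f1])            # v[0][2] captures hand speed
```

A classifier is then trained on such static (normalized positions) and dynamic (velocity) features; the abstract reports hand position and velocity as the most discriminative.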
https://arxiv.org/abs/2602.08479
Vision-guided robotic systems are increasingly deployed in precision alignment tasks that require reliable execution under near-field and off-axis configurations. While recent advances in pose estimation have significantly improved numerical accuracy, practical robotic systems still suffer from frequent execution failures even when pose estimates appear accurate. This gap suggests that pose accuracy alone is insufficient to guarantee execution-level reliability. In this paper, we reveal that such failures arise from a deterministic geometric error amplification mechanism, in which small pose estimation errors are magnified through system structure and motion execution, leading to unstable or failed alignment. Rather than modifying pose estimation algorithms, we propose a Reliability-aware Execution Gating mechanism that operates at the execution level. The proposed approach evaluates geometric consistency and configuration risk before execution, and selectively rejects or scales high-risk pose updates. We validate the proposed method on a real UR5 robotic platform performing single-step visual alignment tasks under varying camera-target distances and off-axis configurations. Experimental results demonstrate that the proposed execution gating significantly improves task success rates, reduces execution variance, and suppresses tail-risk behavior, while leaving average pose accuracy largely unchanged. Importantly, the proposed mechanism is estimator-agnostic and can be readily integrated with both classical geometry-based and learning-based pose estimation pipelines. These results highlight the importance of execution-level reliability modeling and provide a practical solution for improving robustness in near-field vision-guided robotic systems.
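The gating idea can be caricatured as follows: before executing a pose update, score its risk from geometric cues and either pass, damp, or reject it. The specific cues, risk formula, and thresholds below are hypothetical placeholders, not the paper's:

```python
def gate_update(delta_pose, reproj_rmse, depth_m, off_axis_deg,
                warn=0.5, reject=1.0):
    """Scale or reject a pose update based on a simple risk score.
    Risk grows with residual error, proximity, and off-axis viewing."""
    risk = reproj_rmse * (1.0 / max(depth_m, 0.05)) * (1.0 + off_axis_deg / 45.0)
    if risk >= reject:
        return None                      # too risky: skip this update
    if risk >= warn:                     # borderline: damp the motion
        scale = (reject - risk) / (reject - warn)
        return [scale * d for d in delta_pose]
    return delta_pose                    # low risk: execute as-is

low  = gate_update([0.02, 0.0, 0.01], reproj_rmse=0.2, depth_m=0.5, off_axis_deg=0.0)
high = gate_update([0.02, 0.0, 0.01], reproj_rmse=1.0, depth_m=0.5, off_axis_deg=45.0)
```

Because the gate only inspects inputs and outputs of the estimator, it is estimator-agnostic, matching the claim in the abstract.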
https://arxiv.org/abs/2602.08466
Deep learning has the potential to improve colonoscopy by enabling 3D reconstruction of the colon, providing a comprehensive view of mucosal surfaces and lesions, and facilitating the identification of unexplored areas. However, the development of robust methods is limited by the scarcity of large-scale ground truth data. We propose RealSynCol, a highly realistic synthetic dataset designed to replicate the endoscopic environment. Colon geometries extracted from 10 CT scans were imported into a virtual environment that closely mimics intraoperative conditions and rendered with realistic vascular textures. The resulting dataset comprises 28,130 frames, paired with ground truth depth maps, optical flow, 3D meshes, and camera trajectories. A benchmark study was conducted to evaluate the available synthetic colon datasets for the tasks of depth and pose estimation. Results demonstrate that the high realism and variability of RealSynCol significantly enhance generalization performance on clinical images, proving it to be a powerful tool for developing deep learning algorithms to support endoscopic diagnosis.
https://arxiv.org/abs/2602.08397
Camera pose estimation from sparse correspondences is a fundamental problem in geometric computer vision and remains particularly challenging in near-field scenarios, where strong perspective effects and heterogeneous measurement noise can significantly degrade the stability of analytic PnP solutions. In this paper, we present a geometric error propagation framework for camera pose estimation based on a parallel perspective approximation. By explicitly modeling how image measurement errors propagate through perspective geometry, we derive an error transfer model that characterizes the relationship between feature point distribution, camera depth, and pose estimation uncertainty. Building on this analysis, we develop a pose estimation method that leverages parallel perspective initialization and error-aware weighting within a Gauss-Newton optimization scheme, leading to improved robustness in proximity operations. Extensive experiments on both synthetic data and real-world images, covering diverse conditions such as strong illumination, surgical lighting, and underwater low-light environments, demonstrate that the proposed approach achieves accuracy and robustness comparable to state-of-the-art analytic and iterative PnP methods, while maintaining high computational efficiency. These results highlight the importance of explicit geometric error modeling for reliable camera pose estimation in challenging near-field settings.
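The core of the error-transfer idea can be illustrated with the pinhole model: a fixed pixel-level measurement error induces a lateral position error that grows with depth, which motivates depth-aware weighting in the optimization. A first-order sketch under that standard model (the paper's full propagation framework is richer):

```python
def lateral_error(sigma_px, depth, focal_px):
    """First-order propagation of image noise through a pinhole camera:
    u = f * X / Z  =>  sigma_X ≈ sigma_px * Z / f (in meters)."""
    return sigma_px * depth / focal_px

def error_aware_weight(sigma_px, depth, focal_px):
    """Inverse-variance weight for a Gauss-Newton residual."""
    s = lateral_error(sigma_px, depth, focal_px)
    return 1.0 / (s * s)

# The same 0.5 px noise hurts a far point 10x more than a near one:
near = lateral_error(0.5, depth=0.2, focal_px=800)   # 0.000125 m
far  = lateral_error(0.5, depth=2.0, focal_px=800)   # 0.00125 m
```

Weighting residuals by such inverse variances inside a Gauss-Newton loop is one standard way to express the "error-aware weighting" named in the abstract.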
https://arxiv.org/abs/2602.07888
Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34 deg and MedAE 17.1 deg for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is publicly available.
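The rotation errors reported above (MAE/MedAE in degrees) are conventionally geodesic angles between predicted and ground-truth rotations; a sketch of how such numbers might be computed, assuming rotation-matrix outputs:

```python
import math
from statistics import median

def geodesic_deg(R_pred, R_gt):
    """Angle of the relative rotation R_pred^T @ R_gt, in degrees,
    using trace(R_rel) = 1 + 2*cos(theta)."""
    tr = sum(R_pred[k][i] * R_gt[k][i] for i in range(3) for k in range(3))
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(c))

I    = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
errors = [geodesic_deg(I, I), geodesic_deg(I, Rz90)]  # [0, 90] degrees
mae, medae = sum(errors) / len(errors), median(errors)
```

Translation MAE/MedAE are simply the mean and median of Euclidean distances between predicted and ground-truth positions.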
https://arxiv.org/abs/2602.06211
Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration without relying on data replay or multiple adapters. We propose Share, a novel approach to parameter-efficient continual finetuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.
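For context, LoRA replaces a full weight update with the product of two thin matrices. A minimal sketch of the mechanism Share builds on (the shared, continually updated subspace is the paper's contribution and is not reproduced here):

```python
def matmul(A, B):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, alpha, r):
    """y = x @ (W + (alpha/r) * A @ B): frozen weight W plus a rank-r
    trainable update, so only A (d_in x r) and B (r x d_out) are tuned."""
    delta = matmul(A, B)
    s = alpha / r
    W_eff = [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return matmul(x, W_eff)

# d_in = d_out = 2, rank-1 adapter:
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [0.0]]           # d_in x r
B = [[0.0, 2.0]]             # r x d_out
y = lora_forward([[1.0, 1.0]], W, A, B, alpha=1.0, r=1)
```

Storing one such adapter per task is exactly the multi-adapter cost that Share's single shared subspace is designed to avoid.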
https://arxiv.org/abs/2602.06043
Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at this https URL.
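The few-step sampling that Flow Matching enables amounts to Euler integration of a learned velocity field along the ODE. As a toy illustration with a known closed-form field (for a single target point `mu`, the straight-line conditional flow has `v(x, t) = (mu - x) / (1 - t)`; this stand-in replaces the learned network):

```python
def sample_flow(x0, mu, steps):
    """Euler-integrate dx/dt = (mu - x) / (1 - t) from t=0 to t=1.
    This toy field transports any prior sample x0 onto mu along a
    straight (optimal-transport-style) path."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * (mu - x) / (1.0 - t)
    return x

# A handful of integration steps suffices; in FMPose3D, different noise
# seeds x0 yield different trajectories and hence different hypotheses.
print(sample_flow(x0=-3.0, mu=2.0, steps=4))  # → 2.0
```

With a real multimodal velocity field conditioned on 2D inputs, distinct seeds land on distinct plausible 3D poses, which the RPEA module then aggregates.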
https://arxiv.org/abs/2602.05755
Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
https://arxiv.org/abs/2602.05590
We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3). The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramer-Rao bound. The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures.
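The flavor of the per-measurement influence analysis can be shown on a toy Euclidean least-squares problem: the classical influence of residual i is H⁻¹ aᵢ rᵢ with Gauss-Newton Hessian H = Σ aᵢ aᵢᵀ, and it is amplified along weak-curvature (poorly observable) directions. A hypothetical 2D sketch, not the SE(3) construction itself:

```python
def influence(rows, residuals):
    """Per-measurement influence H^{-1} a_i r_i for 2D least squares
    with Jacobian rows a_i and Hessian H = sum a_i a_i^T."""
    h11 = sum(a[0] * a[0] for a in rows)
    h12 = sum(a[0] * a[1] for a in rows)
    h22 = sum(a[1] * a[1] for a in rows)
    det = h11 * h22 - h12 * h12
    inv = [[h22 / det, -h12 / det], [-h12 / det, h11 / det]]
    return [[(inv[0][0] * a[0] + inv[0][1] * a[1]) * r,
             (inv[1][0] * a[0] + inv[1][1] * a[1]) * r]
            for a, r in zip(rows, residuals)]

# Many measurements constrain the first axis; only one constrains the
# second (the weak-observability direction):
rows = [[1.0, 0.0]] * 100 + [[0.0, 1.0]]
infl = influence(rows, residuals=[0.1] * 101)
# infl[-1] is 100x larger than infl[0]: weak curvature amplifies influence.
```

The GOI generalizes this intuition intrinsically on SE(3), via the curvature operator's spectrum rather than a flat Hessian inverse.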
https://arxiv.org/abs/2602.05582
We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
https://arxiv.org/abs/2602.05572
We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics, bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets that focus on household or consumer products, use synthetic, clean tabletop setups, or capture objects solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, also captured in realistic industrial assembly settings. The dataset has diverse complexity, from simple to more challenging scenes, with single and multiple objects, including scenes with multiple instances of the same object, and is organized in two parts: the classic set and the extended set. The classic set includes a total of 4.6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate the dataset on a representative set of state-of-the-art methods for instance-based and novel object 6D pose estimation, as well as object detection and segmentation, showing that there is room for improvement in this domain. The dataset page can be found at this https URL.
https://arxiv.org/abs/2602.05555
Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score-based generative models have to some extent solved the rotational symmetry ambiguity problem in category-level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.
https://arxiv.org/abs/2602.05257
Operating effectively in novel real-world environments requires robotic systems to estimate and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot benchmarks and demonstrated that it outperforms existing algorithms for model-free few-shot pose estimation. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically-consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data efficient, interpretable and generalizable robot autonomy in novel environments.
https://arxiv.org/abs/2602.05029
Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. A global reference becomes ambiguous under multiple motions, while the local pointmap relies heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local point maps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos where 3D trajectory labels are scarce, we reformulate the same coupling constraints into self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses the current feed-forward performance baseline.
https://arxiv.org/abs/2602.04439
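The trajectory-pointmap consistency objective can be pictured as penalizing the 3D distance between a predicted camera-frame trajectory point and the local pointmap sampled at the corresponding 2D track location. The sketch below is a minimal numpy version under assumed shapes; it omits the controlled gradient flow, which in training would be implemented with stop-gradients on one branch or the other:

```python
import numpy as np

def sample_pointmap(pointmap, uv):
    """Nearest-neighbour lookup of 3D points at pixel coordinates (u, v)."""
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, pointmap.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, pointmap.shape[0] - 1)
    return pointmap[v, u]

def trajectory_pointmap_consistency(trajs, pointmaps, tracks_2d):
    """Mean 3D distance between predicted trajectory points and the
    per-frame local pointmap sampled at each 2D track location."""
    total, count = 0.0, 0
    for traj_f, pm_f, uv_f in zip(trajs, pointmaps, tracks_2d):
        pm_pts = sample_pointmap(pm_f, uv_f)
        total += np.linalg.norm(traj_f - pm_pts, axis=-1).sum()
        count += len(traj_f)
    return total / count

# Toy data: a flat pointmap at depth 1, two frames, two tracked points.
H, W = 4, 4
u, v = np.meshgrid(np.arange(W), np.arange(H))
pm = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # (H, W, 3)
pointmaps = [pm, pm]
tracks_2d = [np.array([[1.0, 2.0], [3.0, 0.0]]),
             np.array([[2.0, 2.0], [0.0, 1.0]])]
trajs = [sample_pointmap(p, uv) for p, uv in zip(pointmaps, tracks_2d)]
loss = trajectory_pointmap_consistency(trajs, pointmaps, tracks_2d)
print(loss)  # 0.0 when trajectories and pointmaps agree exactly
```

The "bidirectional" part of the objective evaluates this residual twice, once updating the trajectory head and once updating the pointmap head, which is where the controlled gradient flow comes in.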
3D human pose estimation from 2D images is a challenging problem due to depth ambiguity and occlusion. Because of these challenges, the task is underdetermined: there exist multiple -- possibly infinite -- poses that are plausible given the image. Despite this, many prior works assume the existence of a deterministic mapping and estimate a single pose given an image. Furthermore, methods based on machine learning require a large amount of paired 2D-3D data to train and suffer from generalization issues in unseen scenarios. To address both of these issues, we propose a framework for pose estimation using diffusion models, which enables sampling from a probability distribution over plausible poses that are consistent with a 2D image. Our approach falls under the guidance framework for conditional generation, and guides samples from an unconditional diffusion model, trained only on 3D data, using the gradients of the heatmaps from a 2D keypoint detector. We evaluate our method on the Human3.6M dataset under best-of-$m$ multiple-hypothesis evaluation, showing state-of-the-art performance among methods that do not require paired 2D-3D data for training. We additionally evaluate generalization ability using the MPI-INF-3DHP and 3DPW datasets and demonstrate competitive performance. Finally, we demonstrate the flexibility of our framework by using it for novel tasks, including pose generation and pose completion, without the need to train bespoke conditional models. We make code available at this https URL.
https://arxiv.org/abs/2602.03126
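The guidance mechanism, nudging each denoising step with gradients from 2D evidence, can be sketched in miniature. The orthographic projection, Gaussian "heatmap" score, and identity denoiser below are simplifying assumptions for illustration, not the paper's actual models:

```python
import numpy as np

def heatmap_log_score(pose_3d, keypoints_2d, sigma=1.0):
    """Gaussian log-score of orthographically projected joints against
    detected 2D keypoints (a stand-in for detector heatmaps)."""
    proj = pose_3d[:, :2]                      # orthographic projection
    return -np.sum((proj - keypoints_2d) ** 2) / (2 * sigma ** 2)

def guidance_grad(pose_3d, keypoints_2d, sigma=1.0):
    """Analytic gradient of the log-score w.r.t. the 3D pose."""
    g = np.zeros_like(pose_3d)
    g[:, :2] = -(pose_3d[:, :2] - keypoints_2d) / sigma ** 2
    return g

def guided_step(x_t, keypoints_2d, step=0.1, scale=0.5):
    """One toy guided update: an (identity) unconditional denoise
    plus a pull toward the 2D evidence."""
    return x_t + scale * step * guidance_grad(x_t, keypoints_2d)

rng = np.random.default_rng(0)
x = rng.normal(size=(17, 3))        # a noisy 3D pose sample (17 joints)
kps = rng.normal(size=(17, 2))      # detected 2D keypoints
for _ in range(300):
    x = guided_step(x, kps)
err = np.abs(x[:, : 2] - kps).max()
print(err)  # projected joints converge to the 2D evidence
```

Because only the projected coordinates are constrained, the depth dimension stays free, which is what lets repeated sampling produce multiple plausible 3D hypotheses for the same image.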
Real-world scenes are inherently crowded. Hence, estimating 3D poses of all nearby humans, tracking their movements over time, and understanding their activities within social and environmental contexts are essential for many applications, such as autonomous driving, robot perception, robot navigation, and human-robot interaction. However, most existing 3D human pose estimation datasets primarily focus on single-person scenes or are collected in controlled laboratory environments, which restricts their relevance to real-world applications. To bridge this gap, we introduce JRDB-Pose3D, which captures multi-human indoor and outdoor environments from a mobile robotic platform. JRDB-Pose3D provides rich 3D human pose annotations for such complex and dynamic scenes, including SMPL-based pose annotations with consistent body-shape parameters and track IDs for each individual over time. JRDB-Pose3D contains, on average, 5-10 human poses per frame, with some scenes featuring up to 35 individuals simultaneously. The proposed dataset presents unique challenges, including frequent occlusions, truncated bodies, and out-of-frame body parts, which closely reflect real-world environments. Moreover, JRDB-Pose3D inherits all available annotations from the JRDB dataset, such as 2D pose, information about social grouping, activities, and interactions, full-scene semantic masks with consistent human- and object-level tracking, and detailed annotations for each individual, such as age, gender, and race, making it a holistic dataset for a wide range of downstream perception and human-centric understanding tasks.
https://arxiv.org/abs/2602.03064
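The per-frame, per-person annotation structure described above (SMPL pose, consistent shape parameters, and persistent track IDs) can be pictured with a small sketch; the field names below are hypothetical and not the dataset's actual schema:

```python
# Hypothetical record layout for multi-person, tracked 3D pose
# annotations (field names are illustrative, not JRDB-Pose3D's schema).
frames = [
    {"frame": 0, "people": [
        {"track_id": 7, "smpl_pose": [0.0] * 72, "betas": [0.0] * 10},
        {"track_id": 9, "smpl_pose": [0.1] * 72, "betas": [0.2] * 10},
    ]},
    {"frame": 1, "people": [
        {"track_id": 7, "smpl_pose": [0.05] * 72, "betas": [0.0] * 10},
    ]},
]

def track(frames, track_id):
    """Collect one person's pose sequence via the consistent track ID."""
    return [(f["frame"], p["smpl_pose"])
            for f in frames for p in f["people"]
            if p["track_id"] == track_id]

seq = track(frames, 7)
print(len(seq))  # person 7 appears in 2 frames
```

The consistent track IDs and fixed body-shape parameters per individual are what make sequence-level tasks (tracking, activity understanding) possible on top of per-frame pose labels.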
Privacy preservation is a prerequisite for using video data in Operating Room (OR) research. Effective anonymization relies on the exhaustive localization of every individual; even a single missed detection necessitates extensive manual correction. However, existing approaches face two critical scalability bottlenecks: (1) they usually require manual annotations for each new clinical site to achieve high accuracy; (2) while multi-camera setups have been widely adopted to address single-view ambiguity, camera calibration is typically required whenever cameras are repositioned. To address these problems, we propose a novel self-supervised multi-view video anonymization framework consisting of whole-body person detection and whole-body pose estimation, without annotation or camera calibration. Our core strategy is to enhance the single-view detector by "retrieving" false negatives using temporal and multi-view context, and by conducting self-supervised domain adaptation. We first run an off-the-shelf whole-body person detector in each view with a low score threshold to gather candidate detections. Then, we retrieve the low-score false negatives that exhibit consistency with the high-score detections via tracking and self-supervised uncalibrated multi-view association. These recovered detections serve as pseudo labels to iteratively fine-tune the whole-body detector. Finally, we apply whole-body pose estimation to each detected person, and fine-tune the pose model using its own high-score predictions. Experiments on the 4D-OR dataset of simulated surgeries and our dataset of real surgeries show the effectiveness of our approach, achieving over 97% recall. Moreover, we train a real-time whole-body detector using our pseudo labels, achieving comparable performance and highlighting our method's practical applicability. Code is available at this https URL.
https://arxiv.org/abs/2602.02850
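The retrieval step, recovering low-score detections that agree with trusted context, can be sketched with boxes and scores. The IoU-against-anchor test below is a simplified stand-in for the paper's tracking and uncalibrated multi-view association:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def retrieve_pseudo_labels(candidates, anchors, hi=0.7, lo=0.1, iou_thr=0.5):
    """Keep high-score detections, and additionally recover low-score
    candidates that agree with an anchor box (e.g. one propagated by
    tracking or multi-view association). Returns pseudo-label boxes."""
    labels = [c["box"] for c in candidates if c["score"] >= hi]
    for c in candidates:
        if lo <= c["score"] < hi and any(iou(c["box"], a) >= iou_thr
                                         for a in anchors):
            labels.append(c["box"])
    return labels

candidates = [
    {"box": [0, 0, 10, 10],   "score": 0.9},   # confident detection
    {"box": [20, 20, 30, 30], "score": 0.3},   # likely false negative
    {"box": [50, 50, 60, 60], "score": 0.3},   # unsupported candidate
]
anchors = [[21, 21, 31, 31]]  # tracked prediction near the second box
labels = retrieve_pseudo_labels(candidates, anchors)
print(len(labels))  # 2: one confident box plus one retrieved box
```

Iterating this step, then fine-tuning the detector on the recovered pseudo labels, is what lets the pipeline adapt to a new site without any manual annotation.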