The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation-model-based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO method. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
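As a toy illustration of the data-driven selection idea, the sketch below trains a Bernoulli keep/skip policy with REINFORCE on a single hand-made "parallax" feature and a hypothetical keep-cost reward; the real method learns from the foundation model's latent representations, which this stand-in does not attempt to model.

```python
import numpy as np

def keep_prob(parallax, w):
    """Stochastic keyframe policy: probability of keeping a frame given a
    single parallax-like feature (a toy stand-in for latent cues)."""
    return 1.0 / (1.0 + np.exp(-w * parallax))

# REINFORCE sketch: a kept frame earns (parallax - cost), a skipped frame
# earns 0, so the policy should learn to keep only high-parallax frames.
rng = np.random.default_rng(0)
w, lr, cost = 0.0, 0.1, 1.0
for _ in range(500):
    f = rng.uniform(0.0, 2.0, size=32)        # per-frame parallax proxy
    p = keep_prob(f, w)
    a = (rng.random(32) < p).astype(float)    # sampled keep/skip actions
    reward = a * (f - cost)
    adv = reward - reward.mean()              # baseline-subtracted return
    w += lr * np.mean(adv * (a - p) * f)      # grad of log Bernoulli(sigmoid(w*f))
```

After training, the policy assigns a higher keep probability to high-parallax frames than to low-parallax ones.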
https://arxiv.org/abs/2601.16020
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
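The constrained 3D optimization builds on multi-view triangulation of the tracked 2D keypoints; a minimal direct-linear-transform (DLT) sketch of that building block, with toy identity-intrinsics cameras (not the paper's calibrated rig), is:

```python
import numpy as np

def triangulate_joint(points_2d, proj_mats):
    """Linear (DLT) triangulation of a single joint from two or more views.
    This is the standard building block; a constrained optimisation would
    add anatomical priors on top of such an estimate."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])   # each view contributes two
        rows.append(v * P[2] - P[1])   # linear constraints on X
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                         # null-space solution, homogeneous
    return X[:3] / X[3]

# two toy cameras with identity intrinsics and a 0.5 m baseline along x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 3.0])
obs = [P @ np.append(X_true, 1.0) for P in (P1, P2)]
pts = [(x[0] / x[2], x[1] / x[2]) for x in obs]
X_hat = triangulate_joint(pts, [P1, P2])
```

With noise-free observations the estimate recovers the true joint position exactly (up to floating-point error).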
https://arxiv.org/abs/2601.15918
With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to cluster a large-scale pre-trained codebook into smaller subsets, dynamically adjusting the compression bitrate. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead, respectively, by up to 5x and 2.5x. The code repository is available online.
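A minimal sketch of the codebook clustering step, assuming a plain Lloyd's K-means over a randomly generated stand-in codebook (the actual VQGAN codebook and bitrate schedule are not modeled here): clustering 1024 codes down to 64 shrinks each transmitted index from 10 to 6 bits.

```python
import numpy as np

def kmeans(codebook, k, iters=20, seed=0):
    """Plain Lloyd's K-means: cluster a large codebook into k centroids."""
    rng = np.random.default_rng(seed)
    centroids = codebook[rng.choice(len(codebook), size=k, replace=False)]
    for _ in range(iters):
        # assign every code vector to its nearest centroid, then re-estimate
        d = np.linalg.norm(codebook[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = codebook[labels == j].mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
codebook = rng.normal(size=(1024, 8))    # stand-in for the pre-trained codebook
sub = kmeans(codebook, k=64)             # smaller subset -> lower bitrate
feature = rng.normal(size=8)
index = np.linalg.norm(sub - feature, axis=1).argmin()  # 6-bit index vs 10-bit
```

Quantizing against the clustered subset trades reconstruction fidelity for a smaller per-token bitrate.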
https://arxiv.org/abs/2601.15838
The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >=50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks like index finger pinch and tap estimation in occluded scenarios but also unlocks new interaction paradigms, such as detecting isometric force for a surface "click" without visible movement while minimizing model size.
https://arxiv.org/abs/2601.15516
Most 2D human pose estimation benchmarks are nearly saturated, with the exception of crowded scenes. We introduce PMPose, a top-down 2D pose estimator that incorporates a probabilistic formulation and mask conditioning. PMPose improves crowded pose estimation without sacrificing performance on standard scenes. Building on this, we present BBoxMaskPose v2 (BMPv2), integrating PMPose with an enhanced SAM-based mask refinement module. BMPv2 surpasses the state of the art by 1.5 average precision (AP) points on COCO and 6 AP points on OCHuman, becoming the first method to exceed 50 AP on OCHuman. We demonstrate that BMP's 2D prompting of a 3D model improves 3D pose estimation in crowded scenes and that advances in 2D pose quality directly benefit 3D estimation. Results on the new OCHuman-Pose dataset show that multi-person performance is more affected by pose prediction accuracy than by detection. The code, models, and data are available on this https URL.
https://arxiv.org/abs/2601.15200
In Robot-Assisted Minimally Invasive Surgery (RMIS), accurate tool localization is crucial to ensure patient safety and successful task execution. However, this remains challenging for cable-driven robots, such as the da Vinci robot, because erroneous encoder readings lead to pose estimation errors. In this study, we propose a calibration framework that produces accurate tool localization results by computing the hand-eye transformation matrix on the fly. The framework consists of two interrelated algorithms: the feature association block, which provides robust correspondences for key points detected in monocular images without pre-training, and the hand-eye calibration block, which offers the versatility to accommodate various surgical scenarios by adopting an array of filtering approaches. To validate its efficacy, we test the framework extensively on publicly available video datasets that feature multiple surgical instruments conducting tasks in both in vitro and ex vivo scenarios, under varying illumination conditions and with different levels of key point measurement accuracy. The results show a significant reduction in tool localization errors under the proposed calibration framework, with accuracies comparable to other state-of-the-art methods while being more time-efficient.
https://arxiv.org/abs/2601.14871
Estimating 3D from 2D is one of the central tasks in computer vision. In this work, we consider the monocular setting, i.e. single-view input, for 3D human pose estimation (HPE). Here, the task is to predict a 3D point set of human skeletal joints from a single 2D input image. While by definition this is an ill-posed problem, recent work has presented methods that solve it with up to several-centimetre error. Typically, these methods employ a two-step approach, where the first step is to detect the 2D skeletal joints in the input image, followed by the step of 2D-to-3D lifting. We find that common lifting models fail when encountering a rotated input. We argue that learning a single human pose along with its in-plane rotations is considerably easier and more geometrically grounded than directly learning a point-to-point mapping. Furthermore, our intuition is that endowing the model with the notion of rotation equivariance without explicitly constraining its parameter space should lead to a more straightforward learning process than one with equivariance by design. Utilising the common HPE benchmarks, we confirm that 2D rotation equivariance per se improves the model performance on human poses related by in-plane rotations, and can be efficiently and straightforwardly learned by augmentation, outperforming state-of-the-art equivariant-by-design methods.
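The augmentation-based route to 2D rotation equivariance can be sketched as paired in-plane rotations of the 2D input and the 3D target about the camera's optical axis; joint indices and the 17-joint skeleton below are illustrative only.

```python
import numpy as np

def rotate_pose_2d(kpts_2d, theta):
    """In-plane rotation of 2D keypoints (about the origin, for simplicity)."""
    c, s = np.cos(theta), np.sin(theta)
    return kpts_2d @ np.array([[c, -s], [s, c]]).T

def rotate_pose_3d(kpts_3d, theta):
    """Matching rotation of the 3D target about the camera's optical (z) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return kpts_3d @ np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]).T

# one augmented 2D/3D training pair for a 17-joint skeleton
rng = np.random.default_rng(0)
pose2d, pose3d = rng.normal(size=(17, 2)), rng.normal(size=(17, 3))
theta = rng.uniform(0.0, 2.0 * np.pi)
aug2d, aug3d = rotate_pose_2d(pose2d, theta), rotate_pose_3d(pose3d, theta)
```

By construction, the augmented 3D target's xy components equal the 2D rotation of the original xy components while depth is untouched, so the 2D and 3D halves of each training pair stay geometrically consistent.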
https://arxiv.org/abs/2601.13913
As aerial platforms evolve from passive observers to active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel concept for an autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate MediaPipe, Grounding DINO, and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. The VLA model performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping relevant objects in the scene. Grounding DINO and a dynamic A* planning algorithm are used to navigate and safely relocate the object. To ensure safe and natural interaction during the handover phase, the system employs a human-centric controller driven by MediaPipe. This module provides real-time human pose estimation, allowing the drone to employ visual servoing to maintain a stable position directly in front of the user, facilitating a comfortable handover. We demonstrate the system's efficacy through real-world experiments on localization and navigation, which resulted in maximum, mean Euclidean, and root-mean-square errors of 0.164 m, 0.070 m, and 0.084 m, respectively, highlighting the feasibility of VLA models for aerial manipulation operations.
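The handover behaviour described above amounts to a servo loop toward a setpoint held at a fixed offset from the user's estimated position; a proportional-control sketch with illustrative gain and offset values (not taken from the paper) is:

```python
import numpy as np

def servo_velocity(drone_xyz, user_xyz, offset, gain=0.8):
    """Proportional visual-servoing law: command a velocity toward a setpoint
    held at a fixed offset in front of the detected user. Gain and offset
    values here are illustrative assumptions."""
    target = np.asarray(user_xyz) + np.asarray(offset)
    return gain * (target - np.asarray(drone_xyz))

# simulate convergence to a hover point 1.2 m in front of the user
drone = np.array([0.0, 0.0, 1.0])
user = np.array([2.0, 0.5, 1.5])            # from the pose estimator
offset = np.array([-1.2, 0.0, 0.0])         # hover 1.2 m in front (along -x)
dt = 0.1
for _ in range(200):
    drone = drone + servo_velocity(drone, user, offset) * dt
```

The error shrinks geometrically (factor 1 - gain * dt per step), so the drone settles at the handover setpoint.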
https://arxiv.org/abs/2601.13809
Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. Our core innovation lies in leveraging a patch-to-patch correlation matrix as a structural prior to narrow the matching scope, effectively filtering out irrelevant clutter before it can degrade pose estimation. Firstly, we introduce an object-centric disentanglement preprocessing to isolate the semantic target from environmental noise. Secondly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning. Finally, we design a Patch Correlation Predictor (PCP) that generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at this https URL.
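A minimal sketch of a patch-to-patch correlation map used as a spatial filter, with random feature stand-ins for the anchor and query patch embeddings (the learned CPGP/PCP modules are not modeled here):

```python
import numpy as np

def patch_correlation(anchor, query):
    """Cosine correlation between every anchor patch and every query patch."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    return a @ q.T                       # (n_anchor, n_query) association map

def spatial_filter(corr, keep=4):
    """Keep only the top-`keep` query patches per anchor patch; clutter
    outside this narrowed scope is zeroed out before matching."""
    mask = np.zeros_like(corr)
    idx = np.argsort(corr, axis=1)[:, -keep:]
    np.put_along_axis(mask, idx, 1.0, axis=1)
    return corr * mask

rng = np.random.default_rng(0)
anchor = rng.normal(size=(16, 32))       # 16 anchor patches, 32-d features
query = rng.normal(size=(20, 32))        # 20 query patches
corr = patch_correlation(anchor, query)
filt = spatial_filter(corr, keep=4)
```

Restricting each anchor patch to its strongest few query patches is the narrowing-of-scope idea in miniature: background distractors never enter the correspondence set.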
https://arxiv.org/abs/2601.13565
Autonomous navigation for nano-scale unmanned aerial vehicles (nano-UAVs) is governed by extreme Size, Weight, and Power (SWaP) constraints (weight below 50 g and a sub-100 mW onboard processor), distinguishing it fundamentally from standard robotic paradigms. This review synthesizes the state of the art in sensing, computing, and control architectures designed specifically for these sub-100 mW computational envelopes. We critically analyse the transition from classical geometry-based methods to emerging "Edge AI" paradigms, including quantized deep neural networks deployed on ultra-low-power System-on-Chips (SoCs) and neuromorphic event-based control. Beyond algorithms, we evaluate the hardware-software co-design requisite for autonomy, covering advancements in dense optical flow, optimized Simultaneous Localization and Mapping (SLAM), and learning-based flight control. While significant progress has been observed in visual navigation and relative pose estimation, our analysis reveals persistent gaps in long-term endurance, robust obstacle avoidance in dynamic environments, and the "Sim-to-Real" transfer of reinforcement learning policies. This survey provides a roadmap for bridging these gaps, advocating for hybrid architectures that fuse lightweight classical control with data-driven perception to enable fully autonomous, agile nano-UAVs in GPS-denied environments.
https://arxiv.org/abs/2601.13252
Tactile sensing provides a promising sensing modality for object pose estimation in manipulation settings where visual information is limited due to occlusion or environmental effects. However, efficiently leveraging tactile data for estimation remains a challenge due to partial observability, with single observations corresponding to multiple possible contact configurations. This limits conventional estimation approaches largely tailored to vision. We propose to address these challenges by learning an inverse tactile sensor model using denoising diffusion. The model is conditioned on tactile observations from a distributed tactile sensor and trained in simulation using a geometric sensor model based on signed distance fields. Contact constraints are enforced during inference through single-step projection using distance and gradient information from the signed distance field. For online pose estimation, we integrate the inverse model with a particle filter through a proposal scheme that combines generated hypotheses with particles from the prior belief. Our approach is validated in simulated and real-world planar pose estimation settings, without access to visual data or tight initial pose priors. We further evaluate robustness to unmodeled contact and sensor dynamics for pose tracking in a box-pushing scenario. Compared to local sampling baselines, the inverse sensor model improves sampling efficiency and estimation accuracy while preserving multimodal beliefs across objects with varying tactile discriminability.
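The proposal scheme can be sketched as replacing a fraction of the particle set with hypotheses drawn from the generative model; the planar (x, y, theta) poses and the default 50/50 split below are illustrative assumptions.

```python
import numpy as np

def mixed_proposal(prior_particles, hypotheses, ratio=0.5, seed=0):
    """Replace a fraction `ratio` of the particle set with pose hypotheses
    from a learned inverse sensor model; keep the rest from the prior."""
    rng = np.random.default_rng(seed)
    n = len(prior_particles)
    n_gen = int(ratio * n)
    keep = rng.choice(n, size=n - n_gen, replace=False)
    draw = rng.choice(len(hypotheses), size=n_gen, replace=True)
    return np.vstack([prior_particles[keep], hypotheses[draw]])

rng = np.random.default_rng(1)
prior = rng.normal(size=(100, 3))        # planar poses (x, y, theta)
hyps = rng.normal(size=(32, 3))          # hypotheses from the inverse model
particles = mixed_proposal(prior, hyps, ratio=0.25)
```

Keeping part of the prior preserves multimodal beliefs, while the injected hypotheses concentrate samples where the tactile observation is informative.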
https://arxiv.org/abs/2601.13250
WiFi-based 3D human pose estimation offers a low-cost and privacy-preserving alternative to vision-based systems for smart interaction. However, existing approaches rely on visual 3D poses as supervision and directly regress CSI to a camera-based coordinate system. We find that this practice leads to coordinate overfitting: models memorize deployment-specific WiFi transceiver layouts rather than only learning activity-relevant representations, resulting in severe generalization failures. To address this challenge, we present PerceptAlign, the first geometry-conditioned framework for WiFi-based cross-layout pose estimation. PerceptAlign introduces a lightweight coordinate unification procedure that aligns WiFi and vision measurements in a shared 3D space using only two checkerboards and a few photos. Within this unified space, it encodes calibrated transceiver positions into high-dimensional embeddings and fuses them with CSI features, making the model explicitly aware of device geometry as a conditional variable. This design forces the network to disentangle human motion from deployment layouts, enabling robust and, for the first time, layout-invariant WiFi pose estimation. To support systematic evaluation, we construct the largest cross-domain 3D WiFi pose estimation dataset to date, comprising 21 subjects, 5 scenes, 18 actions, and 7 device layouts. Experiments show that PerceptAlign reduces in-domain error by 12.3% and cross-domain error by more than 60% compared to state-of-the-art baselines. These results establish geometry-conditioned learning as a viable path toward scalable and practical WiFi sensing.
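A minimal sketch of geometry conditioning: the calibrated transceiver coordinates are lifted into a higher-dimensional embedding and concatenated with the CSI feature vector. The sin/cos positional encoding used here is a common generic choice, not necessarily the paper's exact encoder.

```python
import numpy as np

def fourier_embed(xyz, n_freq=4):
    """Lift 3D device coordinates into a higher-dimensional sin/cos embedding."""
    xyz = np.asarray(xyz, dtype=float)
    freqs = 2.0 ** np.arange(n_freq)
    ang = xyz[..., None] * freqs                 # (..., 3, n_freq)
    emb = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return emb.reshape(*xyz.shape[:-1], -1)      # (..., 3 * 2 * n_freq)

def condition_on_geometry(csi_feat, device_xyz):
    """Fuse CSI features with embeddings of the calibrated transceiver layout
    so the layout enters the model as an explicit conditional variable."""
    geo = fourier_embed(device_xyz).reshape(-1)  # flatten all devices
    return np.concatenate([csi_feat, geo])

devices = np.array([[0.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.0, 2.0, 1.0]])
fused = condition_on_geometry(np.zeros(128), devices)
```

Because the layout is an explicit input rather than something absorbed into the weights, the downstream network has no incentive to memorize any one deployment's geometry.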
https://arxiv.org/abs/2601.12252
The task of 6DoF object pose estimation is one of the fundamental problems of 3D vision with many practical applications such as industrial automation. Traditional deep learning approaches for this task often require extensive training data or CAD models, limiting their application in real-world industrial settings where data is scarce and object instances vary. We propose a novel method for 6DoF pose estimation focused specifically on bins used in industrial settings. We exploit the cuboid geometry of bins by first detecting intermediate 3D line segments corresponding to their top edges. Our approach extends the 2D line segment detection network LeTR to operate on structured point cloud data. The detected 3D line segments are then processed using a simple geometric procedure to robustly determine the bin's 6DoF pose. To evaluate our method, we extend an existing dataset with a newly collected and annotated dataset, which we make publicly available. We show that incorporating synthetic training data significantly improves pose estimation accuracy on real scans. Moreover, we show that our method significantly outperforms current state-of-the-art 6DoF pose estimation methods in terms of pose accuracy (3 cm translation error, 8.2° rotation error) while not requiring instance-specific CAD models during inference.
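Once the top-edge 3D segments are detected, a simple geometric procedure suffices for a coarse pose; the sketch below (centre from the endpoint mean, yaw from the longest edge) is an illustrative reduction to the planar case, not the paper's full procedure.

```python
import numpy as np

def bin_pose_from_top_edges(segments):
    """Coarse bin pose from its four top-edge 3D segments, shape (4, 2, 3):
    centre = mean of the eight endpoints, yaw = heading of the longest edge
    (determined up to 180 degrees by the rectangle's symmetry)."""
    segments = np.asarray(segments, dtype=float)
    center = segments.reshape(-1, 3).mean(axis=0)
    dirs = segments[:, 1] - segments[:, 0]
    longest = dirs[np.linalg.norm(dirs, axis=1).argmax()]
    yaw = np.arctan2(longest[1], longest[0])
    return center, yaw

# synthetic bin: 0.6 x 0.4 m top rim at height 0.3 m, rotated by 0.5 rad
theta, w, h = 0.5, 0.6, 0.4
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
corners = np.array([[-w/2, -h/2], [w/2, -h/2], [w/2, h/2], [-w/2, h/2]]) @ R.T
corners3 = np.hstack([corners, np.full((4, 1), 0.3)])
edges = np.stack([np.stack([corners3[i], corners3[(i + 1) % 4]]) for i in range(4)])
center, yaw = bin_pose_from_top_edges(edges)
```

On this noise-free rim the recovered centre and yaw match the ground truth; a robust version would fit against noisy, partially detected segments.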
https://arxiv.org/abs/2601.12090
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
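The pseudo motion masks can be sketched as a robust threshold on the static renderer's residuals; the MAD-based threshold below is an illustrative choice, not the paper's exact rule.

```python
import numpy as np

def pseudo_motion_mask(rendered, observed, k=2.0):
    """Transient regions = pixels the camera-only static rendering cannot
    explain: threshold the residual at k robust (MAD-based) deviations."""
    residual = np.abs(observed.astype(float) - rendered.astype(float))
    if residual.ndim == 3:                    # average over colour channels
        residual = residual.mean(axis=-1)
    med = np.median(residual)
    mad = np.median(np.abs(residual - med)) + 1e-8
    return residual > med + k * 1.4826 * mad

rendered = np.zeros((32, 32))                 # static renderer's prediction
observed = rendered.copy()
observed[8:16, 8:16] = 1.0                    # a transient object appears
mask = pseudo_motion_mask(rendered, observed)
```

The resulting binary mask is what would gate input tokens and loss gradients, so supervision concentrates on the static background.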
https://arxiv.org/abs/2601.10716
3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
https://arxiv.org/abs/2601.09698
Low-cost inertial measurement units (IMUs) are widely utilized in mobile robot localization due to their affordability and ease of integration. However, their complex, nonlinear, and time-varying noise characteristics often lead to significant degradation in localization accuracy when applied directly for dead reckoning. To overcome this limitation, we propose a novel brain-inspired state estimation framework that combines a spiking neural network (SNN) with an invariant extended Kalman filter (InEKF). The SNN is designed to extract motion-related features from long sequences of IMU data affected by substantial random noise and is trained via a surrogate gradient descent algorithm to enable dynamic adaptation of the covariance noise parameter within the InEKF. By fusing the SNN output with raw IMU measurements, the proposed method enhances the robustness and accuracy of pose estimation. Extensive experiments conducted on the KITTI dataset and real-world data collected using a mobile robot equipped with a low-cost IMU demonstrate that the proposed approach outperforms state-of-the-art methods in localization accuracy and exhibits strong robustness to sensor noise, highlighting its potential for real-world mobile robot applications.
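The covariance adaptation can be sketched with a scalar scale factor standing in for the SNN output; for brevity the sketch uses a plain Kalman measurement update rather than the invariant EKF.

```python
import numpy as np

def adaptive_update(x, P, z, H, R_base, scale):
    """Kalman measurement update where `scale` stands in for a learned
    network output that adapts the measurement-noise covariance on the fly."""
    R = scale * R_base
    S = H @ P @ H.T + R                       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

x = np.zeros(2)
P = np.eye(2)
H = np.array([[1.0, 0.0]])
R_base = np.array([[0.1]])
z = np.array([1.0])
x_lo, _ = adaptive_update(x, P, z, H, R_base, scale=1.0)    # trusted reading
x_hi, _ = adaptive_update(x, P, z, H, R_base, scale=100.0)  # noisy reading
```

Inflating the covariance makes the filter discount the same measurement, which is exactly the lever the learned noise model pulls when the IMU data look unreliable.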
https://arxiv.org/abs/2601.08248
We introduce Fiducial Exoskeletons, an image-based reformulation of 3D robot state estimation that replaces cumbersome procedures and motor-centric pipelines with single-image inference. Traditional approaches - especially robot-camera extrinsic estimation - often rely on high-precision actuators and require time-consuming routines such as hand-eye calibration. In contrast, modern learning-based robot control is increasingly trained and deployed from RGB observations on lower-cost hardware. Our key insight is twofold. First, we cast robot state estimation as 6D pose estimation of each link from a single RGB image: the robot-camera base transform is obtained directly as the estimated base-link pose, and the joint state is recovered via a lightweight global optimization that enforces kinematic consistency with the observed link poses (optionally warm-started with encoder readings). Second, we make per-link 6D pose estimation robust and simple - even without learning - by introducing the fiducial exoskeleton: a lightweight 3D-printed mount with a fiducial marker on each link and known marker-link geometry. This design yields robust camera-robot extrinsics, per-link SE(3) poses, and joint-angle state from a single image, enabling robust state estimation even on unplugged robots. Demonstrated on a low-cost robot arm, fiducial exoskeletons substantially simplify setup while improving calibration, state accuracy, and downstream 3D control performance. We release code and printable hardware designs to enable further algorithm-hardware co-design.
https://arxiv.org/abs/2601.08034
This paper presents a method for carrying out fair comparisons of the accuracy of pose estimation using fiducial markers. These comparisons rely on large sets of high-fidelity synthetic images, enabling deep exploration of the 6 degrees of freedom. A low-discrepancy sampling of the space allows checking the correlations between each degree of freedom and the pose errors by plotting the 36 pairs of combinations. The images are rendered using a physically based ray-tracing code that has been specifically developed to use the standard calibration coefficients of any camera directly. The software reproduces image distortions, defocus, and diffraction blur. Furthermore, sub-pixel sampling is applied to sharp edges to enhance the fidelity of the rendered image. After introducing the rendering algorithm and its experimental validation, the paper proposes a method for evaluating pose accuracy. This method is applied to well-known markers, revealing their strengths and weaknesses for pose estimation. The code is open source and available on GitHub.
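Low-discrepancy coverage of the 6 pose degrees of freedom can be sketched with a Halton sequence (one prime base per dimension); the pose bounds below are illustrative, not taken from the paper.

```python
import numpy as np

def halton(n, base):
    """1-D Halton sequence: the radical inverse of 1..n in the given base."""
    out = np.zeros(n)
    for i in range(n):
        f, x, k = 1.0, 0.0, i + 1
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        out[i] = x
    return out

def sample_poses(n, lo, hi):
    """Low-discrepancy samples of the 6 pose DoF (x, y, z, roll, pitch, yaw),
    one prime base per dimension, scaled into per-axis bounds."""
    primes = [2, 3, 5, 7, 11, 13]
    u = np.stack([halton(n, b) for b in primes], axis=1)
    return lo + u * (hi - lo)

lo = np.array([-1.0, -1.0, 0.5, -np.pi, -np.pi / 2, -np.pi])
hi = np.array([1.0, 1.0, 2.0, np.pi, np.pi / 2, np.pi])
poses = sample_poses(256, lo, hi)
```

Unlike a uniform random draw, consecutive Halton samples fill the 6-D box evenly, which is what makes the per-DoF error correlation plots well conditioned even with modest sample counts.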
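The evaluation protocol above (low-discrepancy sampling of the 6-DoF space, then checking how the pose error correlates with each degree of freedom) can be sketched with scipy's quasi-Monte Carlo module. The sampling bounds and the synthetic error model are illustrative assumptions, and a single scalar error stands in for the paper's 36 DoF-versus-error-component plots.

```python
import numpy as np
from scipy.stats import qmc

# Low-discrepancy (Halton) sampling of the 6-DoF pose space.
labels = ["tx", "ty", "tz", "rx", "ry", "rz"]
sampler = qmc.Halton(d=6, seed=0)
unit = sampler.random(n=512)
lower = np.array([-0.5, -0.5, 0.5, -60, -60, -180])  # meters / degrees (assumed)
upper = np.array([ 0.5,  0.5, 2.0,  60,  60,  180])
poses = qmc.scale(unit, lower, upper)

# Stand-in for rendering + marker detection: error grows with camera
# distance (tz) and out-of-plane rotation (rx), plus a little noise.
noise = np.random.default_rng(0).normal(size=len(poses))
err = 0.01 * poses[:, 2] + 0.0005 * np.abs(poses[:, 3]) + 0.0001 * noise

# Correlation of the pose error with each degree of freedom.
corrs = {name: np.corrcoef(poses[:, i], err)[0, 1]
         for i, name in enumerate(labels)}
for name in labels:
    print(f"corr(err, {name}) = {corrs[name]:+.2f}")
```

In the real pipeline the error comes from running a marker detector on the rendered image; here the scatter of correlations immediately flags which degrees of freedom drive the error.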
https://arxiv.org/abs/2601.07723
6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the needs of ever-changing object sets in such contexts, modern zero-shot object pose estimators were developed that require no object-specific training and rely only on CAD models. Such models are hard to obtain once the system is deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of the database models and annotates them with descriptive captions using an image-captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the region of interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments, OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability in automating object-model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation when the exact instance is not available and show that OSCAR achieves an average precision of 90.48% during object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be used for pose estimation with Megapose, achieving better results than a reconstruction-based approach.
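The two-stage retrieval above reduces, at its core, to cosine-similarity ranking in two embedding spaces. The sketch below uses randomly generated vectors as stand-ins for the CLIP caption embeddings and DINOv2 rendering embeddings (model 42 is made the correct match by construction); only the filter-then-refine structure is taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical precomputed database embeddings (unit-normalized).
n_models = 100
text_db = normalize(rng.normal(size=(n_models, 512)))  # CLIP-role caption vectors
img_db = normalize(rng.normal(size=(n_models, 768)))   # DINOv2-role rendering vectors

# Query embeddings: perturbations of model 42's vectors, so 42 is the
# ground-truth match for this synthetic example.
query_text = normalize(text_db[42] + 0.1 * rng.normal(size=512))
query_img = normalize(img_db[42] + 0.1 * rng.normal(size=768))

# Stage 1: text-based filtering keeps the top-k candidates.
k = 10
cand = np.argsort(text_db @ query_text)[-k:]

# Stage 2: image-based refinement selects the most visually similar model.
best = cand[np.argmax(img_db[cand] @ query_img)]
print(int(best))
```

The text stage cheaply prunes the database; the image stage only has to rank the surviving candidates, which is what makes the two-stage design scale to growing object sets.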
https://arxiv.org/abs/2601.07333
From Vision-Language-Action (VLA) systems to robotics, existing egocentric datasets primarily focus on action-recognition tasks while largely overlooking the role of motion analysis in sports and other fast-movement scenarios. To bridge this gap, we propose a real-time motion-focus recognition method that estimates the subject's locomotion intention from any egocentric video. Our approach leverages a foundation model for camera pose estimation and introduces system-level optimizations to enable efficient and scalable inference. Evaluated on a collected egocentric action dataset, our method achieves real-time performance with manageable memory consumption through a sliding-batch inference strategy. This work makes motion-centric analysis practical for edge deployment and offers a complementary perspective to existing egocentric studies of sports and fast-movement activities.
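The sliding-batch inference strategy mentioned above can be sketched as a bounded-memory streaming loop: only a fixed window of frames is ever held, so memory stays constant however long the video is. The window size, stride, and the stand-in "model" (which just averages frame indices) are assumptions for illustration, not the paper's actual configuration.

```python
from collections import deque

def fake_pose_model(batch):
    # Placeholder for the real camera-pose network: one prediction per window.
    return sum(batch) / len(batch)

def sliding_batch_inference(frames, window=4, stride=2):
    """Stream frames through a model needing `window` frames of context.
    A deque with maxlen bounds memory regardless of stream length."""
    buf = deque(maxlen=window)
    for i, frame in enumerate(frames):
        buf.append(frame)
        # Run the model every `stride` frames once the window is full.
        if len(buf) == window and (i - window + 1) % stride == 0:
            yield i, fake_pose_model(list(buf))

outputs = list(sliding_batch_inference(range(10), window=4, stride=2))
print(outputs)
```

The stride trades latency for throughput: stride 1 gives a prediction per frame, while a larger stride amortizes each forward pass over several frames, which is what makes real-time edge deployment feasible.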
https://arxiv.org/abs/2601.07154