The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
https://arxiv.org/abs/2503.14463
Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at this https URL.
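A minimal sketch (not the authors' implementation) of the two ideas that are easiest to make concrete: strided downsampling of the student's input sequence, and a temporal-consistency term that upsamples the student's sparse features back to the teacher's frame rate before a global MSE supervision. The stride, feature sizes, and loss form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def downsample_sequence(poses_2d: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Strided temporal downsampling of a (B, T, J, 2) 2D pose sequence."""
    return poses_2d[:, ::stride]

def temporal_consistency_loss(student_feats: torch.Tensor,
                              teacher_feats: torch.Tensor) -> torch.Tensor:
    """Upsample the student's sparse temporal features to the teacher's frame rate,
    then supervise them globally with an MSE term."""
    # student_feats: (B, T_s, C); teacher_feats: (B, T_t, C) with T_s < T_t
    upsampled = F.interpolate(student_feats.transpose(1, 2),
                              size=teacher_feats.shape[1],
                              mode="linear", align_corners=True).transpose(1, 2)
    return F.mse_loss(upsampled, teacher_feats)

# toy shapes: a dense 64-frame teacher sequence vs. a 16-frame student sequence
dense_input = torch.randn(2, 64, 17, 2)
sparse_input = downsample_sequence(dense_input)            # (2, 16, 17, 2)
loss = temporal_consistency_loss(torch.randn(2, 16, 256), torch.randn(2, 64, 256))
```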
https://arxiv.org/abs/2503.14097
Accurate transformation estimation between camera space and robot space is essential. Traditional methods using markers for hand-eye calibration require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities from partial observations and symmetry, a multi-historical key frame enhanced pose optimization algorithm is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance.
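The final step of the pipeline reduces to classical RANSAC-PnP once 2D-3D correspondences between the CAD model and the image are available; a hedged sketch with OpenCV is shown below, where the camera intrinsics and the synthetic correspondences are made up for illustration.

```python
import numpy as np
import cv2

def estimate_6d_pose(pts_3d: np.ndarray, pts_2d: np.ndarray, K: np.ndarray):
    """Recover rotation and translation from N >= 4 correspondences with RANSAC-PnP."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float32), pts_2d.astype(np.float32), K, None,
        flags=cv2.SOLVEPNP_EPNP, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)                   # axis-angle -> 3x3 rotation matrix
    return R, tvec, inliers

# toy example with a synthetic camera and ground-truth pose
K = np.array([[600., 0., 320.], [0., 600., 240.], [0., 0., 1.]])
pts_3d = np.random.rand(20, 3)
R_gt = cv2.Rodrigues(np.array([0.1, 0.2, 0.3]))[0]
t_gt = np.array([[0.], [0.], [2.]])
proj = (K @ (R_gt @ pts_3d.T + t_gt)).T
pts_2d = proj[:, :2] / proj[:, 2:3]
R, t, _ = estimate_6d_pose(pts_3d, pts_2d, K)
```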
https://arxiv.org/abs/2503.14051
Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish the correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we innovatively leverage the sphere as a shared proxy shape of objects to learn shape-independent transformations via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. Firstly, we endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformations. Secondly, a spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a comprehensive perspective, mitigating the interference of noise and incomplete point clouds. Lastly, a hyperbolic correspondence loss function is designed to distinguish subtle distinctions, promoting the precision of correspondence prediction. Experimental results on the CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of spherical representations and architectural innovations.
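As a rough illustration of the spherical-proxy idea, the sketch below places anchors on the unit sphere with a Fibonacci lattice and pools per-point features onto their nearest anchor by direction from the object centre; the paper's actual anchor construction, feature extractor, and attention mechanism are more elaborate, so treat the names and sizes here as assumptions.

```python
import numpy as np

def fibonacci_sphere(n: int) -> np.ndarray:
    """Return n near-uniform unit vectors on the sphere S^2."""
    i = np.arange(n) + 0.5
    phi = np.arccos(1.0 - 2.0 * i / n)           # polar angle
    theta = np.pi * (1.0 + 5.0 ** 0.5) * i       # golden-angle azimuth
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)

def pool_to_anchors(points: np.ndarray, feats: np.ndarray, anchors: np.ndarray):
    """Assign every observed point to its nearest spherical anchor (by direction)
    and average the features that fall into each anchor bin."""
    dirs = points / (np.linalg.norm(points, axis=1, keepdims=True) + 1e-8)
    nearest = np.argmax(dirs @ anchors.T, axis=1)          # max cosine similarity
    pooled = np.zeros((anchors.shape[0], feats.shape[1]))
    for a in range(anchors.shape[0]):
        mask = nearest == a
        if mask.any():
            pooled[a] = feats[mask].mean(axis=0)
    return pooled

anchors = fibonacci_sphere(64)
points = np.random.randn(1024, 3)            # centred object point cloud
feats = np.random.randn(1024, 32)            # per-point features
sphere_feats = pool_to_anchors(points, feats, anchors)
```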
https://arxiv.org/abs/2503.13926
We introduce STEP, a novel framework utilizing Transformer-based discriminative model prediction for simultaneous tracking and estimation of pose across diverse animal species and humans. We are inspired by the fact that the human brain exploits spatiotemporal continuity and performs concurrent localization and pose estimation despite the specialization of brain areas for form and motion processing. Traditional discriminative models typically require predefined target states for determining model weights, a challenge we address through Gaussian Map Soft Prediction (GMSP) and Offset Map Regression Adapter (OMRA) Modules. These modules remove the necessity of keypoint target states as input, streamlining the process. Our method starts with a known target state initialized through a pre-trained detector or manual initialization in the initial frame of a given video sequence. It then seamlessly tracks the target and estimates keypoints of anatomical importance as output for subsequent frames. Unlike prevalent top-down pose estimation methods, our approach doesn't rely on per-frame target detections due to its tracking capability. This facilitates a significant advancement in inference efficiency and potential applications. We train and validate our approach on datasets encompassing diverse species. Our experiments demonstrate superior results compared to existing methods, opening doors to various applications, including but not limited to action recognition and behavioral analysis.
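For intuition, here is a minimal decoder under the assumption that GMSP yields one Gaussian heatmap per keypoint and OMRA yields a two-channel sub-pixel offset map; the real modules are learned jointly with the tracker, so this is only a sketch of the read-out step.

```python
import numpy as np

def decode_keypoints(heatmaps: np.ndarray, offsets: np.ndarray, stride: int = 4):
    """heatmaps: (K, H, W) scores; offsets: (K, 2, H, W) x/y refinements.
    Returns (K, 2) keypoint coordinates in input-image pixels."""
    K, H, W = heatmaps.shape
    coords = np.zeros((K, 2), dtype=np.float32)
    for k in range(K):
        idx = np.argmax(heatmaps[k])
        y, x = divmod(idx, W)                       # integer peak location
        dx, dy = offsets[k, 0, y, x], offsets[k, 1, y, x]
        coords[k] = ((x + dx) * stride, (y + dy) * stride)
    return coords

heatmaps = np.random.rand(17, 64, 48)
offsets = np.random.randn(17, 2, 64, 48) * 0.5
keypoints = decode_keypoints(heatmaps, offsets)
```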
https://arxiv.org/abs/2503.13344
Estimating the 3D pose of the hand and a potential hand-held object from monocular images is a longstanding challenge. Yet existing methods are specialized, focusing either on bare hands or on hands interacting with objects; no method can flexibly handle both scenarios, and performance degrades when a model is applied to the other scenario. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation that flexibly adapts to both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features, with an object switcher to dynamically control the hand-object pose estimation according to grasping status. Further, to improve the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to learn object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly used benchmarks demonstrate UniHOPE's state-of-the-art performance in both hand-only and hand-object scenarios. Code will be released on this https URL.
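A toy sketch of what a grasp-aware fusion with an object switcher could look like: a learned gate predicts a grasping probability and softly suppresses object features when no grasp is detected. Module names, layer sizes, and the gating form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GraspAwareFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # "object switcher": maps joint features to a grasping probability
        self.switcher = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 1), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, hand_feat: torch.Tensor, obj_feat: torch.Tensor):
        joint = torch.cat([hand_feat, obj_feat], dim=-1)
        p_grasp = self.switcher(joint)               # grasping probability in [0, 1]
        gated_obj = p_grasp * obj_feat               # suppress object cues when no grasp
        fused = self.fuse(torch.cat([hand_feat, gated_obj], dim=-1))
        return fused, p_grasp

module = GraspAwareFusion()
fused, p = module(torch.randn(8, 256), torch.randn(8, 256))
```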
https://arxiv.org/abs/2503.13303
Compact and efficient 6DoF object pose estimation is crucial in applications such as robotics, augmented reality, and space autonomous navigation systems, where lightweight models are critical for real-time accurate performance. This paper introduces a novel uncertainty-aware end-to-end Knowledge Distillation (KD) framework focused on keypoint-based 6DoF pose estimation. Keypoints predicted by a large teacher model exhibit varying levels of uncertainty that can be exploited within the distillation process to enhance the accuracy of the student model while ensuring its compactness. To this end, we propose a distillation strategy that aligns the student and teacher predictions by adjusting the knowledge transfer based on the uncertainty associated with each teacher keypoint prediction. Additionally, the proposed KD leverages this uncertainty-aware alignment of keypoints to transfer the knowledge at key locations of their respective feature maps. Experiments on the widely-used LINEMOD benchmark demonstrate the effectiveness of our method, achieving superior 6DoF object pose estimation with lightweight models compared to state-of-the-art approaches. Further validation on the SPEED+ dataset for spacecraft pose estimation highlights the robustness of our approach under diverse 6DoF pose estimation scenarios.
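A hedged sketch of the core idea: weight the per-keypoint distillation term by the inverse of the teacher's predicted uncertainty, so confident teacher keypoints dominate the transfer. The exact weighting and the feature-level alignment used in the paper may differ.

```python
import torch

def uncertainty_weighted_kd(student_kpts, teacher_kpts, teacher_sigma):
    """student_kpts, teacher_kpts: (B, K, 2); teacher_sigma: (B, K) per-keypoint std."""
    w = 1.0 / (teacher_sigma ** 2 + 1e-6)                    # inverse-variance weights
    w = w / w.sum(dim=1, keepdim=True)                       # normalise per sample
    err = ((student_kpts - teacher_kpts) ** 2).sum(dim=-1)   # (B, K) squared error
    return (w * err).sum(dim=1).mean()

loss = uncertainty_weighted_kd(torch.randn(4, 9, 2),
                               torch.randn(4, 9, 2),
                               torch.rand(4, 9) + 0.1)
```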
https://arxiv.org/abs/2503.13053
The increasing frequency of firearm-related incidents has necessitated advancements in security and surveillance systems, particularly in firearm detection within public spaces. Traditional gun detection methods rely on manual inspections and continuous human monitoring of CCTV footage, which are labor-intensive and prone to high false positive and negative rates. To address these limitations, we propose a novel approach that integrates human pose estimation with weapon appearance recognition using deep learning techniques. Unlike prior studies that focus on either body pose estimation or firearm detection in isolation, our method jointly analyzes posture and weapon presence to enhance detection accuracy in real-world, dynamic environments. To train our model, we curated a diverse dataset comprising images from open-source repositories such as IMFDB and Monash Guns, supplemented with AI-generated and manually collected images from web sources. This dataset ensures robust generalization and realistic performance evaluation under various surveillance conditions. Our research aims to improve the precision and reliability of firearm detection systems, contributing to enhanced public safety and threat mitigation in high-risk areas.
https://arxiv.org/abs/2503.12215
Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at this https URL.
https://arxiv.org/abs/2503.12049
Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is the optimal and the only option for some tasks, such as hand tracking, it remains unclear if the same holds for full-body tracking due to self-occlusion and limited field-of-view coverage. Notably, even the state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, such as when HMD users tilt their heads upward (a common motion in human activities). A key limitation of existing HMD designs is their neglect of the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras in the HMD design for full-body tracking. We also show that simply adding rear views to the frontal inputs is not optimal for existing methods due to their dependence on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. Moreover, we introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for a rear-view evaluation. Our experiments show that the new camera configurations with back views provide superior support for 3D pose tracking compared to only frontal placements. The proposed method achieves significant improvement over the current state of the art (>10% on MPJPE). We will release the source code, trained models, and new datasets on our project page this https URL.
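One concrete ingredient is turning a joint heatmap into a 2D estimate plus an uncertainty value; a simple soft-argmax with spatial variance is sketched below as a stand-in for the quantities that would condition the multi-view refinement transformer (the paper's actual uncertainty model may differ).

```python
import torch

def soft_argmax_with_uncertainty(heatmap: torch.Tensor):
    """heatmap: (H, W) unnormalised scores -> (mean_xy, scalar variance)."""
    H, W = heatmap.shape
    prob = torch.softmax(heatmap.flatten(), dim=0).reshape(H, W)
    ys = torch.arange(H, dtype=torch.float32)
    xs = torch.arange(W, dtype=torch.float32)
    mean_y = (prob.sum(dim=1) * ys).sum()                 # expectation over rows
    mean_x = (prob.sum(dim=0) * xs).sum()                 # expectation over columns
    var_y = (prob.sum(dim=1) * (ys - mean_y) ** 2).sum()
    var_x = (prob.sum(dim=0) * (xs - mean_x) ** 2).sum()
    return torch.stack([mean_x, mean_y]), var_x + var_y   # larger variance = less certain

xy, uncertainty = soft_argmax_with_uncertainty(torch.randn(64, 64))
```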
https://arxiv.org/abs/2503.11652
Online test-time adaptation for 3D human pose estimation adapts models to video streams that differ from the training data. Prior work relies on ground-truth 2D poses for adaptation, but only estimated 2D poses are available in practice. This paper addresses adapting models to streaming videos using estimated 2D poses. Comparing adaptation strategies reveals the key challenge: limiting the impact of estimation errors while preserving accurate pose information. To this end, we propose adaptive aggregation, a two-stage optimization, and local augmentation for handling varying levels of estimated pose error. First, we perform adaptive aggregation across videos to initialize the model state with labeled representative samples. Within each video, we use a two-stage optimization to benefit from 2D fitting while minimizing the impact of erroneous updates. Second, we employ local augmentation, using adjacent confident samples to update the model before adapting to the current non-confident sample. Our method surpasses the state of the art by a large margin, advancing adaptation toward the more practical setting of using estimated 2D poses.
https://arxiv.org/abs/2503.11194
Accurate robot localization is essential for effective operation. Monte Carlo Localization (MCL) is commonly used with known maps but is computationally expensive due to landmark matching for each particle. Humanoid robots face additional challenges, including sensor noise from locomotion vibrations and a limited field of view (FOV) due to camera placement. This paper proposes a fast and robust localization method via iterative landmark matching (ILM) for humanoid robots. The iterative matching process improves the accuracy of the landmark association, so that MCL is no longer needed to match landmarks to particles. Pose estimation with an outlier-removal step enhances robustness to measurement noise and faulty detections. Furthermore, an additional filter can be utilized to fuse inertial data from the inertial measurement unit (IMU) with pose data from localization. We compared ILM with the Iterative Closest Point (ICP) algorithm, showing that ILM is more robust to errors in the initial guess and obtains a correct matching more easily. We also compared ILM with Augmented Monte Carlo Localization (aMCL), showing that ILM is much faster than aMCL and even more accurate. The proposed method's effectiveness is thoroughly evaluated through experiments and validated on the humanoid robot ARTEMIS during the RoboCup 2024 adult-sized soccer competition.
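A compact sketch of the iterative-landmark-matching loop: alternate between nearest-landmark association and a closed-form 2D rigid fit (Kabsch), dropping associations whose residual exceeds a threshold. Thresholds, iteration counts, and the toy field layout are illustrative assumptions.

```python
import numpy as np

def solve_rigid_2d(src: np.ndarray, dst: np.ndarray):
    """Least-squares 2D rotation + translation mapping src points onto dst (Kabsch)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def ilm(detections, map_landmarks, R, t, iters=10, outlier_dist=0.8):
    """Refine pose (R, t) so that R @ detection + t matches map landmarks."""
    for _ in range(iters):
        world = detections @ R.T + t                          # detections in world frame
        d = np.linalg.norm(world[:, None] - map_landmarks[None], axis=2)
        nearest = d.argmin(axis=1)                            # nearest-landmark association
        keep = d[np.arange(len(detections)), nearest] < outlier_dist
        if keep.sum() < 2:
            break
        R, t = solve_rigid_2d(detections[keep], map_landmarks[nearest[keep]])
    return R, t

# toy field landmarks, a simulated true pose, and a rough initial guess to refine
landmarks = np.array([[0., 0.], [4.5, 0.], [0., 3.], [4.5, 3.], [9., 1.5]])
R_true = np.array([[np.cos(0.3), -np.sin(0.3)], [np.sin(0.3), np.cos(0.3)]])
detections = (landmarks - np.array([2.0, 1.0])) @ R_true      # landmarks in robot frame
R_init = np.array([[np.cos(0.2), -np.sin(0.2)], [np.sin(0.2), np.cos(0.2)]])
R_hat, t_hat = ilm(detections, landmarks, R_init, np.array([1.7, 0.8]))
```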
https://arxiv.org/abs/2503.11020
Clothes-Changing Person Re-Identification (ReID) aims to recognize the same individual across different videos captured at various times and locations. This task is particularly challenging due to changes in appearance, such as clothing, hairstyle, and accessories. We propose a Clothes-Changing ReID method that uses only skeleton data and does not use appearance features. Traditional ReID methods often depend on appearance features, leading to decreased accuracy when clothing changes. Our approach utilizes a spatio-temporal Graph Convolution Network (GCN) encoder to generate a skeleton-based descriptor for each individual. During testing, we improve accuracy by aggregating predictions from multiple segments of a video clip. Evaluated on the CCVID dataset with several different pose estimation models, our method achieves state-of-the-art performance, offering a robust and efficient solution for Clothes-Changing ReID.
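A sketch of the test-time aggregation step, assuming a trained spatio-temporal GCN encoder is available as a black box: split the clip's skeleton sequence into fixed-length segments, embed each, and average the descriptors before cosine-similarity matching. The segment length and the dummy encoder below are placeholders.

```python
import torch
import torch.nn.functional as F

def aggregate_clip_descriptor(encoder, skeletons: torch.Tensor, seg_len: int = 32):
    """skeletons: (T, J, C) joint coordinates for one clip -> one L2-normalised descriptor."""
    segments = skeletons.split(seg_len, dim=0)
    descs = [encoder(seg.unsqueeze(0)) for seg in segments if seg.shape[0] == seg_len]
    return F.normalize(torch.cat(descs, dim=0).mean(dim=0, keepdim=True), dim=1)

def match(query_desc: torch.Tensor, gallery_descs: torch.Tensor):
    """Cosine-similarity ranking of the query against gallery identities."""
    return (query_desc @ gallery_descs.T).argsort(dim=1, descending=True)

# toy stand-in for the trained spatio-temporal GCN encoder
encoder = lambda seg: seg.mean(dim=(1, 2))                # (1, T, J, C) -> (1, C)
query = aggregate_clip_descriptor(encoder, torch.randn(100, 17, 3))
gallery = F.normalize(torch.randn(50, 3), dim=1)
ranking = match(query, gallery)
```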
https://arxiv.org/abs/2503.10759
Over the past decade, studying animal behaviour with the help of computer vision has become more popular. Replacing human observers with computer vision lowers the cost of data collection and therefore allows more extensive datasets to be collected. However, the majority of available computer vision algorithms for studying animal behaviour are highly tailored towards a single research objective, limiting possibilities for data reuse. In this perspective, pose estimation in combination with animal tracking offers the opportunity to yield a higher-level representation capturing both the spatial and temporal components of animal behaviour. Such a higher-level representation makes it possible to answer a wide variety of research questions simultaneously, without the need to repeatedly develop tailored computer vision algorithms. In this paper, we therefore first address several weaknesses of current pose-estimation algorithms and thereafter introduce KeySORT (Keypoint Simple and Online Realtime Tracking). KeySORT deploys an adaptive Kalman filter to construct tracklets in a bounding-box-free manner, significantly improving the temporal consistency of detected keypoints. We focus on pose estimation in cattle, but our methodology can easily be generalised to other animal species. Our test results indicate that our algorithm is able to detect up to 80% of the ground-truth keypoints with high accuracy, with only a limited drop in performance when daylight recordings are compared to night-vision recordings. Moreover, by using KeySORT to construct skeletons, the temporal consistency of the generated keypoint coordinates is largely improved, offering opportunities for automated behaviour monitoring of animals.
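The tracklet construction rests on per-keypoint Kalman filtering; below is a plain constant-velocity filter for a single keypoint as a reference point. KeySORT's adaptive variant additionally adjusts the noise covariances online, which is omitted here.

```python
import numpy as np

class KeypointKalman:
    """Constant-velocity Kalman filter for one 2D keypoint."""
    def __init__(self, xy, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])           # state: x, y, vx, vy
        self.P = np.eye(4)
        self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                           [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
        self.Q = q * np.eye(4)                                 # process noise
        self.R = r * np.eye(2)                                 # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.x       # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

kf = KeypointKalman([120.0, 80.0])
for z in [[121.5, 80.4], [123.2, 81.1], [124.8, 81.9]]:       # detections over frames
    kf.predict()
    kf.update(z)
```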
https://arxiv.org/abs/2503.10450
We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, subtle but dynamic object motions, and the fact that the exact mesh of the manipulated object is not known. To address these challenges, we present the following contributions. First, we develop a new method that estimates the 6D pose of any object in the input image without prior knowledge of the object itself. The method proceeds by (i) retrieving a CAD model similar to the depicted object from a large-scale model database, (ii) 6D aligning the retrieved CAD model with the input image, and (iii) grounding the absolute scale of the object with respect to the scene. Second, we extract smooth 6D object trajectories from Internet videos by carefully tracking the detected objects across video frames. The extracted object trajectories are then retargeted via trajectory optimization into the configuration space of a robotic manipulator. Third, we thoroughly evaluate and ablate our 6D pose estimation method on YCB-V and HOPE-Video datasets as well as a new dataset of instructional videos manually annotated with approximate 6D object trajectories. We demonstrate significant improvements over existing state-of-the-art RGB 6D pose estimation methods. Finally, we show that the 6D object motion estimated from Internet videos can be transferred to a 7-axis robotic manipulator both in a virtual simulator as well as in a real world set-up. We also successfully apply our method to egocentric videos taken from the EPIC-KITCHENS dataset, demonstrating potential for Embodied AI applications.
https://arxiv.org/abs/2503.10307
We present VicaSplat, a novel framework for joint 3D Gaussians reconstruction and camera pose estimation from a sequence of unposed video frames, which is a critical yet underexplored task in real-world 3D applications. The core of our method lies in a novel transformer-based network architecture. In particular, our model starts with an image encoder that maps each image to a list of visual tokens. All visual tokens are concatenated with additional inserted learnable camera tokens. The obtained tokens then fully communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from visual tokens of different views, and further modulate them frame-wisely to inject view-dependent features. 3D Gaussian splats and camera pose parameters can then be estimated via different prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs, and achieves comparable performance to prior two-view approaches. Remarkably, VicaSplat also demonstrates exceptional cross-dataset generalization capability on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: this https URL.
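An architectural sketch of the token layout: learnable camera tokens are concatenated with the per-frame visual tokens, processed jointly by a transformer, and read out by separate pose and Gaussian heads. The use of a vanilla encoder layer, the head output sizes, and the fixed frame count are simplifications rather than the actual decoder design.

```python
import torch
import torch.nn as nn

class CameraTokenTransformer(nn.Module):
    def __init__(self, dim=256, n_frames=8, n_layers=4):
        super().__init__()
        self.camera_tokens = nn.Parameter(torch.randn(n_frames, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.pose_head = nn.Linear(dim, 7)       # quaternion + translation per frame
        self.gauss_head = nn.Linear(dim, 14)     # e.g. mean/scale/rotation/opacity/colour

    def forward(self, visual_tokens):            # visual_tokens: (B, F, N, dim)
        B, F, N, D = visual_tokens.shape
        cam = self.camera_tokens.unsqueeze(0).expand(B, -1, -1)         # (B, F, dim)
        tokens = torch.cat([cam, visual_tokens.reshape(B, F * N, D)], dim=1)
        tokens = self.backbone(tokens)           # camera and visual tokens communicate
        poses = self.pose_head(tokens[:, :F])                           # (B, F, 7)
        gaussians = self.gauss_head(tokens[:, F:]).reshape(B, F, N, -1)
        return gaussians, poses

model = CameraTokenTransformer()
g, p = model(torch.randn(2, 8, 196, 256))
```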
https://arxiv.org/abs/2503.10286
Human pose estimation (HPE) detects the positions of human body joints for various applications. Compared to using cameras, HPE using radio frequency (RF) signals is non-intrusive and more robust to adverse conditions, exploiting the signal variations caused by human interference. However, existing studies focus on single-domain HPE confined by domain-specific confounders, which cannot generalize to new domains and result in diminished HPE performance. Specifically, the signal variations caused by different human body parts are entangled, containing subject-specific confounders. RF signals are also intertwined with environmental noise, involving environment-specific confounders. In this paper, we propose GenHPE, a 3D HPE approach that generates counterfactual RF signals to eliminate domain-specific confounders. GenHPE trains generative models conditioned on human skeleton labels, learning how human body parts and confounders interfere with RF signals. We manipulate skeleton labels (i.e., removing body parts) as counterfactual conditions for generative models to synthesize counterfactual RF signals. The differences between counterfactual signals approximately eliminate domain-specific confounders and regularize an encoder-decoder model to learn domain-independent representations. Such representations help GenHPE generalize to new subjects/environments for cross-domain 3D HPE. We evaluate GenHPE on three public datasets from WiFi, ultra-wideband, and millimeter wave. Experimental results show that GenHPE outperforms state-of-the-art methods and reduces estimation errors by up to 52.2mm for cross-subject HPE and 10.6mm for cross-environment HPE.
https://arxiv.org/abs/2503.09537
This paper presents a robust monocular visual SLAM system that simultaneously utilizes point, line, and vanishing point features for accurate camera pose estimation and mapping. To address the critical challenge of achieving reliable localization in low-texture environments, where traditional point-based systems often fail due to insufficient visual features, we introduce a novel approach leveraging Global Primitives structural information to improve the system's robustness and accuracy performance. Our key innovation lies in constructing vanishing points from line features and proposing a weighted fusion strategy to build Global Primitives in the world coordinate system. This strategy associates multiple frames with non-overlapping regions and formulates a multi-frame reprojection error optimization, significantly improving tracking accuracy in texture-scarce scenarios. Evaluations on various datasets show that our system outperforms state-of-the-art methods in trajectory precision, particularly in challenging environments.
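One geometric building block is cheap to illustrate: in homogeneous image coordinates, a line through two points is their cross product, and a vanishing point is the (least-squares) intersection of the image lines of parallel 3D edges. The weighted multi-frame fusion of such Global Primitives is not shown here.

```python
import numpy as np

def line_through(p1, p2):
    """Homogeneous line through two image points (x, y)."""
    return np.cross([p1[0], p1[1], 1.0], [p2[0], p2[1], 1.0])

def vanishing_point(lines):
    """Least-squares intersection of several homogeneous lines (null space of the stack)."""
    A = np.stack(lines)
    _, _, Vt = np.linalg.svd(A)
    vp = Vt[-1]
    return vp[:2] / vp[2] if abs(vp[2]) > 1e-9 else vp[:2]   # point at infinity otherwise

# two image segments belonging to (near-)parallel 3D edges
l1 = line_through((100, 400), (300, 300))
l2 = line_through((150, 500), (380, 380))
vp = vanishing_point([l1, l2])
```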
https://arxiv.org/abs/2503.09296
We present Better Together, a method that simultaneously solves the human pose estimation problem while reconstructing a photorealistic 3D human avatar from multi-view videos. While prior art usually solves these problems separately, we argue that joint optimization of skeletal motion with a 3D renderable body model brings synergistic effects, i.e. yields more precise motion capture and improved visual quality of real-time rendering of avatars. To achieve this, we introduce a novel animatable avatar with 3D Gaussians rigged on a personalized mesh and propose to optimize the motion sequence with time-dependent MLPs that provide accurate and temporally consistent pose estimates. We first evaluate our method on highly challenging yoga poses and demonstrate state-of-the-art accuracy on multi-view human pose estimation, reducing error by 35% on body joints and 45% on hand joints compared to keypoint-based methods. At the same time, our method significantly boosts the visual quality of animatable avatars (+2dB PSNR on novel view synthesis) on diverse challenging subjects.
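A sketch of a time-dependent MLP that maps a positional encoding of normalised time to per-frame pose parameters; the output size (SMPL-style joint angles plus root translation) and the encoding width are assumptions about one plausible instantiation.

```python
import torch
import torch.nn as nn

class TimePoseMLP(nn.Module):
    def __init__(self, n_freqs=8, n_params=72 + 3):
        super().__init__()
        # sinusoidal frequencies for encoding normalised time t in [0, 1]
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * torch.pi)
        self.mlp = nn.Sequential(nn.Linear(2 * n_freqs, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, n_params))

    def forward(self, t: torch.Tensor):           # t: (B, 1) normalised timestamps
        enc = torch.cat([torch.sin(t * self.freqs), torch.cos(t * self.freqs)], dim=-1)
        return self.mlp(enc)                      # per-frame pose parameters

model = TimePoseMLP()
poses = model(torch.linspace(0, 1, 30).unsqueeze(1))   # a 30-frame motion sequence
```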
https://arxiv.org/abs/2503.09293
Rendering realistic human-object interactions (HOIs) from sparse-view inputs is challenging due to occlusions and incomplete observations, yet crucial for various real-world applications. Existing methods struggle with either low rendering quality (e.g., limited visual fidelity and physically implausible HOIs) or high computational costs. To address these limitations, we propose HOGS (Human-Object Rendering via 3D Gaussian Splatting), a novel framework for efficient and physically plausible HOI rendering from sparse views. Specifically, HOGS combines 3D Gaussian Splatting with a physics-aware optimization process. It incorporates a Human Pose Refinement module for accurate pose estimation and a Sparse-View Human-Object Contact Prediction module for efficient contact region identification. This combination enables coherent joint rendering of human and object Gaussians while enforcing physically plausible interactions. Extensive experiments on the HODome dataset demonstrate that HOGS achieves superior rendering quality, efficiency, and physical plausibility compared to existing methods. We further show its extensibility to hand-object grasp rendering tasks, presenting its broader applicability to articulated object interactions.
https://arxiv.org/abs/2503.09640