Conventional 2D human pose estimation methods typically require extensive labeled annotations, which are both labor-intensive and expensive. In contrast, semi-supervised 2D human pose estimation can alleviate these problems by leveraging a large amount of unlabeled data along with a small amount of labeled data. Existing semi-supervised 2D human pose estimation methods update the network through backpropagation while ignoring crucial historical information from earlier stages of training. Therefore, we propose a novel semi-supervised 2D human pose estimation method built on a newly designed Teacher-Reviewer-Student framework. Specifically, we first design the framework to mimic how humans repeatedly review previous knowledge to consolidate it: the teacher predicts results to guide the student's learning, while the reviewer stores important historical parameters to provide additional supervision signals. Second, we introduce a Multi-level Feature Learning strategy, which uses the outputs from different stages of the backbone to estimate heatmaps that guide network training, enriching the supervisory information while effectively capturing keypoint relationships. Finally, we design a data augmentation strategy, Keypoint-Mix, which perturbs pose information by mixing different keypoints, thereby enhancing the network's ability to discern keypoints. Extensive experiments on publicly available datasets demonstrate that our method achieves significant improvements over existing methods.
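To make the Teacher-Reviewer-Student idea more concrete, here is a minimal PyTorch-style sketch of one way such a scheme could be wired: the teacher is an EMA copy of the student, and the reviewer keeps periodic snapshots of historical teacher weights that supply an extra supervision signal on unlabeled images. The momentum, snapshot interval, loss weighting, and module names are illustrative assumptions, not the paper's actual implementation.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.999):
    """Exponential-moving-average update of teacher weights from the student."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

class Reviewer:
    """Stores periodic snapshots of historical teacher parameters."""
    def __init__(self, max_snapshots=3):
        self.snapshots = []
        self.max_snapshots = max_snapshots

    def maybe_store(self, teacher, step, interval=1000):
        if step % interval == 0:
            self.snapshots.append(copy.deepcopy(teacher).eval())
            self.snapshots = self.snapshots[-self.max_snapshots:]

    def supervision(self, image):
        """Average heatmap prediction of the stored historical models."""
        with torch.no_grad():
            preds = [m(image) for m in self.snapshots]
        return torch.stack(preds).mean(dim=0) if preds else None

def unlabeled_loss(student, teacher, reviewer, image_weak, image_strong, step):
    """Consistency with the teacher plus an extra reviewer supervision signal."""
    with torch.no_grad():
        target = teacher(image_weak)              # teacher pseudo heatmaps
    loss = F.mse_loss(student(image_strong), target)
    hist = reviewer.supervision(image_weak)       # historical supervision
    if hist is not None:
        loss = loss + 0.5 * F.mse_loss(student(image_strong), hist)
    reviewer.maybe_store(teacher, step)
    return loss

# Usage sketch with a dummy heatmap head (assumed shapes: B x K x H x W).
student = torch.nn.Conv2d(3, 17, 1)
teacher = copy.deepcopy(student)
reviewer = Reviewer()
x = torch.randn(2, 3, 64, 48)
loss = unlabeled_loss(student, teacher, reviewer, x, x, step=0)
ema_update(teacher, student)
```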
https://arxiv.org/abs/2501.09565
Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) a cross-domain gap, i.e., significant variations between source- and target-domain pose distributions; and 2) a structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task as a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolution Network (GCN) and Transformer layers to constrain the topology of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D and 3D human pose estimation tasks.
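As a rough illustration of the representation-learning phase, the sketch below pairs an InfoNCE-style temporal-consistency loss with the standard uniformity regularizer on the hypersphere; the concrete losses, temperatures, and weights used by DT-Pose may differ, and `z_a`/`z_b` here are simply assumed embeddings of two temporally adjacent WiFi windows.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_a, z_b, temperature=0.1):
    """InfoNCE between two views (e.g., embeddings of temporally adjacent
    WiFi windows); positives sit on the diagonal of the similarity matrix."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)

def uniformity_loss(z, t=2.0):
    """Encourages embeddings to spread uniformly on the hypersphere
    (log of the mean Gaussian potential over pairwise distances)."""
    z = F.normalize(z, dim=-1)
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

# Example: 16 window embeddings of dimension 128 from two time-shifted views.
z_a, z_b = torch.randn(16, 128), torch.randn(16, 128)
loss = temporal_contrastive_loss(z_a, z_b) + 0.1 * uniformity_loss(torch.cat([z_a, z_b]))
```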
https://arxiv.org/abs/2501.09411
As robotic technology rapidly develops, robots are being employed in an increasing number of fields. However, due to the complexity of deployment environments or the prevalence of ambiguous-condition objects, the practical application of robotics still faces many challenges, leading to frequent errors. Traditional methods and some LLM-based approaches, although improved, still require substantial human intervention and struggle with autonomous error correction in complex environments. In this work, we propose RoboReflect, a novel framework leveraging large vision-language models (LVLMs) to enable self-reflection and autonomous error correction in robotic grasping tasks. RoboReflect allows robots to automatically adjust their strategies based on unsuccessful attempts until successful execution is achieved. The corrected strategies are saved in a memory for future task use. We evaluate RoboReflect through extensive testing on eight common objects prone to ambiguous conditions of three categories. The results demonstrate that RoboReflect not only outperforms existing grasp pose estimation methods such as AnyGrasp and high-level action planning techniques using GPT-4V but also significantly enhances the robot's ability to adapt and correct errors independently. These findings underscore the critical importance of autonomous self-reflection in robotic systems while effectively addressing the challenges posed by ambiguous environments.
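A heavily simplified sketch of the reflect-retry-memorize loop described above; `execute_grasp`, `query_lvlm`, and the memory keys are placeholder stand-ins, since the paper's actual prompts, grasping interface, and memory format are not reproduced here.

```python
def reflect_and_grasp(obj_name, execute_grasp, query_lvlm, memory, max_attempts=5):
    """Retry grasping, asking an LVLM to revise the strategy after each failure.

    execute_grasp(strategy) -> (success: bool, feedback: str) and
    query_lvlm(prompt) -> str are injected callables (placeholders here).
    """
    strategy = memory.get(obj_name) or query_lvlm(
        f"Propose a grasp strategy for the object: {obj_name}.")
    for attempt in range(max_attempts):
        success, feedback = execute_grasp(strategy)
        if success:
            memory[obj_name] = strategy            # store the working strategy
            return strategy
        # Self-reflection: feed the failure observation back to the LVLM.
        strategy = query_lvlm(
            f"The strategy '{strategy}' failed on {obj_name} "
            f"(attempt {attempt + 1}): {feedback}. Propose a revised strategy.")
    return None

# Toy stand-ins for the robot and the LVLM, just to make the loop runnable.
memory = {}
attempts = iter([(False, "slipped on the transparent surface"), (True, "ok")])
print(reflect_and_grasp("glass cup",
                        execute_grasp=lambda s: next(attempts),
                        query_lvlm=lambda p: "approach from the rim, close slowly",
                        memory=memory))
```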
https://arxiv.org/abs/2501.09307
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on Transformer architecture, which not only performs front-end visual feature extraction, but also incorporates a multi-modality refinement module in the back-end that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 259% in low-light conditions, outperforming existing methods. For widespread use and further development, the research work is fully open-source at this https URL.
https://arxiv.org/abs/2501.08659
Human pose estimation, a vital task in computer vision, involves detecting and localising human joints in images and videos. While single-frame pose estimation has seen significant progress, it often fails to capture the temporal dynamics needed to understand complex, continuous movements. To address these limitations, we propose Poseidon, a novel multi-frame pose estimation architecture that extends the ViTPose model by integrating temporal information for enhanced accuracy and robustness. Poseidon introduces key innovations: (1) an Adaptive Frame Weighting (AFW) mechanism that dynamically prioritises frames based on their relevance, ensuring that the model focuses on the most informative data; (2) a Multi-Scale Feature Fusion (MSFF) module that aggregates features from different backbone layers to capture both fine-grained details and high-level semantics; and (3) a Cross-Attention module for effective information exchange between central and contextual frames, enhancing the model's temporal coherence. The proposed architecture improves performance in complex video scenarios and offers the scalability and computational efficiency required for real-world applications. Our approach achieves state-of-the-art performance on the PoseTrack21 and PoseTrack18 datasets, achieving mAP scores of 88.3 and 87.8, respectively, outperforming existing methods.
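The Adaptive Frame Weighting idea can be pictured as a small module that scores each frame's pooled features and forms a weighted combination across time; the minimal PyTorch sketch below is an assumed form for illustration and is presumably much simpler than Poseidon's actual AFW module.

```python
import torch
import torch.nn as nn

class AdaptiveFrameWeighting(nn.Module):
    """Scores each frame's features and returns their weighted combination."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                    nn.Linear(dim // 2, 1))

    def forward(self, frame_feats):              # (B, T, C, H, W)
        pooled = frame_feats.mean(dim=(-2, -1))  # (B, T, C) global average pool
        weights = self.scorer(pooled).softmax(dim=1)      # (B, T, 1) frame weights
        weights = weights.unsqueeze(-1).unsqueeze(-1)     # (B, T, 1, 1, 1)
        return (weights * frame_feats).sum(dim=1), weights

# Example: 5-frame clip, 256-channel backbone features at 64x48 resolution.
afw = AdaptiveFrameWeighting(dim=256)
fused, w = afw(torch.randn(2, 5, 256, 64, 48))
print(fused.shape, w.squeeze().shape)            # (2, 256, 64, 48), (2, 5)
```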
https://arxiv.org/abs/2501.08446
RGB-based 3D pose estimation methods have been successful with the development of deep learning and the emergence of high-quality 3D pose datasets. However, most existing methods do not operate well on testing images whose distribution is far from that of the training data. This problem might be alleviated by involving diverse data during training; however, it is non-trivial to collect such diverse data with corresponding labels (i.e., 3D poses). In this paper, we introduce an unsupervised domain adaptation framework for 3D pose estimation that utilizes unlabeled data in addition to labeled data via a masked image modeling (MIM) framework. Foreground-centric reconstruction and attention regularization are further proposed to increase the effectiveness of unlabeled data usage. Experiments are conducted on various datasets for human and hand pose estimation tasks, especially under cross-domain scenarios. We demonstrate the effectiveness of our approach by achieving state-of-the-art accuracy on all datasets.
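A minimal sketch of what a foreground-centric masked-reconstruction loss could look like: the error is computed only on masked patches, and patches overlapping the person are up-weighted. The patching scheme, the foreground-weight factor, and the source of the foreground mask are illustrative assumptions rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def foreground_masked_recon_loss(pred, target, patch_mask, fg_mask, fg_weight=4.0):
    """pred/target: (B, N, D) per-patch reconstructions and ground truth;
    patch_mask: (B, N) 1 where the patch was masked out for the encoder;
    fg_mask:    (B, N) 1 where the patch overlaps the person (foreground)."""
    per_patch = F.mse_loss(pred, target, reduction="none").mean(dim=-1)   # (B, N)
    weights = patch_mask * (1.0 + (fg_weight - 1.0) * fg_mask)            # up-weight fg
    return (per_patch * weights).sum() / weights.sum().clamp(min=1.0)

# Example with 196 patches (14x14 grid) of dimension 768.
pred, target = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
patch_mask = (torch.rand(2, 196) < 0.75).float()   # 75% of patches masked
fg_mask = (torch.rand(2, 196) < 0.3).float()       # e.g. from a person segmentation
print(foreground_masked_recon_loss(pred, target, patch_mask, fg_mask))
```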
https://arxiv.org/abs/2501.08408
We present HaPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space. Despite the additional temporal cues, they generally underperform compared to image-based methods due to the scarcity of annotated video data. To address these issues, we repurpose a state-of-the-art image-based transformer to take in multiple frames and directly predict a coherent trajectory. We introduce two types of lightweight attention layers: cross-view self-attention to fuse temporal information, and global cross-attention to bring in larger spatial context. Our method infers 4D hand trajectories similar to the ground truth while maintaining strong 2D reprojection alignment. We apply the method to both egocentric and allocentric videos. It significantly outperforms existing methods in global trajectory accuracy while being comparable to the state-of-the-art in single-image pose estimation. Project website: this https URL
https://arxiv.org/abs/2501.08329
While recent foundation models have enabled significant breakthroughs in monocular depth estimation, a clear path towards safe and reliable deployment in the real-world remains elusive. Metric depth estimation, which involves predicting absolute distances, poses particular challenges, as even the most advanced foundation models remain prone to critical errors. Since quantifying the uncertainty has emerged as a promising endeavor to address these limitations and enable trustworthy deployment, we fuse five different uncertainty quantification methods with the current state-of-the-art DepthAnythingV2 foundation model. To cover a wide range of metric depth domains, we evaluate their performance on four diverse datasets. Our findings identify fine-tuning with the Gaussian Negative Log-Likelihood Loss (GNLL) as a particularly promising approach, offering reliable uncertainty estimates while maintaining predictive performance and computational efficiency on par with the baseline, encompassing both training and inference time. By fusing uncertainty quantification and foundation models within the context of monocular depth estimation, this paper lays a critical foundation for future research aimed at improving not only model performance but also its explainability. Extending this critical synthesis of uncertainty quantification and foundation models into other crucial tasks, such as semantic segmentation and pose estimation, presents exciting opportunities for safer and more reliable machine vision systems.
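The GNLL fine-tuning variant highlighted above amounts to predicting a per-pixel mean and variance for depth and training with the Gaussian negative log-likelihood, which PyTorch provides directly as `nn.GaussianNLLLoss`. A minimal sketch follows; the two-channel head layout and the validity-mask convention are assumptions.

```python
import torch
import torch.nn as nn

gnll = nn.GaussianNLLLoss(eps=1e-6)

def depth_gnll_loss(head_out, gt_depth, valid_mask):
    """head_out: (B, 2, H, W) -> channel 0 = predicted depth mean,
    channel 1 = raw variance (made positive with softplus).
    gt_depth, valid_mask: (B, H, W)."""
    mean = head_out[:, 0]
    var = nn.functional.softplus(head_out[:, 1]) + 1e-6
    m = valid_mask.bool()
    # GaussianNLLLoss(input=mean, target=gt, var=variance), over valid pixels only.
    return gnll(mean[m], gt_depth[m], var[m])

# Example on random tensors standing in for a batch of depth maps.
out = torch.randn(2, 2, 32, 32)
gt = torch.rand(2, 32, 32) * 10.0
mask = torch.ones(2, 32, 32)
print(depth_gnll_loss(out, gt, mask))
```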
https://arxiv.org/abs/2501.08188
Pose distillation is widely adopted to reduce model size in human pose estimation. However, existing methods primarily emphasize the transfer of teacher knowledge while often neglecting the performance degradation that results from the capacity gap between teacher and student. To address this issue, we propose AgentPose, a novel pose distillation method that integrates a feature agent to model the distribution of teacher features and progressively aligns the distribution of student features with that of the teacher features, effectively overcoming the capacity gap and enhancing the ability of knowledge transfer. Our comprehensive experiments conducted on the COCO dataset substantiate the effectiveness of our method in knowledge transfer, particularly in scenarios with a high capacity gap.
https://arxiv.org/abs/2501.08088
The low visibility and high ISO noise of extremely low-light images obscure critical visual details and pose a significant challenge to human pose estimation. Current methods fail to provide high-quality representations because they rely on pixel-level enhancements that compromise semantics and cannot effectively handle extreme low-light conditions for robust feature learning. In this work, we propose a frequency-based framework for low-light human pose estimation, rooted in the "divide-and-conquer" principle. Instead of uniformly enhancing the entire image, our method focuses on task-relevant information. By applying dynamic illumination correction to the low-frequency components and low-rank denoising to the high-frequency components, we effectively enhance both the semantic and texture information essential for accurate pose estimation. This targeted enhancement yields robust, high-quality representations and significantly improves pose estimation performance. Extensive experiments demonstrate its superiority over state-of-the-art methods in various challenging low-light scenarios.
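To illustrate the frequency-domain "divide-and-conquer" split, the sketch below separates an image into low- and high-frequency components with an FFT, applies a simple gamma-style brightening to the low-frequency (illumination-like) part, and damps the high-frequency part. These are crude stand-ins for the paper's dynamic illumination correction and low-rank denoising, which are considerably more involved.

```python
import numpy as np

def frequency_split(img, radius=8):
    """Split a grayscale image (H, W, float in [0,1]) into low/high-frequency parts."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    lowpass = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2) <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * lowpass)).real
    return low, img - low

def enhance_low_light(img, gamma=0.5, high_gain=0.5):
    """Brighten the low-frequency (illumination-like) component and damp
    high-frequency noise; crude stand-ins for the paper's two branches."""
    low, high = frequency_split(img)
    low_corrected = np.clip(low, 0, 1) ** gamma      # gamma-style illumination lift
    return np.clip(low_corrected + high_gain * high, 0, 1)

# Example on a synthetic dark image.
dark = np.clip(0.1 * np.random.rand(64, 64) + 0.05, 0, 1)
print(enhance_low_light(dark).mean(), dark.mean())
```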
https://arxiv.org/abs/2501.08038
Recent advancements in 3D human pose estimation from single-camera images and videos have relied on parametric models, like SMPL. However, these models oversimplify anatomical structures, limiting their accuracy in capturing true joint locations and movements, which reduces their applicability in biomechanics, healthcare, and robotics. Biomechanically accurate pose estimation, on the other hand, typically requires costly marker-based motion capture systems and optimization techniques in specialized labs. To bridge this gap, we propose BioPose, a novel learning-based framework for predicting biomechanically accurate 3D human pose directly from monocular videos. BioPose includes three key components: a Multi-Query Human Mesh Recovery model (MQ-HMR), a Neural Inverse Kinematics (NeurIK) model, and a 2D-informed pose refinement technique. MQ-HMR leverages a multi-query deformable transformer to extract multi-scale fine-grained image features, enabling precise human mesh recovery. NeurIK treats the mesh vertices as virtual markers, applying a spatial-temporal network to regress biomechanically accurate 3D poses under anatomical constraints. To further improve 3D pose estimations, a 2D-informed refinement step optimizes the query tokens during inference by aligning the 3D structure with 2D pose observations. Experiments on benchmark datasets demonstrate that BioPose significantly outperforms state-of-the-art methods. Project website: \url{this https URL}.
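The 2D-informed refinement step can be pictured as a small test-time optimization: the network stays frozen while a few latent parameters (stand-ins for the query tokens) are adjusted so that the projected 3D joints match detected 2D keypoints. The decoder, projection model, optimizer, and step count below are assumptions for illustration only.

```python
import torch

def refine_against_2d(tokens, decode_3d, project, kp2d, conf, steps=30, lr=1e-2):
    """tokens: (B, N, D) latent query tokens to refine (network stays frozen);
    decode_3d(tokens) -> (B, J, 3) joints; project(j3d) -> (B, J, 2) pixels;
    kp2d, conf: detected 2D keypoints (B, J, 2) and confidences (B, J)."""
    tokens = tokens.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([tokens], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        err = (project(decode_3d(tokens)) - kp2d).norm(dim=-1)   # (B, J)
        loss = (conf * err).mean()          # confidence-weighted reprojection error
        loss.backward()
        opt.step()
    return tokens.detach()

# Toy example: a linear "decoder" and orthographic projection as placeholders.
decode = torch.nn.Linear(32, 17 * 3)
decode_3d = lambda t: decode(t.mean(dim=1)).view(-1, 17, 3)
project = lambda j3d: j3d[..., :2]          # orthographic projection stand-in
tokens = torch.randn(2, 8, 32)
refined = refine_against_2d(tokens, decode_3d, project,
                            kp2d=torch.randn(2, 17, 2), conf=torch.ones(2, 17))
```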
https://arxiv.org/abs/2501.07800
Recent advances in monocular depth prediction have led to significantly improved depth prediction accuracy. In turn, this enables various applications to use such depth predictions. In this paper, we propose a novel framework for estimating the relative pose between two cameras from point correspondences with associated monocular depths. Since depth predictions are typically defined up to an unknown scale and shift parameter, our solvers jointly estimate both scale and shift parameters together with the camera pose. We derive efficient solvers for three cases: (1) two calibrated cameras, (2) two uncalibrated cameras with an unknown but shared focal length, and (3) two uncalibrated cameras with unknown and different focal lengths. Experiments on synthetic and real data, including experiments with depth maps estimated by 11 different depth predictors, show the practical viability of our solvers. Compared to prior work, our solvers achieve state-of-the-art results on two large-scale, real-world datasets. The source code is available at this https URL
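For intuition, in the calibrated case each correspondence with predicted depths yields a constraint (a2*d2_i + b2) * x̂2_i = R (a1*d1_i + b1) * x̂1_i + t on the pose and the per-image affine depth parameters, where x̂ are unit-depth rays. The paper derives efficient minimal solvers for this; as a much simpler illustration of how the depth ambiguity couples with the pose, the sketch below fixes the first camera's depths as the gauge and recovers the second camera's unknown scale (shift ignored) together with R and t via a closed-form similarity alignment (Umeyama). This is a didactic simplification, not the proposed solver.

```python
import numpy as np

def similarity_align(P, Q):
    """Closed-form s, R, t minimizing ||s*R*P + t - Q|| (Umeyama), P, Q: 3xN."""
    mp, mq = P.mean(axis=1, keepdims=True), Q.mean(axis=1, keepdims=True)
    Pc, Qc = P - mp, Q - mq
    U, D, Vt = np.linalg.svd(Qc @ Pc.T / P.shape[1])
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (Pc ** 2).sum() * P.shape[1]
    return s, R, mq - s * R @ mp

def relative_pose_with_depth_scale(x1, x2, d1, d2):
    """x1, x2: 3xN unit-depth rays; d1 metric depths (gauge), d2 up-to-scale depths.
    Models (a2 * d2_i) * x2_i = R * (d1_i * x1_i) + t and returns R, t, a2."""
    s, R, t = similarity_align(x1 * d1, x2 * d2)   # (x2*d2) ≈ (1/a2)(R P + t)
    return R, t / s, 1.0 / s

# Synthetic check: known pose, camera-2 depths given to the solver at half scale.
rng = np.random.default_rng(0)
P = rng.uniform(-1, 1, (3, 40)) + np.array([[0.0], [0.0], [4.0]])
c, s_ = np.cos(0.2), np.sin(0.2)
R_gt = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
t_gt = np.array([[0.3], [0.1], [0.2]])
Q = R_gt @ P + t_gt
x1, d1 = P / P[2], P[2]
x2, d2 = Q / Q[2], Q[2] * 0.5                      # mono depth with unknown scale
R, t, a2 = relative_pose_with_depth_scale(x1, x2, d1, d2)
print(np.round(a2, 3))                             # ≈ 2.0: recovered depth scale
```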
https://arxiv.org/abs/2501.07742
With the availability of egocentric 3D hand-object interaction datasets, there is increasing interest in developing unified models for hand-object pose estimation and action recognition. However, existing methods still struggle to recognise seen actions on unseen objects due to the limitations in representing object shape and movement using 3D bounding boxes. Additionally, the reliance on object templates at test time limits their generalisability to unseen objects. To address these challenges, we propose to leverage superquadrics as an alternative 3D object representation to bounding boxes and demonstrate their effectiveness on both template-free object reconstruction and action recognition tasks. Moreover, as we find that pure appearance-based methods can outperform the unified methods, the potential benefits from 3D geometric information remain unclear. Therefore, we study the compositionality of actions by considering a more challenging task where the training combinations of verbs and nouns do not overlap with the testing split. We extend the H2O and FPHA datasets with compositional splits and design a novel collaborative learning framework that can explicitly reason about the geometric relations between hands and the manipulated object. Through extensive quantitative and qualitative evaluations, we demonstrate significant improvements over the state of the art in (compositional) action recognition.
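Superquadrics cover a family of shapes (boxes, cylinders, ellipsoids, and everything in between) with only a handful of parameters, which is what makes them attractive as a compact, template-free object representation. Below is a small sketch of the standard superquadric inside-outside function in the object frame (pose and the paper's fitting procedure are omitted).

```python
import numpy as np

def superquadric_io(points, scale, eps1, eps2):
    """Inside-outside function F for points (N, 3) in the object frame.
    scale = (a1, a2, a3) semi-axes; eps1/eps2 control roundness along z / in xy.
    F < 1 inside, F = 1 on the surface, F > 1 outside."""
    x, y, z = (np.abs(points) / np.asarray(scale)).T
    xy_term = (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy_term + z ** (2.0 / eps1)

# A 0.2 x 0.1 x 0.3 box-like shape (smaller eps gives sharper corners).
pts = np.array([[0.0, 0.0, 0.0], [0.19, 0.0, 0.0], [0.3, 0.3, 0.3]])
print(superquadric_io(pts, scale=(0.2, 0.1, 0.3), eps1=0.3, eps2=0.3))
```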
https://arxiv.org/abs/2501.07100
The bio-inspired event camera has garnered extensive research attention in recent years, owing to its significant potential derived from its high dynamic range and low latency characteristics. Similar to the standard camera, the event camera requires precise intrinsic calibration to facilitate further high-level visual applications, such as pose estimation and mapping. While several calibration methods for event cameras have been proposed, most of them are either (i) engineering-driven, heavily relying on conventional image-based calibration pipelines, or (ii) inconvenient, requiring complex instrumentation. To this end, we propose an accurate and convenient intrinsic calibration method for event cameras, named eKalibr, which builds upon a carefully designed event-based circle grid pattern recognition algorithm. To extract target patterns from events, we perform event-based normal flow estimation to identify potential events generated by circle edges, and cluster them spatially. Subsequently, event clusters associated with the same grid circles are matched and grouped using normal flows, for subsequent time-varying ellipse estimation. Fitted ellipse centers are time-synchronized, for final grid pattern recognition. We conducted extensive experiments to evaluate the performance of eKalibr in terms of pattern extraction and intrinsic calibration. The implementation of eKalibr is open-sourced at (this https URL) to benefit the research community.
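The grid-pattern recognition pipeline ultimately fits ellipses to spatial clusters of circle-edge events; OpenCV's `cv2.fitEllipse` is one drop-in way to do that once a cluster's (x, y) coordinates are collected. The toy sketch below replaces the normal-flow estimation and clustering stages with synthetic points, so it only illustrates the final fitting step.

```python
import numpy as np
import cv2

def fit_cluster_ellipse(event_xy):
    """Fit an ellipse (center, axes, angle) to a cluster of event coordinates."""
    pts = np.asarray(event_xy, dtype=np.float32)
    if len(pts) < 5:                      # fitEllipse needs at least 5 points
        return None
    return cv2.fitEllipse(pts)

# Synthetic events on the rim of a circle of radius 12 centered at (40, 30), with noise.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
events = np.stack([40 + 12 * np.cos(theta), 30 + 12 * np.sin(theta)], axis=1)
events += rng.normal(0, 0.3, events.shape)
center, axes, angle = fit_cluster_ellipse(events)
print(np.round(center, 1), np.round(axes, 1))   # ≈ (40, 30) and axis lengths ≈ 24
```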
https://arxiv.org/abs/2501.05688
Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity of monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial not only to the relative depth priors but also, surprisingly, to the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances on both modules. Code is available at this https URL.
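The affine ambiguity being corrected here amounts to modeling the usable depth as d ≈ a·d_pred + b with unknown per-image (a, b). The small sketch below recovers (a, b) by least squares from a handful of sparse reference depths (e.g., from triangulated keypoints); this is merely one simple way such a correction can be estimated and is not the paper's joint solvers.

```python
import numpy as np

def fit_affine_depth(d_pred, d_ref):
    """Least-squares fit of d_ref ≈ a * d_pred + b over sparse anchor pixels."""
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d_ref, rcond=None)
    return a, b

# Toy example: predicted relative depths off by scale 3 and shift -0.4.
rng = np.random.default_rng(1)
d_true = rng.uniform(2.0, 8.0, 30)
d_pred = (d_true + 0.4) / 3.0 + rng.normal(0, 0.01, 30)
a, b = fit_affine_depth(d_pred, d_true)
print(np.round(a, 2), np.round(b, 2))    # ≈ 3.0 and -0.4
corrected = a * d_pred + b               # depths usable by downstream pose solvers
```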
https://arxiv.org/abs/2501.05446
3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, RGB images are susceptible to limitations such as sensitivity to lighting conditions and potential user discomfort. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance and the imperative for continual learning. In this work, we introduce a novel balanced continual multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to quantify the contribution of each modality and identify modality imbalance. To address this imbalance, we employ a re-learning strategy. Furthermore, recognizing that raw data is prone to noise contamination, we develop a novel denoising continual learning approach. This approach incorporates a noise identification and separation module to mitigate the adverse effects of noise and collaborates with the balanced learning strategy to enhance optimization. Additionally, an adaptive EWC mechanism is employed to alleviate catastrophic forgetting. We conduct extensive experiments on the widely-adopted multi-modal dataset, MM-Fi, which demonstrate the superiority of our approach in boosting 3D pose estimation and mitigating catastrophic forgetting in complex scenarios. We will release our codes.
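The Shapley-value contribution measure averages each modality's marginal gain in utility over all subsets of the remaining modalities. The sketch below implements that directly with a stubbed utility function; the accuracy numbers and the diminishing-returns combination rule are made up purely for illustration, whereas the real method would evaluate the trained multi-modal model on each sensor subset.

```python
from itertools import combinations
from math import factorial

def shapley_values(modalities, utility):
    """utility(frozenset_of_modalities) -> score. Returns {modality: Shapley value}."""
    n = len(modalities)
    values = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (utility(s | {m}) - utility(s))
        values[m] = total
    return values

# Toy single-modality accuracies (illustrative numbers, not from the paper).
acc = {"rgb": 0.70, "lidar": 0.60, "mmwave": 0.40, "wifi": 0.30}

def utility(subset):
    # Toy diminishing-returns fusion of single-modality accuracies.
    missing = 1.0
    for m in subset:
        missing *= 1.0 - acc[m]
    return 1.0 - missing

print(shapley_values(["rgb", "lidar", "mmwave", "wifi"], utility))
```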
https://arxiv.org/abs/2501.05264
Recent advancements in LiDAR-Inertial Odometry (LIO) have enabled a wide range of applications. However, traditional LIO systems tend to focus more on localization than on mapping, producing maps that consist mostly of sparse geometric elements, which is not ideal for downstream tasks. Recently emerging neural field technology has great potential for dense mapping, but pure LiDAR mapping is difficult on highly dynamic vehicles. To mitigate this challenge, we present a new solution that tightly couples geometric kinematics with neural fields to enhance simultaneous state estimation and dense mapping capabilities. We propose both semi-coupled and tightly coupled Kinematic-Neural LIO (KN-LIO) systems that leverage online SDF decoding and iterated error-state Kalman filtering to fuse laser and inertial data. Our KN-LIO minimizes information loss and improves accuracy in state estimation, while also accommodating asynchronous multi-LiDAR inputs. Evaluations on diverse high-dynamic datasets demonstrate that our KN-LIO achieves performance on par with or superior to existing state-of-the-art solutions in pose estimation and offers improved dense mapping accuracy over pure LiDAR-based methods. The relevant code and datasets will be made available at https://**.
https://arxiv.org/abs/2501.04263
Virtual try-on methods based on diffusion models achieve realistic try-on effects. They use an extra reference network or an additional image encoder to process multiple conditional image inputs, which results in high training costs. Besides, they require more than 25 inference steps, bringing a long inference time. In this work, with the development of the diffusion transformer (DiT), we rethink the necessity of the reference network or image encoder and propose MC-VTON, which enables DiT to integrate minimal conditional try-on inputs by utilizing its intrinsic backbone. Compared to existing methods, the superiority of MC-VTON is demonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove any extra reference network or image encoder, as well as unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map; we require only the masked person image and the garment image. (3) Parameter-efficient training. To process the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply distillation diffusion to MC-VTON and need only 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, fewer inference steps, and fewer trainable parameters than baseline methods.
https://arxiv.org/abs/2501.03630
Reconstructing 3D models of dynamic, real-world objects with high-fidelity textures from monocular frame sequences has been a challenging problem in recent years. This difficulty stems from factors such as shadows, indirect illumination, and inaccurate object-pose estimations due to occluding hand-object interactions. To address these challenges, we propose a novel approach that predicts the hand's impact on environmental visibility and indirect illumination on the object's surface albedo. Our method first learns the geometry and low-fidelity texture of the object, hand, and background through composite rendering of radiance fields. Simultaneously, we optimize the hand and object poses to achieve accurate object-pose estimations. We then refine physics-based rendering parameters - including roughness, specularity, albedo, hand visibility, skin color reflections, and environmental illumination - to produce precise albedo, and accurate hand illumination and shadow regions. Our approach surpasses state-of-the-art methods in texture reconstruction and, to the best of our knowledge, is the first to account for hand-object interactions in object texture reconstruction.
https://arxiv.org/abs/2501.03525
As a novel way of presenting information, augmented reality (AR) enables people to interact with the physical world in a direct and intuitive way. While some mobile AR products have been implemented with specific hardware at a high cost, software approaches to AR on mobile platforms (such as smartphones, tablet PCs, etc.) are still far from practical use. GPS-based mobile AR systems usually perform poorly due to inaccurate positioning in indoor environments. Previous vision-based pose estimation methods need to continuously track predefined markers within a short distance, which greatly degrades the user experience. This paper first conducts a comprehensive study of the state-of-the-art AR and localization systems on mobile platforms. Then, we propose an effective indoor mobile AR framework. In the framework, a fusional localization method and a new pose estimation implementation are developed to increase the overall matching rate and thus improve AR display accuracy. Experiments show that our framework achieves higher performance than approaches based purely on images or Wi-Fi signals. We achieve low average error distances (0.61-0.81 m) and accurate matching rates (77%-82%) when the average sampling grid length is set to 0.5 m.
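One way to picture the fusional localization idea is to score every candidate grid cell with a weighted combination of Wi-Fi fingerprint similarity and image-retrieval similarity and pick the best cell. The sketch below uses made-up RSSI fingerprints, scores, and a fixed fusion weight purely for illustration; the paper's actual fusion and pose estimation are more sophisticated.

```python
import numpy as np

def fuse_localization(wifi_obs, wifi_db, image_scores, alpha=0.6):
    """wifi_obs: observed RSSI vector; wifi_db: (G, D) fingerprints per grid cell;
    image_scores: (G,) image-retrieval similarity per cell. Returns best cell index."""
    wifi_dist = np.linalg.norm(wifi_db - wifi_obs, axis=1)
    wifi_score = 1.0 / (1.0 + wifi_dist)                 # higher = closer fingerprint
    wifi_score = wifi_score / wifi_score.max()
    image_scores = image_scores / image_scores.max()
    fused = alpha * wifi_score + (1.0 - alpha) * image_scores
    return int(np.argmax(fused)), fused

# Toy example: 4 grid cells, 3 access points, made-up RSSI fingerprints.
wifi_db = np.array([[-40, -70, -60], [-55, -50, -65], [-70, -45, -50], [-60, -60, -40]], float)
obs = np.array([-56, -51, -66], float)            # closest to cell 1's fingerprint
img = np.array([0.2, 0.7, 0.9, 0.1])              # image retrieval prefers cell 2
cell, scores = fuse_localization(obs, wifi_db, img)
print(cell, np.round(scores, 2))
```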
https://arxiv.org/abs/2501.03336