In 3D human shape and pose estimation from a monocular video, models trained with limited labeled data cannot generalize well to videos with occlusion, which is common in in-the-wild videos. Recent human neural rendering approaches, which focus on novel view synthesis and are initialized with off-the-shelf human shape and pose methods, have the potential to correct the initial human shape. However, existing methods have several drawbacks: they handle occlusion erroneously, are sensitive to inaccurate human segmentation, and compute losses ineffectively because of the non-regularized opacity field. To address these problems, we introduce ORTexME, an occlusion-robust temporal method that utilizes temporal information from the input video to better regularize the occluded body parts. While our ORTexME is based on NeRF, to determine the reliable regions for NeRF ray sampling, we utilize our novel average-texture learning approach to learn the average appearance of a person and to infer a mask based on the average texture. In addition, to guide the opacity-field updates in NeRF to suppress blur and noise, we propose the use of the human body mesh. The quantitative evaluation demonstrates that our method achieves significant improvement on the challenging multi-person 3DPW dataset, reducing the P-MPJPE error by 1.8, whereas SOTA rendering-based methods fail and enlarge the error by up to 5.6 on the same dataset.
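For reference, P-MPJPE is the Procrustes-aligned mean per-joint position error: the prediction is first aligned to the ground truth with an optimal similarity transform, then joint distances are averaged. A minimal NumPy sketch of how this metric is typically computed (joint count and units below are illustrative, not the 3DPW protocol):

```python
import numpy as np

def p_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: align pred to gt with the optimal similarity
    transform (rotation, scale, translation), then average per-joint errors."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    X, Y = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    if np.linalg.det(R) < 0:        # avoid reflections in the optimal rotation
        Vt[-1] *= -1
        S[-1] *= -1
        R = U @ Vt
    scale = S.sum() / (X ** 2).sum()
    aligned = scale * X @ R + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()

pred = np.random.rand(14, 3)                     # 14 joints, arbitrary units
gt = pred + 0.01 * np.random.randn(14, 3)
print(p_mpjpe(pred, gt))
```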
https://arxiv.org/abs/2309.12183
As robotic systems increasingly encounter complex and unconstrained real-world scenarios, there is a demand to recognize diverse objects. State-of-the-art 6D object pose estimation methods rely on object-specific training and therefore do not generalize to unseen objects. Recent novel-object pose estimation methods address this issue with task-specific fine-tuned CNNs for deep template matching. This adaptation for pose estimation still requires expensive data rendering and training procedures. MegaPose, for example, is trained on a dataset consisting of two million images showing 20,000 different objects to reach such generalization capabilities. To overcome this shortcoming, we introduce ZS6D for zero-shot novel-object 6D pose estimation. Visual descriptors, extracted with pre-trained Vision Transformers (ViTs), are used to match rendered templates against query images of objects and to establish local correspondences. These local correspondences enable deriving geometric correspondences and are used to estimate the object's 6D pose with RANSAC-based PnP. This approach showcases that image descriptors extracted by pre-trained ViTs are well suited to achieve a notable improvement over two state-of-the-art novel-object 6D pose estimation methods, without the need for task-specific fine-tuning. Experiments are performed on LMO, YCBV, and TLESS. Compared to one of the two methods, we improve the Average Recall on all three datasets, and compared to the second method, we improve on two datasets.
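The last step, recovering a 6D pose from 2D-3D correspondences with RANSAC-based PnP, can be illustrated with OpenCV. The correspondences below are synthetic stand-ins for descriptor matches between a rendered template (with known 3D surface coordinates) and a query image; camera intrinsics and poses are assumed for the example:

```python
import cv2
import numpy as np

# Hypothetical 2D-3D correspondences, e.g. from matching ViT patch descriptors.
object_points = np.random.rand(30, 3).astype(np.float32)            # 3D model points
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], np.float32)  # assumed intrinsics
rvec_gt = np.array([0.1, -0.2, 0.05], np.float32)
tvec_gt = np.array([0.0, 0.0, 1.0], np.float32)
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

# Robustly solve for the 6D pose from the (possibly noisy) correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, None,
                                             reprojectionError=3.0)
print(ok, tvec.ravel())
```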
https://arxiv.org/abs/2309.11986
We present Ego3DPose, a highly accurate binocular egocentric 3D pose reconstruction system. The binocular egocentric setup offers practicality and usefulness in various applications; however, it remains largely under-explored. It suffers from low pose estimation accuracy due to viewing distortion, severe self-occlusion, and the limited field of view of the joints in egocentric 2D images. Here, we notice that two important 3D cues contained in the egocentric binocular input, stereo correspondences and perspective, are neglected. Current methods rely heavily on 2D image features, implicitly learning 3D information, which introduces biases toward commonly observed motions and leads to low overall accuracy. We observe that they fail not only in challenging occlusion cases but also in estimating visible joint positions. To address these challenges, we propose two novel approaches. First, we design a two-path network architecture with a path that estimates the pose of each limb independently from its binocular heatmaps. Without full-body information provided, it alleviates bias toward the trained full-body distribution. Second, we leverage the egocentric view of body limbs, which exhibits strong perspective variance (e.g., a significantly larger hand when it is close to the camera). We propose a new perspective-aware representation using trigonometry, enabling the network to estimate the 3D orientation of limbs. Finally, we develop an end-to-end pose reconstruction network that synergizes both techniques. Our comprehensive evaluations demonstrate that Ego3DPose outperforms state-of-the-art models by a pose estimation error (i.e., MPJPE) reduction of 23.1% on the UnrealEgo dataset. Our qualitative results highlight the superiority of our approach across a range of scenarios and challenges.
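The stereo-correspondence cue mentioned above can be made concrete with a toy triangulation: given matching joint locations in the two egocentric views and known camera matrices, the 3D joint position follows directly. All camera parameters and the joint location below are assumptions for illustration, not values from the paper:

```python
import cv2
import numpy as np

# Two hypothetical rectified egocentric cameras with a small baseline (assumed values).
K = np.array([[320, 0, 160], [0, 320, 120], [0, 0, 1]], float)
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.06], [0.0], [0.0]])])  # 6 cm baseline

joint = np.array([[0.1], [0.05], [0.5], [1.0]])            # a joint 0.5 m in front of the rig
uv_left = (P_left @ joint)[:2] / (P_left @ joint)[2]        # projections = stereo correspondence
uv_right = (P_right @ joint)[:2] / (P_right @ joint)[2]

point4d = cv2.triangulatePoints(P_left, P_right, uv_left, uv_right)
print((point4d[:3] / point4d[3]).ravel())                   # recovers the 3D joint position
```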
https://arxiv.org/abs/2309.11962
Extreme head postures pose a common challenge across a spectrum of facial analysis tasks, including face detection, facial landmark detection (FLD), and head pose estimation (HPE). These tasks are interdependent, where accurate FLD relies on robust face detection, and HPE is intricately associated with these key points. This paper focuses on the integration of these tasks, particularly when addressing the complexities posed by large-angle face poses. The primary contribution of this study is the proposal of a real-time multi-task detection system capable of simultaneously performing joint detection of faces, facial landmarks, and head poses. This system builds upon the widely adopted YOLOv8 detection framework. It extends the original object detection head by incorporating an additional landmark regression head, enabling efficient localization of crucial facial landmarks. Furthermore, we conduct optimizations and enhancements on various modules within the original YOLOv8 framework. To validate the effectiveness and real-time performance of our proposed model, we conduct extensive experiments on the 300W-LP and AFLW2000-3D datasets. The results obtained verify the capability of our model to tackle large-angle face pose challenges while delivering real-time performance across these interconnected tasks.
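The idea of attaching extra regression branches to a detection head can be sketched in PyTorch. This toy module is an assumption for illustration only and is not the actual YOLOv8 head; channel counts, landmark count, and the pose parameterization (yaw, pitch, roll) are all placeholders:

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Toy detection head extended with landmark-regression and head-pose branches."""
    def __init__(self, in_ch=256, num_classes=1, num_landmarks=5):
        super().__init__()
        self.box = nn.Conv2d(in_ch, 4, 1)                         # box regression per cell
        self.cls = nn.Conv2d(in_ch, num_classes, 1)               # face classification
        self.landmarks = nn.Conv2d(in_ch, 2 * num_landmarks, 1)   # added (x, y) per landmark
        self.pose = nn.Conv2d(in_ch, 3, 1)                        # yaw, pitch, roll

    def forward(self, feat):
        return self.box(feat), self.cls(feat), self.landmarks(feat), self.pose(feat)

head = MultiTaskHead()
outputs = head(torch.randn(1, 256, 20, 20))   # stand-in feature map
print([o.shape for o in outputs])
```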
https://arxiv.org/abs/2309.11773
As 3D human pose estimation can now be achieved with very high accuracy in the supervised learning scenario, tackling the case where 3D pose annotations are not available has received increasing attention. In particular, several methods have proposed to learn image representations in a self-supervised fashion so as to disentangle the appearance information from the pose information. The methods then only need a small amount of supervised data to train a pose regressor using the pose-related latent vector as input, as it should be free of appearance information. In this paper, we carry out an in-depth analysis to understand to what degree the state-of-the-art disentangled representation learning methods truly separate the appearance information from the pose information. First, we study disentanglement from the perspective of the self-supervised network, via diverse image synthesis experiments. Second, we investigate disentanglement with respect to the 3D pose regressor from an adversarial attack perspective. Specifically, we design an adversarial strategy focused on generating natural appearance changes of the subject, against which we could expect a disentangled network to be robust. Altogether, our analyses show that disentanglement in the three state-of-the-art disentangled representation learning frameworks is far from complete, and that their pose codes contain significant appearance information. We believe that our approach provides a valuable testbed to evaluate the degree of disentanglement of pose from appearance in self-supervised 3D human pose estimation.
https://arxiv.org/abs/2309.11667
This work presents an Online Supervised Training (OST) method to enable robust vision-based navigation about a non-cooperative spacecraft. Spaceborne Neural Networks (NN) are susceptible to the domain gap as they are primarily trained with synthetic images due to the inaccessibility of space. OST aims to close this gap by training a pose estimation NN online using incoming flight images during Rendezvous and Proximity Operations (RPO). The pseudo-labels are provided by an adaptive unscented Kalman filter in which the NN is used in the loop as a measurement module. Specifically, the filter tracks the target's relative orbital and attitude motion, and its accuracy is ensured by robust on-ground training of the NN using only synthetic data. Experiments on real hardware-in-the-loop trajectory images show that OST can improve the NN performance on the target image domain, given that OST is performed on images of the target viewed from a diverse set of directions during RPO.
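A minimal sketch of the online pseudo-labelling loop described above: the network produces a measurement, the filter fuses it into a state estimate, and that filtered estimate is used as the regression target for an online update. The adaptive unscented Kalman filter is reduced to a trivial placeholder here, and the network architecture, state dimension, and learning rate are all assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical pose regressor (stand-in for the actual spaceborne NN).
pose_net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 7))
optimizer = torch.optim.Adam(pose_net.parameters(), lr=1e-4)

def ukf_update(prev_state, nn_measurement):
    # Placeholder for the adaptive unscented Kalman filter update; here it simply
    # passes the measurement through to keep the sketch self-contained.
    return nn_measurement

state = torch.zeros(7)
for image in torch.rand(5, 1, 64, 64):               # stand-in for incoming flight images
    with torch.no_grad():
        measurement = pose_net(image.unsqueeze(0)).squeeze(0)
    state = ukf_update(state, measurement)            # filtered estimate acts as pseudo-label
    pred = pose_net(image.unsqueeze(0)).squeeze(0)
    loss = nn.functional.mse_loss(pred, state.detach())
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```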
https://arxiv.org/abs/2309.11645
Visual Odometry (VO) plays a pivotal role in autonomous systems, with a principal challenge being the lack of depth information in camera images. This paper introduces OCC-VO, a novel framework that capitalizes on recent advances in deep learning to transform 2D camera images into 3D semantic occupancy, thereby circumventing the traditional need for concurrent estimation of ego poses and landmark locations. Within this framework, we utilize the TPV-Former to convert surround-view camera images into 3D semantic occupancy. Addressing the challenges presented by this transformation, we have specifically tailored a pose estimation and mapping algorithm that incorporates a Semantic Label Filter and a Dynamic Object Filter, and finally utilizes a Voxel PFilter to maintain a consistent global semantic map. Evaluations on the Occ3D-nuScenes benchmark not only showcase a 20.6% improvement in Success Ratio and a 29.6% enhancement in trajectory accuracy against ORB-SLAM3, but also emphasize our ability to construct a comprehensive map. Our implementation is open-sourced and available at: this https URL.
https://arxiv.org/abs/2309.11011
Affordance detection and pose estimation are of great importance in many robotic applications. Their combination helps the robot gain an enhanced manipulation capability, in which the generated pose can facilitate the corresponding affordance task. Previous methods for affordance-pose joint learning are limited to a predefined set of affordances, thus limiting the adaptability of robots in real-world environments. In this paper, we propose a new method for language-conditioned affordance-pose joint learning in 3D point clouds. Given a 3D point cloud object, our method detects the affordance region and generates appropriate 6-DoF poses for any unconstrained affordance label. Our method consists of an open-vocabulary affordance detection branch and a language-guided diffusion model that generates 6-DoF poses based on the affordance text. We also introduce a new high-quality dataset for the task of language-driven affordance-pose joint learning. Intensive experimental results demonstrate that our proposed method works effectively on a wide range of open-vocabulary affordances and outperforms other baselines by a large margin. In addition, we illustrate the usefulness of our method in real-world robotic applications. Our code and dataset are publicly available at this https URL
https://arxiv.org/abs/2309.10911
Bodily behavioral language is an important social cue, and its automated analysis helps in enhancing the understanding of artificial intelligence systems. Furthermore, behavioral language cues are essential for active engagement in social agent-based user interactions. Despite the progress made in computer vision for tasks like head and body pose estimation, there is still a need to explore the detection of finer behaviors such as gesturing, grooming, or fumbling. This paper proposes a multiview attention fusion method named MAGIC-TBR that combines features extracted from videos and their corresponding Discrete Cosine Transform coefficients via a transformer-based approach. The experiments are conducted on the BBSI dataset and the results demonstrate the effectiveness of the proposed feature fusion with multiview attention. The code is available at: this https URL
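The Discrete Cosine Transform coefficients fused with the video features can be illustrated with SciPy. The block size and the choice of keeping only low-frequency coefficients below are assumptions for the sketch, not the MAGIC-TBR configuration:

```python
import numpy as np
from scipy.fft import dctn

frames = np.random.rand(8, 64, 64)   # stand-in for a short grayscale video clip
# 2D DCT per frame; low-frequency coefficients give a compact summary of coarse
# appearance that can be fused with learned video features.
dct_feats = np.stack([dctn(f, norm="ortho")[:8, :8].ravel() for f in frames])
print(dct_feats.shape)               # (8, 64): one DCT feature vector per frame
```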
https://arxiv.org/abs/2309.10765
An accurate and uncertainty-aware 3D human body pose estimation is key to enabling truly safe but efficient human-robot interactions. Current uncertainty-aware methods in 3D human pose estimation are limited to predicting the uncertainty of the body posture, while effectively neglecting the body shape and root pose. In this work, we present GloPro, which is, to the best of our knowledge, the first framework to predict an uncertainty distribution of a 3D body mesh including its shape, pose, and root pose, by efficiently fusing visual cues with a learned motion model. We demonstrate that it vastly outperforms state-of-the-art methods in terms of human trajectory accuracy in a world coordinate system (even in the presence of severe occlusions), yields consistent uncertainty distributions, and can run in real time. Our code will be released upon acceptance at this https URL.
https://arxiv.org/abs/2309.10369
While showing promising results, recent RGB-D camera-based category-level object pose estimation methods have restricted applications due to their heavy reliance on depth sensors. RGB-only methods provide an alternative to this problem yet suffer from inherent scale ambiguity stemming from monocular observations. In this paper, we propose a novel pipeline that decouples the 6D pose and size estimation to mitigate the influence of imperfect scales on rigid transformations. Specifically, we leverage a pre-trained monocular estimator to extract local geometric information, mainly facilitating the search for inlier 2D-3D correspondences. Meanwhile, a separate branch is designed to directly recover the metric scale of the object based on category-level statistics. Finally, we advocate using the RANSAC-P$n$P algorithm to robustly solve for the 6D object pose. Extensive experiments have been conducted on both synthetic and real datasets, demonstrating the superior performance of our method over previous state-of-the-art RGB-based approaches, especially in terms of rotation accuracy.
https://arxiv.org/abs/2309.10255
The most commonly used method for addressing 3D geometric registration is the iterative closest point (ICP) algorithm; this approach is incremental and prone to drift over multiple consecutive frames. The common strategy to address the drift is pose graph optimization subsequent to frame-to-frame registration, incorporating a loop closure process that identifies previously visited places. In this paper, we explore a framework that replaces traditional geometric registration and pose graph optimization with a learned model utilizing hierarchical attention mechanisms and graph neural networks. We propose a strategy to condense the data flow, preserving essential information required for the precise estimation of rigid poses. Our results, derived from tests on the KITTI Odometry dataset, demonstrate a significant improvement in pose estimation accuracy. This improvement is especially notable in determining rotational components when compared with results obtained through conventional multi-way registration via pose graph optimization. The code will be made available upon completion of the review process.
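For context, one point-to-point ICP iteration (the incremental baseline the learned model replaces) consists of a nearest-neighbour association followed by a closed-form rigid alignment. A NumPy sketch of a single step, with synthetic point clouds as stand-ins:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(src, dst):
    """One point-to-point ICP iteration: nearest neighbours + Kabsch alignment."""
    tree = cKDTree(dst)
    _, idx = tree.query(src)                  # closest destination point for each source point
    matched = dst[idx]
    src_c, dst_c = src.mean(0), matched.mean(0)
    H = (src - src_c).T @ (matched - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # fix an improper rotation (reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

src = np.random.rand(100, 3)                  # synthetic source scan
dst = src + np.array([0.05, -0.02, 0.01])     # same scan shifted by a small known offset
R, t = icp_step(src, dst)
print(np.round(t, 3))                         # roughly recovers the applied translation
```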
https://arxiv.org/abs/2309.09934
Current deep learning-based solutions for image analysis tasks are commonly incapable of handling problems for which multiple different plausible solutions exist. In response, posterior-based methods such as conditional Diffusion Models and Invertible Neural Networks have emerged; however, their translation into practice is hampered by a lack of research on adequate validation. In other words, the way progress is measured often does not reflect the needs of the driving practical application. Closing this gap in the literature, we present the first systematic framework for the application-driven validation of posterior-based methods in inverse problems. As a methodological novelty, it adopts key principles from the field of object detection validation, which has a long history of addressing the question of how to locate and match multiple object instances in an image. Treating modes as instances enables us to perform mode-centric validation, using well-interpretable metrics from the application perspective. We demonstrate the value of our framework through instantiations for a synthetic toy example and two medical vision use cases: pose estimation in surgery and imaging-based quantification of functional tissue parameters for diagnostics. Our framework offers key advantages over common approaches to posterior validation in all three examples and could thus revolutionize performance assessment in inverse problems.
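Treating posterior modes as instances means matching them to ground-truth modes much as detections are matched to annotated objects. A toy sketch of such mode-centric matching, with a greedy one-to-one assignment and an assumed distance threshold (the paper's actual metrics and matching rules may differ):

```python
import numpy as np

def match_modes(pred_modes, gt_modes, tau=0.1):
    """Greedy one-to-one matching of predicted posterior modes to ground-truth modes,
    analogous to matching detections to object instances; returns precision/recall."""
    dists = np.linalg.norm(pred_modes[:, None] - gt_modes[None, :], axis=-1)
    matched_pred, matched_gt = set(), set()
    pairs = sorted(((i, j) for i in range(len(pred_modes)) for j in range(len(gt_modes))),
                   key=lambda ij: dists[ij])
    for p, g in pairs:
        if dists[p, g] <= tau and p not in matched_pred and g not in matched_gt:
            matched_pred.add(p); matched_gt.add(g)
    tp = len(matched_gt)
    return tp / max(len(pred_modes), 1), tp / max(len(gt_modes), 1)

# Three predicted modes, two ground-truth modes of a 1D posterior (illustrative values).
print(match_modes(np.array([[0.0], [0.5], [0.9]]), np.array([[0.02], [0.48]])))
```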
https://arxiv.org/abs/2309.09764
Unlike in natural images, in endoscopy there is no clear notion of an upright camera orientation. Endoscopic videos therefore often contain large rotational motions, which require keypoint detection and description algorithms to be robust to these conditions. While most classical methods achieve rotation-equivariant detection and invariant description by design, many learning-based approaches learn to be robust only up to a certain degree. At the same time, learning-based methods under moderate rotations often outperform classical approaches. To address this shortcoming, in this paper we propose RIDE, a learning-based method for rotation-equivariant detection and invariant description. Following recent advancements in group-equivariant learning, RIDE models rotation-equivariance implicitly within its architecture. Trained in a self-supervised manner on a large curation of endoscopic images, RIDE requires no manual labeling of training data. We test RIDE in the context of surgical tissue tracking on the SuPeR dataset as well as in the context of relative pose estimation on a repurposed version of the SCARED dataset. In addition, we perform explicit studies showing its robustness to large rotations. Our comparison against recent learning-based and classical approaches shows that RIDE sets a new state-of-the-art performance on matching and relative pose estimation tasks and scores competitively on surgical tissue tracking.
https://arxiv.org/abs/2309.09563
We propose a sparse and privacy-enhanced representation for Human Pose Estimation (HPE). Given a perspective camera, we use a proprietary motion vector sensor (MVS) to extract an edge image and a two-directional motion vector image at each time frame. Both edge and motion vector images are sparse and contain much less information (i.e., enhancing human privacy). We advocate that edge information is essential for HPE, and motion vectors complement edge information during fast movements. We propose a fusion network leveraging recent advances in sparse convolution, typically used for 3D voxels, to efficiently process our proposed sparse representation, which achieves about a 13x speed-up and a 96% reduction in FLOPs. We collect an in-house edge and motion vector dataset with 16 types of actions by 40 users using the proprietary MVS. Our method outperforms individual modalities using only edge or motion vector images. Finally, we validate the privacy-enhanced quality of our sparse representation through face recognition on CelebA (a large face dataset) and a user study on our in-house dataset.
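As a rough approximation of this input representation, an edge image and a dense motion field can be extracted with OpenCV. The paper uses a proprietary MVS instead, so the operators and thresholds below are stand-ins chosen only to show what "edges plus two-directional motion vectors" looks like in code:

```python
import cv2
import numpy as np

prev = (np.random.rand(120, 160) * 255).astype(np.uint8)   # stand-in consecutive frames;
curr = (np.random.rand(120, 160) * 255).astype(np.uint8)   # the paper uses a proprietary MVS

edges = cv2.Canny(curr, 50, 150)                            # sparse edge image
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,       # dense (dx, dy) field as a rough
                                    0.5, 3, 15, 3, 5, 1.2, 0)  # proxy for motion vectors

sparsity = (edges > 0).mean()
print(f"edge occupancy: {sparsity:.3f}, flow shape: {flow.shape}")
```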
https://arxiv.org/abs/2309.09515
Current interacting hand (IH) datasets are relatively simplistic in terms of background and texture, their hand joints are annotated by a machine annotator, which may result in inaccuracies, and the diversity of their pose distributions is limited. However, the variability of background, pose distribution, and texture can greatly influence the generalization ability. Therefore, we present RenderIH, a large-scale synthetic dataset for interacting hands with accurate and diverse pose annotations. The dataset contains 1M photo-realistic images with varied backgrounds, perspectives, and hand textures. To generate natural and diverse interacting poses, we propose a new pose optimization algorithm. Additionally, for better pose estimation accuracy, we introduce a transformer-based pose estimation network, TransHand, to leverage the correlation between interacting hands and to verify the effectiveness of RenderIH in improving results. Our dataset is model-agnostic and can improve the accuracy of any hand pose estimation method more than other real or synthetic datasets. Experiments have shown that pretraining on our synthetic data can significantly decrease the error from 6.76mm to 5.79mm, and our TransHand surpasses contemporary methods. Our dataset and code are available at this https URL.
https://arxiv.org/abs/2309.09301
Range-only (RO) pose estimation involves determining a robot's pose over time by measuring the distance between multiple devices on the robot, known as tags, and devices installed in the environment, known as anchors. The nonconvex nature of the range measurement model results in a cost function with possible local minima. In the absence of a good initialization, commonly used iterative solvers can get stuck in these local minima resulting in poor trajectory estimation accuracy. In this work, we propose convex relaxations to the original nonconvex problem based on semidefinite programs (SDPs). Specifically, we formulate computationally tractable SDP relaxations to obtain accurate initial pose and trajectory estimates for RO trajectory estimation under static and dynamic (i.e., constant-velocity motion) conditions. Through simulation and real experiments, we demonstrate that our proposed initialization strategies estimate the initial state accurately compared to iterative local solvers. Additionally, the proposed relaxations recover global minima under moderate range measurement noise levels.
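The flavour of the convex relaxation can be shown on a toy static 2D range-only localization problem: substituting a variable y for ||x||^2 makes the squared-range residuals linear in (x, y), and relaxing y = ||x||^2 to y >= ||x||^2 (the Schur-complement form of the PSD constraint used in an SDP) yields a convex problem. This CVXPY sketch is illustrative only and is not the paper's formulation; anchor layout and the noiseless ranges are assumed:

```python
import cvxpy as cp
import numpy as np

anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])  # known anchor positions
x_true = np.array([1.0, 2.5])
ranges = np.linalg.norm(anchors - x_true, axis=1)       # noiseless ranges for illustration

x = cp.Variable(2)   # unknown tag position
y = cp.Variable()    # convex surrogate for ||x||^2

# ||x - a_i||^2 - r_i^2 becomes linear in (x, y) after substituting y for ||x||^2.
residuals = cp.hstack([y - 2 * anchors[i] @ x + anchors[i] @ anchors[i] - ranges[i] ** 2
                       for i in range(len(anchors))])

# The epigraph constraint y >= ||x||^2 is the Schur-complement form of the PSD
# constraint [[I, x], [x^T, y]] >> 0 appearing in the SDP relaxation.
prob = cp.Problem(cp.Minimize(cp.sum_squares(residuals)), [cp.sum_squares(x) <= y])
prob.solve()
print("estimated position:", np.round(x.value, 3))
```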
https://arxiv.org/abs/2309.09011
Dynamic reconstruction with neural radiance fields (NeRF) requires accurate camera poses. These are often hard to retrieve with existing structure-from-motion (SfM) pipelines as both the camera and the scene content can change. We propose DynaMoN, which leverages simultaneous localization and mapping (SLAM) jointly with motion masking to handle dynamic scene content. Our robust SLAM-based tracking module significantly accelerates the training process of the dynamic NeRF while improving the quality of synthesized views at the same time. Extensive experimental validation on three real-world datasets, TUM RGB-D, BONN RGB-D Dynamic, and DyCheck's iPhone dataset, shows the advantages of DynaMoN both for camera pose estimation and novel view synthesis.
https://arxiv.org/abs/2309.08927
One-shot LiDAR localization refers to the ability to estimate the robot pose from one single point cloud, which yields significant advantages in initialization and relocalization processes. In the point cloud domain, the topic has been extensively studied as a global descriptor retrieval (i.e., loop closure detection) and pose refinement (i.e., point cloud registration) problem, either in isolation or combined. However, few have explicitly considered the relationship between candidate retrieval and correspondence generation in pose estimation, leaving them brittle to substructure ambiguities. To this end, we propose a hierarchical one-shot localization algorithm called Outram that leverages substructures of 3D scene graphs for locally consistent correspondence searching and global substructure-wise outlier pruning. Such a hierarchical process couples feature retrieval with correspondence extraction to resolve the substructure ambiguities by conducting a local-to-global consistency refinement. We demonstrate the capability of Outram in a variety of scenarios in multiple large-scale outdoor datasets. Our implementation is open-sourced: this https URL.
https://arxiv.org/abs/2309.08914
Our work introduces the YCB-Ev dataset, which contains synchronized RGB-D frames and event data and enables evaluating 6DoF object pose estimation algorithms using these modalities. This dataset provides ground-truth 6DoF object poses for the same 21 YCB objects (Calli et al., 2017) that were used in the YCB-Video (YCB-V) dataset, enabling the evaluation of algorithm performance when transferred across datasets. The dataset consists of 21 synchronized event and RGB-D sequences, amounting to a total of 7:43 minutes of video. Notably, 12 of these sequences feature the same object arrangement as the YCB-V subset used in the BOP challenge. Our dataset is the first to provide ground-truth 6DoF pose data for event streams. Furthermore, we evaluate the generalization capabilities of two state-of-the-art algorithms, which were pre-trained for the BOP challenge, using our novel YCB-Ev sequences. The proposed dataset is available at this https URL.
https://arxiv.org/abs/2309.08482