Category-level 6D pose estimation aims to predict the poses and sizes of unseen objects from a specific category. Thanks to prior deformation, which explicitly adapts a category-specific 3D prior (i.e., a 3D template) to a given object instance, prior-based methods have attained great success and have become a major research stream. However, obtaining category-specific priors requires collecting a large number of 3D models, which is labor-intensive and often not feasible in practice. This motivates us to investigate whether priors are necessary to make prior-based methods effective. Our empirical study shows that the 3D prior itself is not what accounts for the high performance; the key is actually the explicit deformation process, which aligns camera and world coordinates under supervision from world-space 3D models (also called canonical space). Inspired by these observations, we introduce a simple prior-free implicit space transformation network, IST-Net, that transforms camera-space features to world-space counterparts and builds correspondence between them in an implicit manner, without relying on 3D priors. Besides, we design camera- and world-space enhancers to enrich the features with pose-sensitive information and geometric constraints, respectively. Albeit simple, IST-Net is the first prior-free method to achieve state-of-the-art performance, with top inference speed, on the REAL275 dataset. Our code and models will be publicly available.
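To make the implicit transformation concrete, here is a minimal sketch (not the authors' code) of what a prior-free camera-to-world feature transform could look like: an MLP maps per-point camera-space features to world-space counterparts, which during training would be aligned to features derived from canonical-space supervision. All module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ImplicitSpaceTransform(nn.Module):
    """Toy stand-in for the idea in IST-Net: map camera-space point
    features to world-space features without an explicit 3D prior."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, cam_feats, cam_points):
        # cam_feats: (B, N, F) per-point features; cam_points: (B, N, 3)
        x = torch.cat([cam_feats, cam_points], dim=-1)
        return self.mlp(x)  # (B, N, F) world-space features

# Training would align these outputs (e.g., with an L2 loss) to features
# computed in the canonical (world) space; no 3D template is involved.
world_feats = ImplicitSpaceTransform()(torch.randn(2, 1024, 128), torch.randn(2, 1024, 3))
```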
https://arxiv.org/abs/2303.13479
We propose a novel method for joint estimation of shape and pose of rigid objects from their sequentially observed RGB-D images. In sharp contrast to past approaches that rely on complex non-linear optimization, we propose to formulate it as a neural optimization that learns to efficiently estimate the shape and pose. We introduce Deep Directional Distance Function (DeepDDF), a neural network that directly outputs the depth image of an object given the camera viewpoint and viewing direction, for efficient error computation in 2D image space. We formulate the joint estimation itself as a Transformer which we refer to as TransPoser. We fully leverage the tokenization and multi-head attention to sequentially process the growing set of observations and to efficiently update the shape and pose with a learned momentum, respectively. Experimental results on synthetic and real data show that DeepDDF achieves high accuracy as a category-level object shape representation and TransPoser achieves state-of-the-art accuracy efficiently for joint shape and pose estimation.
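The directional-distance idea can be sketched as follows (a hedged illustration, not the paper's architecture): a conditioned MLP takes a shape latent code, the camera position, and per-pixel ray directions, and returns the distance to the surface along each ray. The latent size and layer widths are assumptions.

```python
import torch
import torch.nn as nn

class DirectionalDistanceSketch(nn.Module):
    """Toy directional distance function: (shape code, ray origin, ray
    direction) -> distance to the object surface along the ray."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3 + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Softplus(),  # distances are non-negative
        )

    def forward(self, shape_code, origin, dirs):
        # shape_code: (L,), origin: (3,) camera position, dirs: (R, 3) rays
        n = dirs.shape[0]
        x = torch.cat([shape_code.expand(n, -1), origin.expand(n, -1), dirs], dim=-1)
        return self.net(x).squeeze(-1)  # (R,) depth along each ray

ddf = DirectionalDistanceSketch()
depth = ddf(torch.randn(64), torch.tensor([0.0, 0.0, -2.0]), torch.randn(100, 3))
```

Rendering a full depth image then reduces to one forward pass over the pixel rays, which is what makes error computation in 2D image space cheap.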
https://arxiv.org/abs/2303.13477
We present a novel technique to estimate the 6D pose of objects from single images where the 3D geometry of the object is only given approximately and not as a precise 3D model. To achieve this, we employ a dense 2D-to-3D correspondence predictor that regresses 3D model coordinates for every pixel. In addition to the 3D coordinates, our model also estimates the pixel-wise coordinate error to discard correspondences that are likely wrong. This allows us to generate multiple 6D pose hypotheses of the object, which we then refine iteratively using a highly efficient region-based approach. We also introduce a novel pixel-wise posterior formulation by which we can estimate the probability for each hypothesis and select the most likely one. As we show in experiments, our approach is capable of dealing with extreme visual conditions including overexposure, high contrast, or low signal-to-noise ratio. This makes it a powerful technique for the particularly challenging task of estimating the pose of tumbling satellites for in-orbit robotic applications. Our method achieves state-of-the-art performance on the SPEED+ dataset and has won the SPEC2021 post-mortem competition.
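A hedged sketch of the hypothesis-generation step: discard correspondences whose predicted coordinate error is high, then draw several RANSAC-based PnP solutions from random subsets (OpenCV is used here for illustration; the paper refines hypotheses with a region-based approach and a pixel-wise posterior rather than plain RANSAC). The threshold and sample sizes are assumptions.

```python
import numpy as np
import cv2

def pose_hypotheses(pix, obj_pts, pred_err, K, err_thresh=0.05, n_hyp=4, rng=None):
    """pix: (N, 2) pixel coords; obj_pts: (N, 3) regressed 3D model coords;
    pred_err: (N,) predicted per-pixel coordinate error."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = pred_err < err_thresh              # drop likely-wrong correspondences
    pix, obj_pts = pix[keep], obj_pts[keep]
    hyps = []
    for _ in range(n_hyp):
        idx = rng.choice(len(pix), size=min(200, len(pix)), replace=False)
        ok, rvec, tvec, _ = cv2.solvePnPRansac(
            obj_pts[idx].astype(np.float64), pix[idx].astype(np.float64),
            K, None, flags=cv2.SOLVEPNP_EPNP)
        if ok:
            hyps.append((rvec, tvec))
    return hyps
```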
https://arxiv.org/abs/2303.13241
Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, allowing for partial learning of 2D-3D point correspondences by backpropagating the gradients of the pose loss. Yet, learning the entire set of correspondences from scratch is highly challenging, particularly for ambiguous pose solutions, where the globally optimal pose is theoretically non-differentiable w.r.t. the points. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of poses with differentiable probability density on the SE(3) manifold. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle generalizes previous approaches and resembles the attention mechanism. EPro-PnP can enhance existing correspondence networks, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation benchmark. Furthermore, EPro-PnP helps to explore new possibilities of network design, as we demonstrate a novel deformable correspondence network with state-of-the-art pose accuracy on the nuScenes 3D object detection benchmark. Our code is available at this https URL.
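A hedged sketch of the density at the core of the method: given weighted 2D-3D correspondences, the unnormalized log-likelihood of a pose is the negative weighted sum of squared reprojection errors; EPro-PnP normalizes this over the SE(3) manifold (via Monte Carlo in the paper) to obtain a differentiable density. The pinhole model below is assumed for illustration.

```python
import numpy as np

def pose_log_likelihood(R, t, x3d, x2d, w, K):
    """Unnormalized log-density of pose (R, t) under weighted correspondences.
    x3d: (N, 3) object points; x2d: (N, 2) image points; w: (N,) weights."""
    cam = x3d @ R.T + t                        # object -> camera frame
    proj = cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]          # pinhole projection
    r = np.linalg.norm(proj - x2d, axis=1)     # per-point reprojection error
    return -0.5 * np.sum(w * r ** 2)
```

Training then pushes probability mass toward the target pose by minimizing the KL divergence between this predicted density and the target pose distribution.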
https://arxiv.org/abs/2303.12787
Object detection is one of the most important and fundamental computer vision tasks and is broadly utilized in pose estimation, object tracking, and instance segmentation models. To obtain training data for object detection models efficiently, many datasets opt to collect their unannotated data in video format, and the annotator needs to draw a bounding box around each object in the images. Annotating every frame of a video is costly and inefficient, since many frames contain very similar information for the model to learn from. How to select the most informative frames of a video to annotate has thus become a highly practical task, yet it has attracted little attention in research. In this paper, we propose a novel active learning algorithm for object detection models to tackle this problem. In the proposed algorithm, both the classification and localization informativeness of unlabelled data are measured and aggregated. Utilizing the temporal information in video frames, two novel localization informativeness measurements are proposed. Furthermore, a weight curve is proposed to avoid querying adjacent frames. The proposed active learning algorithm was evaluated with multiple configurations on the MuPoTS and FootballPD datasets.
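To illustrate the weight-curve idea concretely: after scoring every frame's informativeness, suppress the scores of frames near already-selected ones so that near-duplicate adjacent frames are not queried together. The Gaussian suppression below is an assumed form, not the paper's exact curve.

```python
import numpy as np

def select_frames(scores, budget, sigma=10.0):
    """Greedy selection: repeatedly pick the most informative frame, then
    down-weight its temporal neighbours via a suppression weight curve."""
    scores = np.asarray(scores, dtype=float).copy()
    t = np.arange(len(scores))
    chosen = []
    for _ in range(budget):
        i = int(np.argmax(scores))
        chosen.append(i)
        scores *= 1.0 - np.exp(-0.5 * ((t - i) / sigma) ** 2)  # weight curve
    return sorted(chosen)

print(select_frames(np.random.default_rng(0).random(500), budget=10))
```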
https://arxiv.org/abs/2303.12760
Most recent 6D object pose estimation methods first use object detection to obtain 2D bounding boxes before actually regressing the pose. However, the general object detection methods they use are ill-suited to handle cluttered scenes, thus providing poor initializations for the subsequent pose network. To address this, we propose a rigidity-aware detection method exploiting the fact that, in 6D pose estimation, the target objects are rigid. This lets us introduce an approach to sampling positive object regions from the entire visible object area during training, instead of naively drawing samples from the bounding box center, where the object might be occluded. As such, every visible object part can contribute to the final bounding box prediction, yielding better detection robustness. Key to the success of our approach is a visibility map, which we propose to build using the minimum barrier distance between every pixel in the bounding box and the box boundary. Our results on seven challenging 6D pose estimation datasets show that our method outperforms general detection frameworks by a large margin. Furthermore, combined with a pose regression network, we obtain state-of-the-art pose estimation results on the challenging BOP benchmark.
https://arxiv.org/abs/2303.12396
The two-stage object pose estimation paradigm first detects semantic keypoints on the image and then estimates the 6D pose by minimizing reprojection errors. Despite performing well on standard benchmarks, existing techniques offer no provable guarantees on the quality and uncertainty of the estimation. In this paper, we inject two fundamental changes, namely conformal keypoint detection and geometric uncertainty propagation, into the two-stage paradigm and propose the first pose estimator that endows an estimation with provable and computable worst-case error bounds. On one hand, conformal keypoint detection applies the statistical machinery of inductive conformal prediction to convert heuristic keypoint detections into circular or elliptical prediction sets that cover the groundtruth keypoints with a user-specified marginal probability (e.g., 90%). Geometric uncertainty propagation, on the other, propagates the geometric constraints on the keypoints to the 6D object pose, leading to a Pose UnceRtainty SEt (PURSE) that guarantees coverage of the groundtruth pose with the same probability. The PURSE, however, is a nonconvex set that does not directly lead to estimated poses and uncertainties. Therefore, we develop RANdom SAmple averaGing (RANSAG) to compute an average pose and apply semidefinite relaxation to upper bound the worst-case errors between the average pose and the groundtruth. On the LineMOD Occlusion dataset we demonstrate: (i) the PURSE covers the groundtruth with valid probabilities; (ii) the worst-case error bounds provide correct uncertainty quantification; and (iii) the average pose achieves better or similar accuracy as representative methods based on sparse keypoints.
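The inductive conformal step is easy to state in code. A minimal sketch under simple assumptions (circular sets, a Euclidean nonconformity score): calibrate on held-out keypoint detections, take the finite-sample-corrected quantile of the scores, and use it as the prediction-set radius.

```python
import numpy as np

def conformal_radius(calib_pred, calib_gt, alpha=0.1):
    """calib_pred, calib_gt: (N, 2) detected vs. true keypoints on a
    calibration set. Returns a radius r such that a circle of radius r
    around a new detection covers the true keypoint with marginal
    probability >= 1 - alpha."""
    scores = np.linalg.norm(calib_pred - calib_gt, axis=1)   # nonconformity
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n                   # finite-sample correction
    return np.quantile(scores, min(q, 1.0))

rng = np.random.default_rng(0)
gt = rng.uniform(0, 64, (500, 2))
pred = gt + rng.normal(0, 2.0, (500, 2))
print(conformal_radius(pred, gt, alpha=0.1))   # radius for ~90% coverage
```

Propagating these circular sets through the geometric constraints is what yields the PURSE with the same coverage guarantee.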
https://arxiv.org/abs/2303.12246
Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which, the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. The advanced motion capture systems solve the problem by placing dense physical markers on the body surface, which allows to extract realistic meshes from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface based on the large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes. Code is available at this https URL.
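The recovery step can be pictured as a linear map: if mesh vertices are (approximately) interpolations of the 64 virtual markers, a fixed weight matrix maps detected markers to the full mesh. The random matrix below is purely illustrative; in the paper the marker set and interpolation are learned from mocap data.

```python
import numpy as np

n_markers, n_vertices = 64, 6890          # 6890 matches SMPL's vertex count
rng = np.random.default_rng(0)
W = rng.random((n_vertices, n_markers))
W /= W.sum(axis=1, keepdims=True)         # rows act as interpolation weights

markers = rng.random((n_markers, 3))      # detected 3D virtual markers
vertices = W @ markers                    # (n_vertices, 3) reconstructed mesh
```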
https://arxiv.org/abs/2303.11726
Modern robotic platforms need a reliable localization system to operate daily beside humans. Simple pose estimation algorithms based on filtered wheel and inertial odometry often fail in the presence of abrupt kinematic changes and wheel slips. Moreover, despite the recent success of visual odometry, service and assistive robotic tasks often present challenging environmental conditions where visual-based solutions fail due to poor lighting or repetitive feature patterns. In this work, we propose an innovative online learning approach for wheel odometry correction, paving the way for a robust multi-source localization system. An efficient attention-based neural network architecture is studied to combine precise performance with real-time inference. The proposed solution shows remarkable results compared to a standard neural network and to filter-based odometry correction algorithms. Moreover, the online learning paradigm avoids the time-consuming data collection procedure and can be adopted on a generic robotic platform on-the-fly.
https://arxiv.org/abs/2303.11725
This paper presents a novel approach for estimating human body shape and pose from monocular images that effectively addresses the challenges of occlusions and depth ambiguity. Our proposed method BoPR, the Body-aware Part Regressor, first extracts features of both the body and part regions using an attention-guided mechanism. We then utilize these features to encode extra part-body dependency for per-part regression, with part features as queries and body feature as a reference. This allows our network to infer the spatial relationship of occluded parts with the body by leveraging visible parts and body reference information. Our method outperforms existing state-of-the-art methods on two benchmark datasets, and our experiments show that it significantly surpasses existing methods in terms of depth ambiguity and occlusion handling. These results provide strong evidence of the effectiveness of our approach.
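A hedged sketch of the dependency-encoding step using standard cross-attention: part features act as queries against the body feature as key and value, so each per-part regressor can read body-level context. The dimensions and residual connection are assumptions.

```python
import torch
import torch.nn as nn

parts, dim = 24, 256
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

part_feats = torch.randn(2, parts, dim)   # queries: one feature per part region
body_feat = torch.randn(2, 1, dim)        # reference: global body feature
ctx, _ = attn(part_feats, body_feat, body_feat)
part_feats = part_feats + ctx             # parts enriched with body context
# each enriched part feature would then feed a per-part pose/shape regressor
```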
https://arxiv.org/abs/2303.11675
Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While easy for data processing, such representations admit unrealistic pose estimates because of the lack of dependency modeling between the body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by M discrete tokens, each characterizing a sub-structure with several interdependent joints. The compositional design enables it to achieve a small reconstruction error at a low cost. We then cast pose estimation as a classification task: we learn a classifier to predict the categories of the M tokens from an image, and a pre-learned decoder network recovers the pose from the tokens without further post-processing. We show that it achieves pose estimation results better than or comparable to existing methods in general scenarios, yet continues to work well when occlusion occurs, which is ubiquitous in practice. The code and models are publicly available at this https URL.
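The classify-then-decode pipeline is simple to sketch: predict M categorical distributions from the image feature, take the argmax per token, look up the token embeddings, and decode them to joint coordinates. All sizes below are illustrative assumptions; the real codebook and decoder are pre-learned.

```python
import torch
import torch.nn as nn

M, vocab, feat_dim, joints = 34, 2048, 512, 17

classifier = nn.Linear(feat_dim, M * vocab)    # predicts categories of M tokens
codebook = nn.Embedding(vocab, 64)             # pre-learned token embeddings
decoder = nn.Linear(M * 64, joints * 2)        # pre-learned pose decoder

img_feat = torch.randn(8, feat_dim)
logits = classifier(img_feat).view(8, M, vocab)
tokens = logits.argmax(dim=-1)                 # (8, M) discrete pose tokens
pose = decoder(codebook(tokens).flatten(1)).view(8, joints, 2)
```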
https://arxiv.org/abs/2303.11638
In this paper, a novel Diffusion-based 3D Pose estimation (D3DP) method with Joint-wise reProjection-based Multi-hypothesis Aggregation (JPMA) is proposed for probabilistic 3D human pose estimation. On the one hand, D3DP generates multiple possible 3D pose hypotheses for a single 2D observation. It gradually diffuses the ground-truth 3D poses to a random distribution and learns a denoiser conditioned on 2D keypoints to recover the uncontaminated 3D poses. The proposed D3DP is compatible with existing 3D pose estimators and allows users to balance efficiency and accuracy during inference through two customizable parameters. On the other hand, JPMA is proposed to assemble the multiple hypotheses generated by D3DP into a single 3D pose for practical use. It reprojects the 3D pose hypotheses to the 2D camera plane, selects the best hypothesis joint-by-joint based on the reprojection errors, and combines the selected joints into the final pose. The proposed JPMA conducts aggregation at the joint level and makes use of 2D prior information, both of which have been overlooked by previous approaches. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the state-of-the-art deterministic and probabilistic approaches by 1.5% and 8.9%, respectively. Code is available at this https URL.
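The JPMA step itself is compact: project every hypothesis into the image, score each joint by its reprojection error against the 2D keypoints, and assemble the output from the per-joint argmin. A pinhole camera is assumed below for illustration.

```python
import numpy as np

def jpma(hyps, kp2d, K):
    """hyps: (H, J, 3) camera-space 3D pose hypotheses;
    kp2d: (J, 2) detected 2D keypoints; K: (3, 3) intrinsics."""
    proj = hyps @ K.T                            # (H, J, 3)
    proj = proj[..., :2] / proj[..., 2:3]        # perspective divide
    err = np.linalg.norm(proj - kp2d, axis=-1)   # (H, J) per-joint errors
    best = err.argmin(axis=0)                    # best hypothesis per joint
    return hyps[best, np.arange(hyps.shape[1])]  # (J, 3) assembled pose

H, J = 20, 17
K = np.array([[1000.0, 0, 500], [0, 1000.0, 500], [0, 0, 1]])
hyps = np.random.default_rng(0).random((H, J, 3)) + np.array([0, 0, 3.0])
kp2d = np.random.default_rng(1).random((J, 2)) * 1000
print(jpma(hyps, kp2d, K).shape)   # (17, 3)
```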
https://arxiv.org/abs/2303.11579
Most modern image-based 6D object pose estimation methods learn to predict 2D-3D correspondences, from which the pose can be obtained using a PnP solver. Because of the non-differentiable nature of common PnP solvers, these methods are supervised via the individual correspondences. To address this, several methods have designed differentiable PnP strategies, thus imposing supervision on the pose obtained after the PnP step. Here, we argue that this conflicts with the averaging nature of the PnP problem, leading to gradients that may encourage the network to degrade the accuracy of individual correspondences. To address this, we derive a loss function that exploits the ground truth pose before solving the PnP problem. Specifically, we linearize the PnP solver around the ground-truth pose and compute the covariance of the resulting pose distribution. We then define our loss based on the diagonal covariance elements, which entails considering the final pose estimate yet not suffering from the PnP averaging issue. Our experiments show that our loss consistently improves the pose estimation accuracy for both dense and sparse correspondence based methods, achieving state-of-the-art results on both Linemod-Occluded and YCB-Video.
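A hedged sketch of the loss computation (value only; the full method also differentiates through the Jacobian to reach the correspondences, and would include per-correspondence weights): linearize the reprojection residuals at the ground-truth pose, invert the Gauss-Newton approximation to get a pose covariance, and sum its diagonal elements.

```python
import torch

def residuals(pose, x3d, x2d, K):
    """pose: (6,) axis-angle rotation + translation; returns flattened
    reprojection residuals (2N,)."""
    r, t = pose[:3], pose[3:]
    theta = r.norm() + 1e-9
    k = r / theta
    z = torch.zeros((), dtype=pose.dtype)
    Kx = torch.stack([torch.stack([z, -k[2], k[1]]),
                      torch.stack([k[2], z, -k[0]]),
                      torch.stack([-k[1], k[0], z])])  # Rodrigues' formula
    R = torch.eye(3) + torch.sin(theta) * Kx + (1 - torch.cos(theta)) * Kx @ Kx
    cam = x3d @ R.T + t
    proj = cam @ K.T
    return (proj[:, :2] / proj[:, 2:3] - x2d).reshape(-1)

def covariance_loss(pose_gt, x3d, x2d, K):
    J = torch.autograd.functional.jacobian(
        lambda p: residuals(p, x3d, x2d, K), pose_gt)       # (2N, 6)
    cov = torch.linalg.inv(J.T @ J + 1e-6 * torch.eye(6))   # linearized covariance
    return torch.diagonal(cov).sum()                        # diagonal elements only

x3d = torch.randn(50, 3)
x2d = torch.rand(50, 2) * 640
K = torch.tensor([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
pose_gt = torch.tensor([0.1, -0.2, 0.05, 0.0, 0.0, 4.0])
print(covariance_loss(pose_gt, x3d, x2d, K))
```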
https://arxiv.org/abs/2303.11516
The Optical Image Stabilization (OIS) system in mobile devices reduces image blur by steering the lens to compensate for hand jitter. However, OIS changes the intrinsic camera parameters (i.e. the $\mathrm{K}$ matrix) dynamically, which hinders accurate camera pose estimation or 3D reconstruction. Here we propose a novel neural network-based approach that estimates the $\mathrm{K}$ matrix in real-time, so that pose estimation or scene reconstruction can be run at the camera's native resolution for the highest accuracy on mobile devices. Our network design takes a gridified projection-model discrepancy feature and 3D point positions as inputs and employs a Multi-Layer Perceptron (MLP) to approximate the $f_{\mathrm{K}}$ manifold. We also design a unique training scheme for this network by introducing a Back-propagated PnP (BPnP) layer so that the reprojection error can be adopted as the loss function. The training process utilizes precise calibration patterns for capturing the accurate $f_{\mathrm{K}}$ manifold, but the trained network can be used anywhere. We name the proposed Dynamic Intrinsic Manifold Estimation network DIME-Net and have implemented and tested it on three different mobile devices. In all cases, DIME-Net reduces the reprojection error by at least $64\%$, indicating that our design is successful.
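A hedged sketch of the core regression: an MLP maps an input feature to the four pinhole intrinsics, and a reprojection error against observed pixels serves as the loss (here with a known pose baked into the camera-frame points; in the paper, the BPnP layer supplies the differentiable pose step). The feature size and layer widths are assumptions.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 4))  # fx, fy, cx, cy

def reproject(pts_cam, k):
    """Project camera-frame points with predicted intrinsics k = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = k
    u = fx * pts_cam[:, 0] / pts_cam[:, 2] + cx
    v = fy * pts_cam[:, 1] / pts_cam[:, 2] + cy
    return torch.stack([u, v], dim=-1)

feat = torch.randn(32)                                        # discrepancy input feature
pts_cam = torch.rand(50, 3) + torch.tensor([0.0, 0.0, 2.0])   # calibration points
obs = torch.rand(50, 2) * 640                                 # observed pixel locations
k = mlp(feat)
loss = (reproject(pts_cam, k) - obs).pow(2).mean()            # reprojection-error loss
loss.backward()
```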
https://arxiv.org/abs/2303.11307
A central challenge in human pose estimation, as well as in many other machine learning and prediction tasks, is the generalization problem. The learned network does not have the capability to characterize the prediction error, generate feedback information from the test sample, and correct the prediction error on the fly for each individual test sample, which results in degraded generalization performance. In this work, we introduce a self-correctable and adaptable inference (SCAI) method to address this generalization challenge of network prediction, and use human pose estimation as an example to demonstrate its effectiveness and performance. We learn a correction network to correct the prediction result conditioned on a fitness feedback error. This feedback error is generated by a learned fitness feedback network, which maps the prediction result back to the original input domain and compares it against the original input. Interestingly, we find that this self-referential feedback error is highly correlated with the actual prediction error. This strong correlation suggests that we can use this error as feedback to guide the correction process. It can also be used as a loss function to quickly adapt and optimize the correction network during the inference process. Our extensive experimental results on human pose estimation demonstrate that the proposed SCAI method significantly improves the generalization capability and performance of human pose estimation.
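The inference-time behavior can be sketched in a few lines: the feedback network maps the prediction back to the input domain, and its discrepancy with the actual input is used as a loss to adapt the corrector on the fly. All three networks below are placeholders for illustration.

```python
import torch
import torch.nn as nn

predictor = nn.Linear(64, 32)   # frozen prediction network (placeholder)
corrector = nn.Linear(32, 32)   # correction network, adapted at test time
feedback = nn.Linear(32, 64)    # fitness feedback net: prediction -> input domain

x = torch.randn(16, 64)         # a test sample (no labels available)
opt = torch.optim.SGD(corrector.parameters(), lr=1e-2)
for _ in range(5):              # a few self-correction steps per sample
    pred = corrector(predictor(x))
    loss = (feedback(pred) - x).pow(2).mean()   # self-referential feedback error
    opt.zero_grad()
    loss.backward()
    opt.step()
```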
https://arxiv.org/abs/2303.11180
Localization of magnetically actuated medical robots is essential for accurate actuation, closed-loop control, and delivery of functionality. Despite extensive progress in the use of magnetic field and inertial measurements for pose estimation, prior work has considered only single external permanent magnet actuation or coil systems. With the advent of new magnetic actuation systems comprising multiple external permanent magnets for increased control and manipulability, new localization techniques are necessary to account for and leverage the additional magnetic field sources. In this letter, we introduce a novel magnetic localization technique in the Special Euclidean Group SE(3) for multiple external permanent magnet actuation and control systems. The method relies on a millimeter-scale three-dimensional accelerometer and a three-dimensional magnetic field sensor and is able to estimate the full 6 degree-of-freedom pose without any prior pose information. We demonstrated the localization system with two external permanent magnets and achieved localization errors of 8.5 ± 2.4 mm in position norm and 3.7° ± 3.6° in orientation, across a cubic workspace with 20 cm side length.
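The measurement model behind this kind of localization is the point-dipole field; a sketch of summing the fields of multiple external magnets at the onboard sensor is below (SI units; the magnet positions and moments are illustrative, and real systems additionally use the accelerometer's gravity reading to resolve the remaining degrees of freedom).

```python
import numpy as np

MU0 = 4e-7 * np.pi   # vacuum permeability (T*m/A)

def dipole_field(p_sensor, p_magnet, m):
    """Magnetic flux density (T) at p_sensor from a point dipole of
    moment m (A*m^2) located at p_magnet."""
    r = p_sensor - p_magnet
    d = np.linalg.norm(r)
    rhat = r / d
    return MU0 / (4 * np.pi * d ** 3) * (3 * np.dot(m, rhat) * rhat - m)

# total field at the robot's sensor from two external permanent magnets
magnets = [(np.array([0.2, 0.0, 0.0]), np.array([0.0, 0.0, 50.0])),
           (np.array([-0.2, 0.0, 0.0]), np.array([0.0, 0.0, 50.0]))]
sensor = np.array([0.0, 0.05, 0.0])
B = sum(dipole_field(sensor, p, m) for p, m in magnets)
print(B)
```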
https://arxiv.org/abs/2303.11059
Markerless motion capture using computer vision and human pose estimation (HPE) has the potential to expand access to precise movement analysis. This could greatly benefit rehabilitation by enabling more accurate tracking of outcomes and providing more sensitive tools for research. There are numerous steps between obtaining videos and extracting accurate biomechanical results, and limited research to guide the many critical design decisions in these pipelines. In this work, we analyze several of these steps, including the algorithm used to detect keypoints and the keypoint set, the approach to reconstructing trajectories for biomechanical inverse kinematics (IK), and the optimization of the IK process. Several features we find important are: 1) using a recent algorithm trained on many datasets that produces a dense set of biomechanically-motivated keypoints, 2) using an implicit representation to reconstruct smooth, anatomically constrained marker trajectories for IK, 3) iteratively optimizing the biomechanical model to match the dense markers, and 4) appropriate regularization of the IK process. Our pipeline makes it easy to obtain accurate biomechanical estimates of movement in a rehabilitation hospital.
https://arxiv.org/abs/2303.10654
Most learning-based approaches to category-level 6D pose estimation are designed around the normalized object coordinate space (NOCS). While successful, NOCS-based methods become inaccurate and less robust when handling objects of a category containing significant intra-category shape variations. This is because the object coordinates induced by global and rigid alignment of objects are semantically incoherent, making the coordinate regression hard to learn and generalize. We propose the Semantically-aware Object Coordinate Space (SOCS), built by warping-and-aligning the objects guided by a sparse set of keypoints with semantically meaningful correspondence. SOCS is semantically coherent: any point on the surface of an object can be mapped to a semantically meaningful location in SOCS, allowing for accurate pose and size estimation under large shape variations. To learn effective coordinate regression to SOCS, we propose a novel multi-scale coordinate-based attention network. Evaluations demonstrate that our method is easy to train, generalizes well to large intra-category shape variations, and is robust to inter-object occlusions.
https://arxiv.org/abs/2303.10346
Despite their potential, markerless hand tracking technologies are not yet applied in practice to the diagnosis or monitoring of disease activity in inflammatory musculoskeletal diseases. One reason is that the focus of most methods lies in the reconstruction of coarse, plausible poses for gesture recognition or AR/VR applications, whereas the clinical context requires accurate, interpretable, and reliable results. Therefore, we propose ShaRPy, the first RGB-D Shape Reconstruction and hand Pose tracking system, which provides uncertainty estimates of the computed pose to guide clinical decision-making. Our method requires only a light-weight setup with a single consumer-level RGB-D camera, yet it is able to distinguish similar poses with only small joint angle deviations. This is achieved by combining a data-driven dense correspondence predictor with traditional energy minimization, optimizing for both pose and hand shape parameters. We evaluate ShaRPy on a keypoint detection benchmark and show qualitative results on recordings of a patient.
https://arxiv.org/abs/2303.10042
Event cameras, as emerging biologically-inspired vision sensors for capturing motion dynamics, present new potential for 3D human pose tracking, i.e., video-based 3D human pose estimation. However, existing works on pose tracking either require additional gray-scale images to establish a solid starting pose, or ignore the temporal dependencies altogether by collapsing segments of event streams into static image frames. Meanwhile, although the effectiveness of Artificial Neural Networks (ANNs, a.k.a. dense deep learning) has been showcased in many event-based tasks, the use of ANNs tends to neglect the fact that, compared to dense frame-based image sequences, the occurrence of events from an event camera is spatiotemporally much sparser. Motivated by these issues, we present a dedicated end-to-end sparse deep learning approach for event-based pose tracking: 1) to our knowledge, this is the first time that 3D human pose tracking is obtained from events only, eliminating the need for any frame-based images as part of the input; 2) our approach is based entirely upon the framework of Spiking Neural Networks (SNNs), consisting of Spike-Element-Wise (SEW) ResNet and our proposed spiking spatiotemporal transformer; 3) a large-scale synthetic dataset, named SynEventHPD, is constructed that features a broad and diverse set of annotated 3D human motions as well as longer hours of event stream data. Empirical experiments demonstrate the superiority of our approach in both performance and efficiency. For example, with performance comparable to the state-of-the-art ANN counterparts, our approach achieves a 20% reduction in FLOPs. Our implementation is made available at this https URL and the dataset will be released upon paper acceptance.
https://arxiv.org/abs/2303.09681