Recent advances in robotics are pushing real-world autonomy, enabling robots to perform long-term and large-scale missions. A crucial component for successful missions is the incorporation of loop closures through place recognition, which effectively mitigates accumulated pose estimation drift. Despite computational advancements, optimizing performance for real-time deployment remains challenging, especially in resource-constrained mobile robots and multi-robot systems, since conventional keyframe sampling practices in place recognition often retain redundant information or overlook relevant data by relying on fixed sampling intervals or working directly in 3D space instead of the feature space. To address these concerns, we introduce the concept of sample space in place recognition and demonstrate how different sampling techniques affect the query process and overall performance. We then present a novel keyframe sampling approach for LiDAR-based place recognition, which focuses on redundancy minimization and information preservation in the hyper-dimensional descriptor space. This approach is applicable to both learning-based and handcrafted descriptors, and through experimental validation across multiple datasets and descriptor frameworks, we demonstrate the effectiveness of our proposed method, showing it can jointly minimize redundancy and preserve essential information in real time. The proposed approach maintains robust performance across various datasets without requiring parameter tuning, contributing to more efficient and reliable place recognition for a wide range of robotic applications.
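The abstract does not spell out the sampling rule, so purely as a hedged illustration of redundancy minimization in descriptor space, here is a minimal greedy sketch: a frame is kept only when its descriptor lies sufficiently far from every descriptor already retained. The distance metric and threshold are assumptions, not the authors' algorithm.

```python
import numpy as np

def select_keyframes(descriptors: np.ndarray, min_dist: float) -> list[int]:
    """Greedy redundancy-minimizing keyframe selection in descriptor space.

    Keeps frame i only if its descriptor is at least `min_dist` (Euclidean)
    away from every retained descriptor. Illustrative stand-in only.
    """
    kept: list[int] = []
    for i, d in enumerate(descriptors):
        if not kept or min(np.linalg.norm(d - descriptors[j]) for j in kept) >= min_dist:
            kept.append(i)
    return kept

# Example: 1000 frames with unit-normalized 256-D place-recognition descriptors.
rng = np.random.default_rng(0)
descs = rng.normal(size=(1000, 256)).astype(np.float32)
descs /= np.linalg.norm(descs, axis=1, keepdims=True)
print(len(select_keyframes(descs, min_dist=1.2)), "keyframes retained")
```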
https://arxiv.org/abs/2410.02643
Detecting 3D keypoints with semantic consistency is widely used in many scenarios such as pose estimation, shape registration and robotics. Currently, most unsupervised 3D keypoint detection methods focus on rigid-body objects. However, when faced with deformable objects, the keypoints they identify do not preserve semantic consistency well. In this paper, we introduce Key-Grid, an innovative unsupervised keypoint detector for both rigid-body and deformable objects, built on an autoencoder framework. The encoder predicts keypoints and the decoder utilizes the generated keypoints to reconstruct the objects. Unlike previous work, we leverage the identified keypoint information to form a 3D grid feature heatmap, called the grid heatmap, which is used in the decoder section. The grid heatmap is a novel concept that represents latent variables for grid points sampled uniformly in the 3D cubic space, where these variables are the shortest distance between the grid points and the skeleton connected by keypoint pairs. Meanwhile, we incorporate the information from each layer of the encoder into the decoder section. We conduct an extensive evaluation of Key-Grid on a list of benchmark datasets. Key-Grid achieves state-of-the-art performance on the semantic consistency and position accuracy of keypoints. Moreover, we demonstrate the robustness of Key-Grid to noise and downsampling. In addition, we achieve SE(3) invariance of keypoints through generalizing Key-Grid to an SE(3)-invariant backbone.
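To make the grid-heatmap value concrete: for each uniformly sampled grid point, it stores the shortest distance to the skeleton segments connecting keypoint pairs. A minimal sketch of that point-to-segment computation follows; the paper's normalization and how the decoder consumes it may differ.

```python
import numpy as np

def grid_heatmap(grid: np.ndarray, keypoints: np.ndarray,
                 pairs: list[tuple[int, int]]) -> np.ndarray:
    """Shortest distance from each grid point to the skeleton formed by
    keypoint-pair segments (sketch of the grid-heatmap idea).

    grid:      (G, 3) points sampled uniformly in the cube
    keypoints: (K, 3) predicted keypoints
    pairs:     index pairs defining skeleton segments
    """
    dists = np.full(len(grid), np.inf)
    for a, b in pairs:
        p, q = keypoints[a], keypoints[b]
        seg = q - p
        # Projection parameter of each grid point onto the segment, clamped to [0, 1].
        t = np.clip((grid - p) @ seg / (seg @ seg + 1e-12), 0.0, 1.0)
        closest = p + t[:, None] * seg
        dists = np.minimum(dists, np.linalg.norm(grid - closest, axis=1))
    return dists
```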
https://arxiv.org/abs/2410.02237
LiDAR bundle adjustment (BA) is an effective approach to reduce the drifts in pose estimation from the front-end. Existing works on LiDAR BA usually rely on predefined geometric features for landmark representation. This reliance restricts generalizability, as the system will inevitably deteriorate in environments where these specific features are absent. To address this issue, we propose SGBA, a LiDAR BA scheme that models the environment as a semantic Gaussian mixture model (GMM) without predefined feature types. This approach encodes both geometric and semantic information, offering a comprehensive and general representation adaptable to various environments. Additionally, to limit computational complexity while ensuring generalizability, we propose an adaptive semantic selection framework that selects the most informative semantic clusters for optimization by evaluating the condition number of the cost function. Lastly, we introduce a probabilistic feature association scheme that considers the entire probability density of assignments, which can manage uncertainties in measurement and initial pose estimation. We have conducted various experiments and the results demonstrate that SGBA can achieve accurate and robust pose refinement even in challenging scenarios with low-quality initial pose estimation and limited geometric features. We plan to open-source the work for the benefit of the community at this https URL.
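As a rough, hypothetical sketch of the condition-number criterion, one can greedily add semantic clusters while the Gauss-Newton approximation H = JᵀJ of the cost stays well conditioned; the paper's actual cost function and selection order may differ.

```python
import numpy as np

def select_clusters(jacobians: dict[int, np.ndarray], max_cond: float = 1e6) -> list[int]:
    """Greedily accept semantic clusters while the 6x6 Gauss-Newton Hessian
    of the stacked residuals remains well conditioned (illustrative only).

    jacobians: cluster id -> (m_i, 6) Jacobian of its residuals w.r.t. pose
    """
    chosen: list[int] = []
    H = np.zeros((6, 6))
    # Visit clusters with the most measurements first (assumed ordering).
    for cid in sorted(jacobians, key=lambda c: -len(jacobians[c])):
        J = jacobians[cid]
        H_try = H + J.T @ J
        if np.linalg.matrix_rank(H_try) == 6 and np.linalg.cond(H_try) <= max_cond:
            H = H_try
            chosen.append(cid)
    return chosen
```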
https://arxiv.org/abs/2410.01618
Surgery monitoring in Mixed Reality (MR) environments has recently received substantial focus due to its importance in image-based decisions, skill assessment, and robot-assisted surgery. Tracking hands and articulated surgical instruments is crucial for the success of these applications. Due to the lack of annotated datasets and the complexity of the task, only a few works have addressed this problem. In this work, we present SurgeoNet, a real-time neural network pipeline to accurately detect and track surgical instruments from a stereo VR view. Our multi-stage approach is inspired by state-of-the-art neural-network architectural design, like YOLO and Transformers. We demonstrate the generalization capabilities of SurgeoNet in challenging real-world scenarios, achieved solely through training on synthetic data. The approach can be easily extended to any new set of articulated surgical instruments. SurgeoNet's code and data are publicly available.
https://arxiv.org/abs/2410.01293
We present an approach for pose and burial fraction estimation of debris field barrels found on the seabed in the Southern California San Pedro Basin. Our computational workflow leverages recent advances in foundation models for segmentation and a vision transformer-based approach to estimate the point cloud which defines the geometry of the barrel. We propose BarrelNet for estimating the 6-DOF pose and radius of buried barrels from the barrel point clouds as input. We train BarrelNet using synthetically generated barrel point clouds, and qualitatively demonstrate the potential of our approach using remotely operated vehicle (ROV) video footage of barrels found at a historic dump site. We compare our method to a traditional least squares fitting approach and show significant improvement according to our defined benchmarks.
https://arxiv.org/abs/2410.01061
Recent advancements in industrial anomaly detection have been hindered by the lack of realistic datasets that accurately represent real-world conditions. Existing algorithms are often developed and evaluated using idealized datasets, which deviate significantly from real-life scenarios characterized by environmental noise and data corruption such as fluctuating lighting conditions, variable object poses, and unstable camera positions. To address this gap, we introduce the Realistic Anomaly Detection (RAD) dataset, the first multi-view RGB-based anomaly detection dataset specifically collected using a real robot arm, providing unique and realistic data scenarios. RAD comprises 4765 images across 13 categories and 4 defect types, collected from more than 50 viewpoints, providing a comprehensive and realistic benchmark. This multi-viewpoint setup mirrors real-world conditions where anomalies may not be detectable from every perspective. Moreover, by sampling varying numbers of views, the algorithm's performance can be comprehensively evaluated across different viewpoints. This approach enhances the thoroughness of performance assessment and helps improve the algorithm's robustness. Besides, to support 3D multi-view reconstruction algorithms, we propose a data augmentation method to improve the accuracy of pose estimation and facilitate the reconstruction of 3D point clouds. We systematically evaluate state-of-the-art RGB-based and point cloud-based models using RAD, identifying limitations and future research directions. The code and dataset can be found at this https URL
https://arxiv.org/abs/2410.00713
Point cloud registration aims to provide estimated transformations to align point clouds, which plays a crucial role in pose estimation of various navigation systems, such as surgical guidance systems and autonomous vehicles. Despite the impressive performance of recent models on benchmark datasets, many rely on complex modules like KPConv and Transformers, which impose significant computational and memory demands. These requirements hinder their practical application, particularly in resource-constrained environments such as mobile robotics. In this paper, we propose a novel point cloud registration network that leverages a pure MLP architecture, constructing geometric information offline. This approach eliminates the computational and memory burdens associated with traditional complex feature extractors and significantly reduces inference time and resource consumption. Our method is the first to replace 3D coordinate inputs with offline-constructed geometric encoding, improving generalization and stability, as demonstrated by Maximum Mean Discrepancy (MMD) comparisons. This efficient and accurate geometric representation marks a significant advancement in point cloud analysis, particularly for applications requiring speed and reliability.
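For reference, the Maximum Mean Discrepancy used in such comparisons can be estimated with an RBF kernel as below; this is the standard biased estimator, shown only to make the metric concrete, and the paper's kernel and bandwidth choices may differ.

```python
import numpy as np

def mmd_rbf(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    """Squared Maximum Mean Discrepancy between samples X (n, d) and Y (m, d)
    under an RBF kernel (biased V-statistic estimator)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

# Example: compare two feature distributions.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 32))
b = rng.normal(0.5, 1.0, size=(200, 32))
print(f"MMD^2 = {mmd_rbf(a, b):.4f}")
```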
https://arxiv.org/abs/2410.00589
This paper reformulates cross-dataset human pose estimation as a continual learning task, aiming to integrate new keypoints and pose variations into existing models without losing accuracy on previously learned datasets. We benchmark this formulation against established regularization-based methods for mitigating catastrophic forgetting, including EWC, LFL, and LwF. Moreover, we propose a novel regularization method called Importance-Weighted Distillation (IWD), which enhances conventional LwF by introducing a layer-wise distillation penalty and dynamic temperature adjustment based on layer importance for previously learned knowledge. This allows for a controlled adaptation to new tasks that respects the stability-plasticity balance critical in continual learning. Through extensive experiments across three datasets, we demonstrate that our approach outperforms existing regularization-based continual learning strategies. IWD shows an average improvement of 3.60% over the state-of-the-art LwF method. The results highlight the potential of our method to serve as a robust framework for real-world applications where models must evolve with new data without forgetting past knowledge.
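A hedged sketch of the IWD loss shape: per-layer KL distillation weighted by layer importance, with the temperature scaled by that importance. The importance values and the specific temperature rule below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def iwd_loss(student_feats, teacher_feats, importance, base_T: float = 2.0):
    """Layer-wise distillation with importance-scaled temperature (sketch).

    student_feats / teacher_feats: lists of (B, C) layer activations
    importance: per-layer weights in [0, 1], higher = more critical
    """
    loss = torch.zeros((), device=student_feats[0].device)
    for s, t, w in zip(student_feats, teacher_feats, importance):
        T = base_T * (1.0 + w)          # assumed rule: hotter for important layers
        p_t = F.softmax(t / T, dim=-1)
        log_p_s = F.log_softmax(s / T, dim=-1)
        loss = loss + w * (T ** 2) * F.kl_div(log_p_s, p_t, reduction="batchmean")
    return loss / len(student_feats)
```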
https://arxiv.org/abs/2409.20469
We propose ClassroomKD, a novel multi-mentor knowledge distillation framework inspired by classroom environments to enhance knowledge transfer between a student and multiple mentors. Unlike traditional methods that rely on fixed mentor-student relationships, our framework dynamically selects and adapts the teaching strategies of diverse mentors based on their effectiveness for each data sample. ClassroomKD comprises two main modules: the Knowledge Filtering (KF) Module and the Mentoring Module. The KF Module dynamically ranks mentors based on their performance for each input, activating only high-quality mentors to minimize error accumulation and prevent information loss. The Mentoring Module adjusts the distillation strategy by tuning each mentor's influence according to the performance gap between the student and mentors, effectively modulating the learning pace. Extensive experiments on image classification (CIFAR-100 and ImageNet) and 2D human pose estimation (COCO Keypoints and MPII Human Pose) demonstrate that ClassroomKD significantly outperforms existing knowledge distillation methods. Our results highlight that a dynamic and adaptive approach to mentor selection and guidance leads to more effective knowledge transfer, paving the way for enhanced model performance through distillation.
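A speculative sketch of how the two modules could interact: mentors that outperform the student on a sample are activated, and their distillation signal is weighted by the performance gap. The confidence-based filtering rule and gap weighting below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def classroom_kd_loss(student_logits, mentor_logits_list, labels, T: float = 4.0):
    """Per-sample mentor filtering and gap-based weighting (sketch)."""
    s_conf = F.softmax(student_logits, -1).gather(1, labels[:, None]).squeeze(1)
    loss = torch.zeros((), device=student_logits.device)
    active = 0
    for m_logits in mentor_logits_list:
        m_conf = F.softmax(m_logits, -1).gather(1, labels[:, None]).squeeze(1)
        mask = (m_conf > s_conf).float()          # activate only better mentors
        gap = (m_conf - s_conf).clamp(min=0.0)    # pace set by the performance gap
        kd = F.kl_div(F.log_softmax(student_logits / T, -1),
                      F.softmax(m_logits / T, -1),
                      reduction="none").sum(-1)   # per-sample KL
        if mask.any():
            loss = loss + (mask * gap * kd).sum() / mask.sum()
            active += 1
    return (T ** 2) * loss / max(active, 1)
```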
https://arxiv.org/abs/2409.20237
Accurate camera calibration is a well-known and widely used task in computer vision that has been researched for decades. However, the standard approach based on checkerboard calibration patterns has some drawbacks that limit its applicability. For example, the calibration pattern must be completely visible without any occlusions. Alternative solutions such as ChArUco boards allow partial occlusions, but require a higher camera resolution due to the fine details of the position encoding. We present a new calibration pattern that combines the advantages of checkerboard calibration patterns with a lightweight position coding that can be decoded at very low resolutions. The decoding algorithm includes error correction and is computationally efficient. The whole approach is backward compatible with both checkerboard calibration patterns and several checkerboard calibration algorithms. Furthermore, the method can be used not only for camera calibration but also for camera pose estimation and marker-based object localization tasks.
https://arxiv.org/abs/2409.20127
3D Gaussian Splatting algorithms excel in novel view rendering applications and have been adapted to extend the capabilities of traditional SLAM systems. However, current Gaussian Splatting SLAM methods, designed mainly for hand-held RGB or RGB-D sensors, struggle with tracking drift when used with rotating RGB-D camera setups. In this paper, we propose a robust Gaussian Splatting SLAM architecture that utilizes inputs from multiple rotating RGB-D cameras to achieve accurate localization and photorealistic rendering performance. The carefully designed Gaussian Splatting Loop Closure module effectively addresses the issue of accumulated tracking and mapping errors found in conventional Gaussian Splatting SLAM systems. First, each Gaussian is associated with an anchor frame and categorized as historical or novel based on its timestamp. By rendering different types of Gaussians at the same viewpoint, the proposed loop detection strategy considers both co-visibility relationships and distinct rendering outcomes. Furthermore, a loop closure optimization approach is proposed to remove camera pose drift and maintain the high quality of 3D Gaussian models. The approach uses a lightweight pose graph optimization algorithm to correct pose drift and updates Gaussians based on the optimized poses. Additionally, a bundle adjustment scheme further refines camera poses using photometric and geometric constraints, ultimately enhancing the global consistency of the scene. Quantitative and qualitative evaluations on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art methods in camera pose estimation and novel view rendering tasks. The code will be open-sourced for the community.
https://arxiv.org/abs/2409.20111
We developed a robust solution for real-time 6D object detection in industrial applications by integrating FoundationPose, SAM2, and LightGlue, eliminating the need for retraining. Our approach addresses two key challenges: the requirement for an initial object mask in the first frame in FoundationPose and issues with tracking loss and automatic rotation for symmetric objects. The algorithm requires only a CAD model of the target object, with the user clicking on its location in the live feed during the initial setup. Once set, the algorithm automatically saves a reference image of the object and, in subsequent runs, employs LightGlue for feature matching between the object and the real-time scene, providing an initial prompt for detection. Tested on the YCB dataset and industrial components such as bleach cleanser and gears, the algorithm demonstrated reliable 6D detection and tracking. By integrating SAM2 and FoundationPose, we effectively mitigated common limitations such as the problem of tracking loss, ensuring continuous and accurate tracking under challenging conditions like occlusion or rapid movement.
https://arxiv.org/abs/2409.19986
We present Parametric Piecewise Linear Networks (PPLNs) for temporal vision inference. Motivated by the neuromorphic principles that regulate biological neural behaviors, PPLNs are ideal for processing data captured by event cameras, which are built to simulate neural activities in the human retina. We discuss how to represent the membrane potential of an artificial neuron by a parametric piecewise linear function with learnable coefficients. This design echoes the idea of building deep models from learnable parametric functions recently popularized by Kolmogorov-Arnold Networks (KANs). Experiments demonstrate the state-of-the-art performance of PPLNs in event-based and image-based vision applications, including steering prediction, human pose estimation, and motion deblurring. The source code of our implementation is available at this https URL.
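One way to realize a parametric piecewise linear function with learnable coefficients is a sum of hinge (ReLU) terms, as in the minimal sketch below; the paper's actual membrane-potential parameterization may differ.

```python
import torch
import torch.nn as nn

class PiecewiseLinear(nn.Module):
    """Parametric piecewise linear function with learnable breakpoints and
    slopes -- a minimal sketch of the membrane-potential idea, not the
    paper's full PPLN layer.
    """
    def __init__(self, n_pieces: int = 8):
        super().__init__()
        self.breaks = nn.Parameter(torch.linspace(-1.0, 1.0, n_pieces))
        self.slopes = nn.Parameter(torch.randn(n_pieces) * 0.1)
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # f(t) = b + sum_i a_i * relu(t - c_i): piecewise linear in t,
        # with learnable coefficients a_i, c_i, b.
        hinges = torch.relu(t[..., None] - self.breaks)
        return self.bias + (hinges * self.slopes).sum(-1)
```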
https://arxiv.org/abs/2409.19772
Tactile sensing provides robots with rich feedback during manipulation, enabling a host of perception and controls capabilities. Here, we present a new open-source, vision-based tactile sensor designed to promote reproducibility and accessibility across research and hobbyist communities. Building upon the GelSlim 3.0 sensor, our design features two key improvements: a simplified, modifiable finger structure and easily manufacturable lenses. To complement the hardware, we provide an open-source perception library that includes depth and shear field estimation algorithms to enable in-hand pose estimation, slip detection, and other manipulation tasks. Our sensor is accompanied by comprehensive manufacturing documentation, ensuring the design can be readily produced by users with varying levels of expertise. We validate the sensor's reproducibility through extensive human usability testing. For documentation, code, and data, please visit the project website: this https URL
https://arxiv.org/abs/2409.19770
Dashboard cameras (dashcams) record millions of driving videos daily, offering a valuable potential data source for various applications, including driving map production and updates. A necessary step for utilizing these dashcam data involves the estimation of camera poses. However, the low-quality images captured by dashcams, characterized by motion blur and dynamic objects, pose challenges for existing image-matching methods in accurately estimating camera poses. In this study, we propose a precise pose estimation method for dashcam images, leveraging the inherent camera motion prior. Typically, image sequences captured by dash cameras exhibit a pronounced motion prior, such as forward movement or lateral turns, which serves as an essential cue for correspondence estimation. Building upon this observation, we devise a pose regression module aimed at learning the camera motion prior, subsequently integrating this prior into both the correspondence and pose estimation processes. Experiments show that, on a real dashcam dataset, our method is 22% better than the baseline for pose estimation in AUC@5°, and it can estimate poses for 19% more images with less reprojection error in Structure from Motion (SfM).
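For context, AUC@5° is conventionally the area under the recall-vs-pose-error curve up to 5 degrees, as in relative-pose benchmarks; a standard implementation follows (the paper's exact protocol may differ).

```python
import numpy as np

def pose_auc(errors_deg: np.ndarray, threshold: float = 5.0) -> float:
    """Area under the recall-vs-error curve up to `threshold` degrees,
    normalized to [0, 1] (the usual AUC@5 convention)."""
    errs = np.sort(errors_deg)
    recall = (np.arange(len(errs)) + 1) / len(errs)
    errs = np.r_[0.0, errs]
    recall = np.r_[0.0, recall]
    last = np.searchsorted(errs, threshold)        # errors within the threshold
    r = np.r_[recall[:last], recall[last - 1]]     # hold recall flat to the cutoff
    e = np.r_[errs[:last], threshold]
    return float(np.trapz(r, x=e) / threshold)

print(pose_auc(np.array([0.5, 1.0, 2.0, 8.0])))   # three of four poses under 5 deg
```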
https://arxiv.org/abs/2409.18673
This paper addresses a special Perspective-n-Point (PnP) problem: estimating the optimal pose to align 3D and 2D shapes in real-time without correspondences, termed correspondence-free PnP. While several studies have focused on 3D and 2D shape registration, achieving both real-time and accurate performance remains challenging. This study specifically targets 3D-2D geometric shape registration tasks, applying the recently developed Reproducing Kernel Hilbert Space (RKHS) to address the "big-to-small" issue. An iterative reweighted least squares method is employed to solve the RKHS-based formulation efficiently. Moreover, our work identifies a unique and interesting observability issue in correspondence-free PnP: the numerical ambiguity between rotation and translation. To address this, we propose DynaWeightPnP, introducing a dynamic weighting sub-problem and an alternative searching algorithm designed to enhance pose estimation and alignment accuracy. Experiments were conducted on a typical case, namely a 3D-2D vascular centerline registration task within Endovascular Image-Guided Interventions (EIGIs). Results demonstrate that the proposed algorithm achieves registration processing rates of 60 Hz (without post-refinement) and 31 Hz (with post-refinement) on modern single-core CPUs, with accuracy comparable to existing methods. These results underscore the suitability of DynaWeightPnP for future robot navigation tasks like EIGIs.
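A generic sketch of the iterative reweighted least squares solver style, here with Gaussian (kernel-induced) residual weights; the paper's RKHS formulation and update are more involved, so this only illustrates the iteration pattern.

```python
import numpy as np

def irls(A: np.ndarray, b: np.ndarray, iters: int = 20, sigma: float = 1.0) -> np.ndarray:
    """Iteratively reweighted least squares for argmin_x sum_i w(r_i) r_i^2
    with r = A x - b and Gaussian residual weights (illustrative only)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]       # unweighted initialization
    for _ in range(iters):
        r = A @ x - b
        w = np.exp(-(r ** 2) / (2.0 * sigma ** 2))  # down-weight large residuals
        W = np.sqrt(w)[:, None]                     # sqrt weights for row scaling
        x = np.linalg.lstsq(W * A, W[:, 0] * b, rcond=None)[0]
    return x
```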
https://arxiv.org/abs/2409.18457
6D object pose estimation aims at determining an object's translation, rotation, and scale, typically from a single RGBD image. Recent advancements have expanded this estimation from instance-level to category-level, allowing models to generalize across unseen instances within the same category. However, this generalization is limited by the narrow range of categories covered by existing datasets, such as NOCS, which also tend to overlook common real-world challenges like occlusion. To tackle these challenges, we introduce Omni6D, a comprehensive RGBD dataset featuring a wide range of categories and varied backgrounds, elevating the task to a more realistic context. 1) The dataset comprises an extensive spectrum of 166 categories, 4688 instances adjusted to the canonical pose, and over 0.8 million captures, significantly broadening the scope for evaluation. 2) We introduce a symmetry-aware metric and conduct systematic benchmarks of existing algorithms on Omni6D, offering a thorough exploration of new challenges and insights. 3) Additionally, we propose an effective fine-tuning approach that adapts models from previous datasets to our extensive vocabulary setting. We believe this initiative will pave the way for new insights and substantial progress in both the industrial and academic fields, pushing forward the boundaries of general 6D pose estimation.
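A symmetry-aware pose metric typically takes the error minimized over the object's symmetry transforms; below is a minimal sketch for discrete rotational symmetries (the paper's metric may differ in detail).

```python
import numpy as np

def symmetry_aware_rot_error(R_pred: np.ndarray, R_gt: np.ndarray,
                             symmetries: list[np.ndarray]) -> float:
    """Rotation error (degrees) minimized over discrete symmetry rotations."""
    best = np.inf
    for S in symmetries:                           # S: 3x3 symmetry rotation
        R_delta = R_pred @ (R_gt @ S).T
        cos = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
        best = min(best, np.degrees(np.arccos(cos)))
    return float(best)

# Example: a 4-fold symmetry about the z-axis.
def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

syms = [rot_z(k * np.pi / 2) for k in range(4)]
print(symmetry_aware_rot_error(rot_z(np.pi / 2), np.eye(3), syms))  # ~0 degrees
```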
https://arxiv.org/abs/2409.18261
The integration of Artificial Intelligence (AI) and Augmented Reality (AR) is set to transform satellite Assembly, Integration, and Testing (AIT) processes by enhancing precision, minimizing human error, and improving operational efficiency in cleanroom environments. This paper presents a technical description of the European Space Agency's (ESA) project "AI for AR in Satellite AIT," which combines real-time computer vision and AR systems to assist technicians during satellite assembly. Leveraging Microsoft HoloLens 2 as the AR interface, the system delivers context-aware instructions and real-time feedback, tackling the complexities of object recognition and 6D pose estimation in AIT workflows. All AI models demonstrated over 70% accuracy, with the detection model exceeding 95% accuracy, indicating a high level of performance and reliability. A key contribution of this work lies in the effective use of synthetic data for training AI models in AR applications, addressing the significant challenges of obtaining real-world datasets in highly dynamic satellite environments, as well as the creation of the Segmented Anything Model for Automatic Labelling (SAMAL), which facilitates the automatic annotation of real data, achieving speeds up to 20 times faster than manual human annotation. The findings demonstrate the efficacy of AI-driven AR systems in automating critical satellite assembly tasks, setting a foundation for future innovations in the space industry.
https://arxiv.org/abs/2409.18101
The basic body shape of a person does not change within a single video. However, most SOTA human mesh estimation (HME) models output a slightly different body shape for each video frame, which results in inconsistent body shapes for the same person. In contrast, we leverage anthropometric measurements such as those tailors have been obtaining from humans for centuries. We create a model called A2B that converts such anthropometric measurements to body shape parameters of human mesh models. Moreover, we find that finetuned SOTA 3D human pose estimation (HPE) models outperform HME models regarding the precision of the estimated keypoints. We show that applying inverse kinematics (IK) to the results of such a 3D HPE model and combining the resulting body pose with the A2B body shape leads to superior and consistent human meshes for challenging datasets like ASPset or fit3D, where we can lower the MPJPE by over 30 mm compared to SOTA HME models. Further, replacing HME models' estimates of the body shape parameters with A2B model results not only increases the performance of these HME models, but also leads to consistent body shapes.
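The abstract does not specify A2B's architecture; purely as an illustrative sketch, a small MLP mapping anthropometric measurements to SMPL-style shape coefficients could look like this (the input/output sizes and layer widths are assumptions, not the paper's configuration).

```python
import torch
import torch.nn as nn

class A2B(nn.Module):
    """Maps anthropometric measurements (e.g. height, arm span,
    circumferences) to mesh-model shape parameters. The MLP layout and
    10-dim SMPL-style beta output are illustrative assumptions."""
    def __init__(self, n_measurements: int = 36, n_betas: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_measurements, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_betas),
        )

    def forward(self, measurements: torch.Tensor) -> torch.Tensor:
        return self.net(measurements)

betas = A2B()(torch.randn(1, 36))  # one person's measurements -> shape params
```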
https://arxiv.org/abs/2409.17671
Fruit monitoring plays an important role in crop management, and rising global fruit consumption combined with labor shortages necessitates automated monitoring with robots. However, occlusions from plant foliage often hinder accurate shape and pose estimation. Therefore, we propose an active fruit shape and pose estimation method that physically manipulates occluding leaves to reveal hidden fruits. This paper introduces a framework that plans robot actions to maximize visibility and minimize leaf damage. We developed a novel scene-consistent shape completion technique to improve fruit estimation under heavy occlusion and utilize a perception-driven deformation graph model to predict leaf deformation during planning. Experiments on artificial and real sweet pepper plants demonstrate that our method enables robots to safely move leaves aside, exposing fruits for accurate shape and pose estimation, outperforming baseline methods. Project page: this https URL.
https://arxiv.org/abs/2409.17389