Despite increasing research efforts on household robotics, robots intended for deployment in domestic settings still struggle with more complex tasks such as interacting with functional elements like drawers or light switches, largely due to limited task-specific understanding and interaction capabilities. These tasks require not only detection and pose estimation but also an understanding of the affordances these elements provide. To address these challenges and enhance robotic scene understanding, we introduce SpotLight: a comprehensive framework for robotic interaction with functional elements, specifically light switches. Furthermore, this framework enables robots to improve their environmental understanding through interaction. Leveraging VLM-based affordance prediction to estimate motion primitives for light switch interaction, we achieve up to 84% operation success in real-world experiments. We further introduce a specialized dataset containing 715 images as well as a custom detection model for light switches. We demonstrate how the framework can facilitate robot learning through physical interaction by having the robot explore the environment and discover previously unknown relationships in a scene graph representation. Lastly, we propose an extension to the framework to accommodate other functional interactions such as swing doors, showcasing its flexibility. Videos and Code: this http URL
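As a rough illustration of the detect-then-interact pipeline the abstract describes, the sketch below wires together a switch detection, a VLM affordance query, and a motion primitive. Every interface here (the detection fields, predict_affordance, the robot methods) is a hypothetical placeholder, not the SpotLight API.

```python
# Hypothetical sketch of a detect -> affordance -> motion-primitive pipeline;
# none of these interfaces are taken from the SpotLight codebase.
from dataclasses import dataclass

@dataclass
class SwitchDetection:
    box_xyxy: tuple        # 2D bounding box of the light switch in the image
    pose_cam: list         # estimated 6D pose of the switch in the camera frame

def predict_affordance(image_crop) -> str:
    """Placeholder for a VLM query such as
    'Is this a push button, rocker, or toggle switch?'"""
    return "push_button"   # assumed answer for illustration

MOTION_PRIMITIVES = {
    "push_button": {"approach_offset_m": 0.10, "press_depth_m": 0.010},
    "rocker":      {"approach_offset_m": 0.10, "press_depth_m": 0.005},
    "toggle":      {"approach_offset_m": 0.10, "press_depth_m": 0.015},
}

def interact_with_switch(image, det: SwitchDetection, robot):
    affordance = predict_affordance(image)        # in practice: crop to det.box_xyxy
    primitive = MOTION_PRIMITIVES[affordance]
    # Approach along the switch normal, then execute the press.
    robot.move_to(det.pose_cam, offset=primitive["approach_offset_m"])
    robot.press(depth=primitive["press_depth_m"])
```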
https://arxiv.org/abs/2409.11870
6D object pose estimation is the problem of identifying the position and orientation of an object relative to a chosen coordinate system, which is a core technology for modern XR applications. State-of-the-art 6D object pose estimators directly predict an object pose given an object observation. Due to the ill-posed nature of the pose estimation problem, where multiple different poses can correspond to a single observation, generating additional plausible estimates per observation can be valuable. To address this, we reformulate the state-of-the-art algorithm GDRNPP and introduce EPRO-GDR (End-to-End Probabilistic Geometry-Guided Regression). Instead of predicting a single pose per detection, we estimate a probability density distribution of the pose. Using the evaluation procedure defined by the BOP (Benchmark for 6D Object Pose Estimation) Challenge, we test our approach on four of its core datasets and demonstrate superior quantitative results for EPRO-GDR on LM-O, YCB-V, and ITODD. Our probabilistic solution shows that predicting a pose distribution instead of a single pose can improve state-of-the-art single-view pose estimation while providing the additional benefit of being able to sample multiple meaningful pose candidates.
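To make "a probability density over poses" concrete, here is a small illustrative sketch (not EPRO-GDR's actual parametrization): translation modeled as an axis-wise Gaussian and rotation as a small-angle Gaussian perturbation of a mean rotation, from which several plausible candidates are sampled.

```python
# Illustrative only: one simple way to represent and sample a pose distribution.
# EPRO-GDR's actual probabilistic parametrization may differ.
import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)

# Assumed network outputs: a mean pose plus uncertainty parameters.
t_mean = np.array([0.05, -0.02, 0.60])        # meters, camera frame
t_std = np.array([0.005, 0.005, 0.020])       # per-axis standard deviation
R_mean = R.from_euler("xyz", [10, -5, 30], degrees=True)
rot_std_rad = 0.03                            # isotropic rotational spread

def sample_poses(n=5):
    poses = []
    for _ in range(n):
        t = t_mean + rng.normal(0.0, t_std)
        # Small-angle rotational perturbation applied to the mean rotation.
        dR = R.from_rotvec(rng.normal(0.0, rot_std_rad, size=3))
        poses.append((dR * R_mean, t))
    return poses

for R_i, t_i in sample_poses():
    print(R_i.as_quat(), t_i)                 # multiple meaningful pose candidates
```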
https://arxiv.org/abs/2409.11819
This work presents Spacecraft Pose Network v3 (SPNv3), a Neural Network (NN) for monocular pose estimation of a known, non-cooperative target spacecraft. As opposed to existing literature, SPNv3 is designed and trained to be computationally efficient while providing robustness to spaceborne images that have not been observed during offline training and validation on the ground. These characteristics are essential to deploying NNs on space-grade edge devices. They are achieved through careful NN design choices, and an extensive trade-off analysis reveals that data augmentation, transfer learning, and a vision transformer architecture are among the features that contribute to simultaneously maximizing robustness and minimizing computational overhead. Experiments demonstrate that the final SPNv3 can achieve state-of-the-art pose accuracy on hardware-in-the-loop images from a robotic testbed while having trained exclusively on computer-generated synthetic images, effectively bridging the domain gap between synthetic and real imagery. At the same time, SPNv3 runs well above the update frequency of modern satellite navigation filters when tested on a representative graphical processing unit system with flight heritage. Overall, SPNv3 is an efficient, flight-ready NN model readily applicable to a wide range of close-range rendezvous and proximity operations with target resident space objects. The code implementation of SPNv3 will be made publicly available.
https://arxiv.org/abs/2409.11661
In this paper, we present a novel method for self-supervised fine-tuning of pose estimation for bin-picking. Leveraging zero-shot pose estimation, our approach enables the robot to automatically obtain training data without manual labeling. After pose estimation, the object is grasped, and in-hand pose estimation is used for data validation. Our pipeline allows the system to fine-tune while the process is running, removing the need for a dedicated learning phase. The motivation behind our work lies in the need for rapid setup of pose estimation solutions. Specifically, we address the challenging task of bin picking, which plays a pivotal role in flexible robotic setups. Our method is implemented on a robotics work-cell and tested with four different objects. For all objects, our method increases performance and outperforms a state-of-the-art method trained on the CAD models of the objects.
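A hedged sketch of the kind of online loop the abstract outlines: estimate, grasp, verify with in-hand pose estimation, and fine-tune on the samples that pass validation. The robot, camera, and estimator interfaces are invented for illustration; only the pose-agreement check is fully spelled out.

```python
# Sketch of a self-supervised fine-tuning loop for bin picking; the robot,
# camera, and estimator objects are hypothetical stand-ins.
import numpy as np

def pose_agreement(T_pred, T_meas, rot_tol_deg=10.0, trans_tol_m=0.01):
    """Accept a pseudo-label only if the bin-pose prediction and the in-hand
    measurement agree within tolerance."""
    dT = np.linalg.inv(T_pred) @ T_meas
    trans_err = np.linalg.norm(dT[:3, 3])
    cos_angle = np.clip((np.trace(dT[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos_angle))
    return trans_err < trans_tol_m and rot_err < rot_tol_deg

def run_cell(robot, camera, estimator, buffer):
    while robot.bin_not_empty():
        image = camera.capture()
        T_obj = estimator.predict(image)            # zero-shot pose estimate
        T_grasp = robot.grasp(T_obj)                # object-to-gripper transform
        T_in_hand = robot.in_hand_pose()            # second, in-hand estimate
        if pose_agreement(T_obj @ T_grasp, T_in_hand):
            buffer.append((image, T_obj))           # keep as a training sample
        if len(buffer) >= 32:
            estimator.fine_tune(buffer)             # online update, no offline phase
            buffer.clear()
        robot.place()
```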
https://arxiv.org/abs/2409.11512
Vision-Based Navigation consists of using cameras as precision sensors for GNC by extracting information from images. To enable the adoption of machine learning for space applications, one of the obstacles is demonstrating that available training datasets are adequate to validate the algorithms. The objective of the study is to generate datasets of images and metadata suitable for training machine learning algorithms. Two use cases were selected, and a robust methodology was developed to validate the datasets, including the ground truth. The first use case is in-orbit rendezvous with a man-made object: a mockup of the satellite ENVISAT. The second use case is a Lunar landing scenario. Datasets were produced from archival data (Chang'e 3), from the laboratory at the DLR TRON facility and at the Airbus Robotic laboratory, from the SurRender high-fidelity image simulator software using Model Capture, and from Generative Adversarial Networks. The use case definition included the selection of benchmark algorithms: an AI-based pose estimation algorithm and a dense optical flow algorithm were selected. Eventually, it is demonstrated that datasets produced with SurRender and the selected laboratory facilities are adequate to train machine learning algorithms.
https://arxiv.org/abs/2409.11383
In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks by transforming them into image generation tasks, such as edge detection and human pose recognition. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly compared to existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues. We will open-source the related resources at this https URL to foster advancements in this field.
https://arxiv.org/abs/2409.11340
Despite the recent advances in computer vision research, estimating the 3D human pose from single RGB images remains a challenging task, as multiple 3D poses can correspond to the same 2D projection on the image. In this context, depth data could help to disambiguate the 2D information by providing additional constraints about the distance between objects in the scene and the camera. Unfortunately, the acquisition of accurate depth data is limited to indoor spaces and usually is tied to specific depth technologies and devices, thus limiting generalization capabilities. In this paper, we propose a method able to leverage the benefits of depth information without compromising its broader applicability and adaptability in a predominantly RGB-camera-centric landscape. Our approach consists of a heatmap-based 3D pose estimator that, leveraging the paradigm of Privileged Information, is able to hallucinate depth information from the RGB frames given at inference time. More precisely, depth information is used exclusively during training by enforcing our RGB-based hallucination network to learn similar features to a backbone pre-trained only on depth data. This approach proves to be effective even when dealing with limited and small datasets. Experimental results reveal that the paradigm of Privileged Information significantly enhances the model's performance, enabling efficient extraction of depth information by using only RGB images.
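A minimal sketch of how privileged depth supervision can be wired in during training, assuming a frozen depth-pretrained backbone and a feature-matching (here MSE) loss; the actual losses and architectures in the paper may differ.

```python
# Sketch of privileged-information training: the depth backbone is frozen and
# only supervises features; exact losses and architectures are assumptions.
import torch
import torch.nn as nn

class HallucinationPoseNet(nn.Module):
    def __init__(self, rgb_backbone, depth_backbone, pose_head):
        super().__init__()
        self.rgb_backbone = rgb_backbone             # trainable, RGB input
        self.depth_backbone = depth_backbone.eval()  # pre-trained on depth, frozen
        for p in self.depth_backbone.parameters():
            p.requires_grad_(False)
        self.pose_head = pose_head

    def forward(self, rgb, depth=None):
        f_rgb = self.rgb_backbone(rgb)               # "hallucinated" depth features
        pose = self.pose_head(f_rgb)
        if depth is None:                            # inference: RGB only
            return pose, None
        with torch.no_grad():
            f_depth = self.depth_backbone(depth)     # privileged, training only
        return pose, nn.functional.mse_loss(f_rgb, f_depth)

# Training objective (illustrative):
#   loss = pose_loss(pose, gt) + lambda_hall * hallucination_loss
```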
https://arxiv.org/abs/2409.11104
Visual localization refers to the process of determining camera position and orientation within a known scene representation. This task is often complicated by factors such as illumination changes and variations in viewing angles. In this paper, we propose HGSLoc, a novel lightweight, plug-and-play pose optimization framework, which integrates 3D reconstruction with a heuristic refinement strategy to achieve higher pose estimation accuracy. Specifically, we introduce an explicit geometric map for 3D representation and high-fidelity rendering, allowing the generation of high-quality synthesized views to support accurate visual localization. Our method demonstrates faster rendering speed and higher localization accuracy compared to NeRF-based neural rendering localization approaches. The heuristic refinement strategy's efficient optimization capability can quickly locate the target node, and a step-level optimization stage further enhances pose accuracy in scenarios with small errors. With carefully designed heuristic functions, it offers efficient optimization capabilities, enabling rapid error reduction in rough localization estimates. Our method mitigates the dependence on complex neural network models while demonstrating improved robustness against noise and higher localization accuracy in challenging environments, as compared to neural network joint optimization strategies. The optimization framework proposed in this paper introduces novel approaches to visual localization by integrating the advantages of 3D reconstruction and a heuristic refinement strategy, and it demonstrates strong performance across multiple benchmark datasets, including 7Scenes and the DB dataset.
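The sketch below shows one simple render-and-compare refinement loop in the spirit of the heuristic strategy described above; render_view stands in for the explicit-map renderer, only translation is refined for brevity, and the coarse-to-fine step sizes are arbitrary assumptions rather than HGSLoc's actual heuristics.

```python
# Illustrative render-and-compare refinement; `render_view` is a placeholder for
# the explicit geometric-map renderer and is not HGSLoc's real API.
import numpy as np

def refine_pose(pose0, query_img, render_view, steps=(0.05, 0.01), iters=20):
    """Greedy coordinate-descent refinement over translation, coarse-to-fine."""
    def cost(pose):
        synth = render_view(pose)                      # synthesized view at this pose
        return np.mean((synth.astype(np.float32) - query_img) ** 2)

    pose = pose0.copy()                                # 4x4 pose matrix
    for step in steps:                                 # shrink step size (coarse -> fine)
        for _ in range(iters):
            best, improved = cost(pose), False
            for axis in range(3):
                for sign in (+1.0, -1.0):
                    cand = pose.copy()
                    cand[axis, 3] += sign * step       # perturb one translation axis
                    c = cost(cand)
                    if c < best:
                        best, pose, improved = c, cand, True
            if not improved:
                break
    return pose
```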
https://arxiv.org/abs/2409.10925
Camera-to-robot calibration is crucial for vision-based robot control and requires effort to make it accurate. Recent advancements in markerless pose estimation methods have eliminated the need for time-consuming physical setups for camera-to-robot calibration. While the existing markerless pose estimation methods have demonstrated impressive accuracy without the need for cumbersome setups, they rely on the assumption that all the robot joints are visible within the camera's field of view. However, in practice, robots usually move in and out of view, and some portion of the robot may stay out-of-frame during the whole manipulation task due to real-world constraints, leading to a lack of sufficient visual features and subsequent failure of these approaches. To address this challenge and enhance the applicability to vision-based robot control, we propose a novel framework capable of estimating the robot pose with partially visible robot manipulators. Our approach leverages the Vision-Language Models for fine-grained robot components detection, and integrates it into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions. The framework is evaluated on both public robot datasets and self-collected partial-view datasets to demonstrate our robustness and generalizability. As a result, this method is effective for robot pose estimation in a wider range of real-world manipulation scenarios.
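One way such a keypoint-based estimate can be completed under partial visibility is to run PnP on whichever joint keypoints are actually detected; the sketch below assumes the 2D detections and the corresponding 3D joint positions (from forward kinematics) are already available, and is generic machinery rather than the paper's network.

```python
# Sketch: recover the camera-to-robot transform from the visible joint keypoints
# via PnP. Keypoint detection and the 3D joint lookup are assumed to exist.
import numpy as np
import cv2

def camera_to_robot_pose(kpts_2d, kpts_3d, visible_mask, K):
    """kpts_2d: Nx2 detected pixels; kpts_3d: Nx3 joint positions in the robot
    base frame (from forward kinematics); visible_mask: which joints were found."""
    pts_2d = kpts_2d[visible_mask].astype(np.float64)
    pts_3d = kpts_3d[visible_mask].astype(np.float64)
    if len(pts_2d) < 4:                      # PnP needs enough correspondences
        return None
    ok, rvec, tvec = cv2.solvePnP(pts_3d, pts_2d, K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T                                 # robot-base pose in the camera frame
```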
https://arxiv.org/abs/2409.10441
Robots interacting with humans through natural language can unlock numerous applications such as Referring Grasp Synthesis (RGS). Given a text query, RGS determines a stable grasp pose to manipulate the referred object in the robot's workspace. RGS comprises two steps: visual grounding and grasp pose estimation. Recent studies leverage powerful Vision-Language Models (VLMs) for visually grounding free-flowing natural language in real-world robotic execution. However, comparisons in complex, cluttered environments with multiple instances of the same object are lacking. This paper introduces HiFi-CS, featuring hierarchical application of Featurewise Linear Modulation (FiLM) to fuse image and text embeddings, enhancing visual grounding for complex attribute rich text queries encountered in robotic grasping. Visual grounding associates an object in 2D/3D space with natural language input and is studied in two scenarios: Closed and Open Vocabulary. HiFi-CS features a lightweight decoder combined with a frozen VLM and outperforms competitive baselines in closed vocabulary settings while being 100x smaller in size. Our model can effectively guide open-set object detectors like GroundedSAM to enhance open-vocabulary performance. We validate our approach through real-world RGS experiments using a 7-DOF robotic arm, achieving 90.33% visual grounding accuracy in 15 tabletop scenes. We include our codebase in the supplementary material.
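FiLM itself is a simple, well-known operation: the text embedding predicts a per-channel scale and shift that modulate the visual features. The block below is a generic FiLM layer with assumed shapes, not the exact HiFi-CS module.

```python
# Minimal FiLM (Feature-wise Linear Modulation) block: text features predict a
# per-channel scale and shift applied to image features. Shapes are assumptions.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, text_dim, img_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(text_dim, 2 * img_channels)

    def forward(self, img_feat, text_feat):
        # img_feat: (B, C, H, W); text_feat: (B, text_dim)
        gamma, beta = self.to_gamma_beta(text_feat).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]          # broadcast over spatial dims
        beta = beta[:, :, None, None]
        return (1 + gamma) * img_feat + beta     # modulated visual features

# Hierarchical use, as described in the abstract: one FiLM block per feature
# scale of the visual backbone.
film = FiLM(text_dim=512, img_channels=256)
out = film(torch.randn(2, 256, 32, 32), torch.randn(2, 512))
```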
https://arxiv.org/abs/2409.10419
Autonomous driving holds great potential to transform road safety and traffic efficiency by minimizing human error and reducing congestion. A key challenge in realizing this potential is the accurate estimation of steering angles, which is essential for effective vehicle navigation and control. Recent breakthroughs in deep learning have made it possible to estimate steering angles directly from raw camera inputs. However, the limited available navigation data can hinder optimal feature learning, impacting the system's performance in complex driving scenarios. In this paper, we propose a shared encoder trained on multiple computer vision tasks critical for urban navigation, such as depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By incorporating diverse visual information used by humans during navigation, this unified encoder might enhance steering angle estimation. To achieve effective multi-task learning within a single encoder, we introduce a multi-scale feature network for pose estimation to improve depth learning. Additionally, we employ knowledge distillation from a multi-backbone model pretrained on these navigation tasks to stabilize training and boost performance. Our findings demonstrate that a shared backbone trained on diverse visual tasks is capable of providing overall perception capabilities. While our performance in steering angle estimation is comparable to existing methods, the integration of human-like perception through multi-task learning holds significant potential for advancing autonomous driving systems. More details and the pretrained model are available at this https URL.
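A compact sketch of the shared-encoder idea with per-task heads and feature-level distillation from a pretrained multi-backbone teacher; the head shapes, task list, and loss weights are illustrative assumptions rather than the paper's configuration.

```python
# Illustrative shared encoder with task heads plus teacher-feature distillation;
# shapes and losses are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderModel(nn.Module):
    def __init__(self, encoder, feat_dim=512, num_classes=19):
        super().__init__()
        self.encoder = encoder                       # shared backbone
        self.heads = nn.ModuleDict({
            "steering": nn.Linear(feat_dim, 1),
            "depth":    nn.Linear(feat_dim, 1),      # stand-in for a dense head
            "semantic": nn.Linear(feat_dim, num_classes),
        })

    def forward(self, x):
        feat = self.encoder(x)                       # (B, feat_dim)
        return feat, {k: h(feat) for k, h in self.heads.items()}

def training_loss(feat, teacher_feat, preds, targets, distill_weight=0.5):
    # Per-task losses: regression for steering/depth, cross-entropy for semantics.
    loss = (F.mse_loss(preds["steering"], targets["steering"])
            + F.l1_loss(preds["depth"], targets["depth"])
            + F.cross_entropy(preds["semantic"], targets["semantic"]))
    # Knowledge distillation: match the frozen multi-backbone teacher's features.
    loss = loss + distill_weight * F.mse_loss(feat, teacher_feat.detach())
    return loss
```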
https://arxiv.org/abs/2409.10095
In this paper, we propose a novel coarse-to-fine continuous pose diffusion method to enhance the precision of pick-and-place operations within robotic manipulation tasks. Leveraging the capabilities of diffusion networks, we facilitate the accurate perception of object poses. This accurate perception enhances both pick-and-place success rates and overall manipulation precision. Our methodology utilizes a top-down RGB image projected from an RGB-D camera and adopts a coarse-to-fine architecture. This architecture enables efficient learning of coarse and fine models. A distinguishing feature of our approach is its focus on continuous pose estimation, which enables more precise object manipulation, particularly concerning rotational angles. In addition, we employ pose and color augmentation techniques to enable effective training with limited data. Through extensive experiments in simulated and real-world scenarios, as well as an ablation study, we comprehensively evaluate our proposed methodology. Taken together, the findings validate its effectiveness in achieving high-precision pick-and-place tasks.
https://arxiv.org/abs/2409.09725
We present a contrastive learning framework based on in-the-wild hand images tailored for pre-training 3D hand pose estimators, dubbed HandCLR. Pre-training on large-scale images achieves promising results in various tasks, but prior 3D hand pose pre-training methods have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our method with contrastive learning. Specifically, we collected over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of similar hand poses originating from different samples, and we propose a novel contrastive learning method that embeds similar hand pairs closer in the latent space. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method on various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands.
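The core of this setup is an InfoNCE-style loss whose positives come from two different images with similar hand poses rather than from augmentations of one image; a generic version (assumed batch layout, not the exact HandCLR loss) is sketched below.

```python
# Sketch of cross-sample contrastive learning: positives are *different* images
# that show similar hand poses, not augmentations of the same image.
import torch
import torch.nn.functional as F

def cross_sample_info_nce(z_anchor, z_positive, temperature=0.07):
    """z_anchor[i] and z_positive[i] come from two different video frames whose
    hand poses were matched as similar; all other rows act as negatives."""
    z_a = F.normalize(z_anchor, dim=1)
    z_p = F.normalize(z_positive, dim=1)
    logits = z_a @ z_p.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)        # diagonal entries are positives

loss = cross_sample_info_nce(torch.randn(8, 128), torch.randn(8, 128))
```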
https://arxiv.org/abs/2409.09714
This paper presents a detailed system design and component selection for the Transforming Proximity Operations and Docking Service (TPODS) module, designed to gain custody of uncontrolled resident space objects (RSOs) via rendezvous and proximity operation (RPO). In addition to serving as a free-flying robotic manipulator to work with cooperative and uncooperative RSOs, the TPODS modules are engineered to have the ability to cooperate with one another to build scaffolding for more complex satellite servicing activities. The structural design of the prototype module is inspired by Tensegrity principles, minimizing the structural mass of the module's frame. The prototype TPODS module is fabricated using lightweight polycarbonate with an aluminum or carbon fiber frame. The inner shell that houses various electronic and pneumatic components is 3-D printed using ABS material. Four OpenMV H7 R1 cameras are used for the pose estimation of RSOs, including other TPODS modules. Compressed air supplied by an external source is used for the initial testing and can later be replaced by module-mounted nitrogen pressure vessels for full on-board propulsion. A Teensy 4.1 single-board computer is used as a central command unit that receives data from the four OpenMV cameras and commands the thrusters based on the control logic.
https://arxiv.org/abs/2409.09633
We propose MAC-VO, a novel learning-based stereo VO that leverages learned, metrics-aware matching uncertainty for dual purposes: selecting keypoints and weighting the residuals in pose graph optimization. Compared to traditional geometric methods that prioritize texture-affluent features like edges, our keypoint selector employs the learned uncertainty to filter out low-quality features based on global inconsistency. In contrast to learning-based algorithms that model a scale-agnostic diagonal weight matrix for covariance, we design a metrics-aware covariance model to capture the spatial error during keypoint registration and the correlations between different axes. Integrating this covariance model into pose graph optimization enhances the robustness and reliability of pose estimation, particularly in challenging environments with varying illumination, feature density, and motion patterns. On public benchmark datasets, MAC-VO outperforms existing VO algorithms and even some SLAM algorithms in challenging environments. The covariance map also provides valuable information about the reliability of the estimated poses, which can benefit decision-making for autonomous systems.
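The difference between a scale-agnostic diagonal weight and a full, metrics-aware covariance shows up directly in how residuals are weighted in the optimization. The snippet below illustrates the standard whitening trick with a made-up 3x3 covariance; it is generic least-squares machinery, not MAC-VO's implementation.

```python
# Illustration of weighting a residual by a full (non-diagonal) covariance in a
# least-squares / pose-graph objective, rather than a scale-agnostic diagonal one.
import numpy as np

def whitened_residual(r, Sigma):
    """Return L^{-1} r with Sigma = L L^T, so that ||L^{-1} r||^2 = r^T Sigma^{-1} r.
    Off-diagonal terms of Sigma capture correlations between axes."""
    L = np.linalg.cholesky(Sigma)
    return np.linalg.solve(L, r)

r = np.array([0.02, -0.01, 0.04])                  # 3D matching residual
Sigma = np.array([[4e-4, 1e-4, 0.0],               # example "metrics-aware" covariance
                  [1e-4, 4e-4, 0.0],
                  [0.0,  0.0,  2.5e-3]])
cost = np.sum(whitened_residual(r, Sigma) ** 2)    # equals r @ inv(Sigma) @ r
```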
https://arxiv.org/abs/2409.09479
Cooperative localization and target tracking are essential for multi-robot systems to implement high-level tasks. To this end, we propose a distributed invariant Kalman filter based on covariance intersection for effective multi-robot pose estimation. The paper utilizes object-level measurement models, whose condensed information further reduces the communication burden. Besides, by modeling states on special Lie groups, the better linearity and consistency of the invariant Kalman filter structure can be exploited. We also use a combination of covariance intersection (CI) and the Kalman filter (KF) to avoid overly confident or overly conservative estimates in multi-robot systems with intricate and unknown correlations, and multi-robot collaboration makes some level of individual robot degradation acceptable. Simulation and real-data experiments validate the practicability and superiority of the proposed algorithm.
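Covariance intersection has a compact closed form that is easy to state in code: the fused information matrix is a convex combination of the two inputs' information matrices, with the weight chosen here by a simple trace-minimizing grid search. The snippet is textbook CI fusion, not the paper's distributed filter.

```python
# Minimal covariance intersection (CI) fusion of two estimates with unknown
# cross-correlation; omega is chosen by minimizing the fused covariance trace.
import numpy as np

def covariance_intersection(x_a, P_a, x_b, P_b, n_grid=101):
    best = None
    for omega in np.linspace(0.0, 1.0, n_grid):
        info = omega * np.linalg.inv(P_a) + (1 - omega) * np.linalg.inv(P_b)
        P = np.linalg.inv(info)
        if best is None or np.trace(P) < np.trace(best[1]):
            x = P @ (omega * np.linalg.inv(P_a) @ x_a
                     + (1 - omega) * np.linalg.inv(P_b) @ x_b)
            best = (x, P)
    return best                      # consistent fusion regardless of correlation

x, P = covariance_intersection(np.array([1.0, 2.0]), np.diag([0.5, 0.5]),
                               np.array([1.2, 1.8]), np.diag([0.2, 1.0]))
```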
https://arxiv.org/abs/2409.09410
Person Re-Identification (Re-ID) has gained popularity in computer vision, enabling cross-camera pedestrian recognition. Although the development of deep learning has provided a robust technical foundation for person Re-ID research, most existing person Re-ID methods overlook the potential relationships among local person features, failing to adequately address the impact of pedestrian pose variations and local body parts occlusion. Therefore, we propose a Transformer-enhanced Graph Convolutional Network (Tran-GCN) model to improve Person Re-Identification performance in monitoring videos. The model comprises four key components: (1) A Pose Estimation Learning branch is utilized to estimate pedestrian pose information and inherent skeletal structure data, extracting pedestrian key point information; (2) A Transformer learning branch learns the global dependencies between fine-grained and semantically meaningful local person features; (3) A Convolution learning branch uses the basic ResNet architecture to extract the person's fine-grained local features; (4) A Graph Convolutional Module (GCM) integrates local feature information, global feature information, and body information for more effective person identification after fusion. Quantitative and qualitative analysis experiments conducted on three different datasets (Market-1501, DukeMTMC-ReID, and MSMT17) demonstrate that the Tran-GCN model can more accurately capture discriminative person features in monitoring videos, significantly improving identification accuracy.
https://arxiv.org/abs/2409.09391
Enhancing visual odometry by exploiting sparse depth measurements from LiDAR is a promising solution for improving the tracking accuracy of an odometry system. Most existing works utilize a monocular pinhole camera, yet can suffer from poor robustness due to the limited information available from a restricted field of view (FOV). This paper proposes a panoramic direct LiDAR-assisted visual odometry, which fully associates 360-degree FOV LiDAR points with 360-degree FOV panoramic image data. 360-degree FOV panoramic images can provide more available information, which can compensate for inaccurate pose estimation caused by insufficient texture or motion blur in a single view. In addition to constraints between a specific view at different times, constraints can also be built between different views at the same moment. Experimental results on public datasets demonstrate the benefit of the large FOV of our panoramic direct LiDAR-assisted visual odometry over state-of-the-art approaches.
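Associating 360-degree LiDAR points with a panoramic image typically starts from an equirectangular projection; the sketch below shows that projection under an assumed axis convention (x right, y down, z forward), which may differ from the paper's setup.

```python
# Assumed equirectangular camera model: project 360-degree LiDAR points into a
# panoramic image so residuals can be built in any viewing direction.
import numpy as np

def project_equirectangular(points_cam, width, height):
    """points_cam: Nx3 points in the camera frame (x right, y down, z forward).
    Returns Nx2 pixel coordinates under an equirectangular (panoramic) model."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    r = np.linalg.norm(points_cam, axis=1)
    lon = np.arctan2(x, z)                        # [-pi, pi], yaw around the vertical axis
    lat = np.arcsin(np.clip(y / r, -1.0, 1.0))    # [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (lat / np.pi + 0.5) * height
    return np.stack([u, v], axis=1)

uv = project_equirectangular(np.array([[1.0, 0.2, 2.0],
                                       [-3.0, -0.5, 0.1]]), 1920, 960)
```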
https://arxiv.org/abs/2409.09287
In recent years, transformer-based architectures have become the de facto standard for sequence modeling in deep learning frameworks. Inspired by these successful examples, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which better utilize historical data compared to the recurrent neural network (RNN) based methods used in recent work. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating the latent vectors temporally. Our study also examines the impact of data imbalance and rotation learning methods in supervised end-to-end learning of visual-inertial odometry by utilizing specialized gradients in backpropagation for the elements of the SE(3) group. The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference. Experimental results demonstrate that VIFT increases the accuracy of monocular VIO networks, achieving state-of-the-art results compared to previous methods on the KITTI dataset. The code will be made available at this https URL.
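A minimal sketch of the causal-attention idea: a transformer encoder with a subsequent-position mask refines a history of fused visual-inertial latents, and a small head regresses a per-step relative pose. The dimensions and the 6-DoF head are illustrative assumptions, not the exact VIFT architecture.

```python
# Sketch of causal attention over a history of fused visual-inertial latents.
import torch
import torch.nn as nn

class CausalLatentRefiner(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.pose_head = nn.Linear(d_model, 6)   # translation + rotation (e.g., axis-angle)

    def forward(self, latents):
        # latents: (B, T, d_model) fused visual-inertial features over time.
        T = latents.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf"),
                                            device=latents.device), diagonal=1)
        refined = self.encoder(latents, mask=causal_mask)   # each step attends only to the past
        return self.pose_head(refined)                      # per-step relative pose

poses = CausalLatentRefiner()(torch.randn(2, 10, 256))
```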
https://arxiv.org/abs/2409.08769
Despite researchers having extensively studied various ways to track body pose on-the-go, most prior work does not take into account wheelchair users, leading to poor tracking performance. Wheelchair users could greatly benefit from this pose information to prevent injuries, monitor their health, identify environmental accessibility barriers, and interact with gaming and VR experiences. In this work, we present WheelPoser, a real-time pose estimation system specifically designed for wheelchair users. Our system uses only four strategically placed IMUs on the user's body and wheelchair, making it far more practical than prior systems using cameras and dense IMU arrays. WheelPoser is able to track a wheelchair user's pose with a mean joint angle error of 14.30 degrees and a mean joint position error of 6.74 cm, more than three times better than similar systems using sparse IMUs. To train our system, we collect a novel WheelPoser-IMU dataset, consisting of 167 minutes of paired IMU sensor and motion capture data of people in wheelchairs, including wheelchair-specific motions such as propulsion and pressure relief. Finally, we explore the potential application space enabled by our system and discuss future opportunities. Open-source code, models, and dataset can be found here: this https URL.
https://arxiv.org/abs/2409.08494