This paper introduces the Point Cloud Network (PCN) architecture, a novel implementation of linear layers in deep learning networks, and provides empirical evidence advocating its use in place of the Multilayer Perceptron (MLP) for linear layers. We train several models, including the original AlexNet, using both MLP and PCN architectures for a direct comparison of linear layers (Krizhevsky et al., 2012). The key results collected are model parameter count and top-1 test accuracy on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). AlexNet-PCN16, our PCN equivalent of AlexNet, achieves comparable efficacy (test accuracy) to the original architecture with a 99.5% reduction of parameters in its linear layers. All training is done on cloud RTX 4090 GPUs, leveraging PyTorch for model construction and training. Code is provided for anyone to reproduce the trials from this paper.
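For a sense of the scale involved, the short sketch below counts the parameters of AlexNet's three fully connected layers in PyTorch, using the standard torchvision classifier layout of 9216 → 4096 → 4096 → num_classes with 10 outputs assumed for CIFAR-10. A 99.5% reduction of these linear-layer parameters, as reported for AlexNet-PCN16, would leave only a few hundred thousand; the PCN layer itself is not sketched here, since the abstract does not describe its internals.

```python
import torch.nn as nn

# Standard torchvision-style AlexNet classifier head: 9216 -> 4096 -> 4096 -> num_classes.
# 10 output classes assumed here for CIFAR-10.
mlp_head = nn.Sequential(
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 10),
)

mlp_params = sum(p.numel() for p in mlp_head.parameters())
print(f"MLP linear-layer parameters: {mlp_params:,}")           # roughly 54.6 million
print(f"after a 99.5% reduction:     {int(mlp_params * 0.005):,}")
```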
https://arxiv.org/abs/2309.12996
Semantic Scene Completion (SSC) aims to jointly generate space occupancies and semantic labels for complex 3D scenes. Most existing SSC models focus on volumetric representations, which are memory-inefficient for large outdoor spaces. Point clouds provide a lightweight alternative but existing benchmarks lack outdoor point cloud scenes with semantic labels. To address this, we introduce PointSSC, the first cooperative vehicle-infrastructure point cloud benchmark for semantic scene completion. These scenes exhibit long-range perception and minimal occlusion. We develop an automated annotation pipeline leveraging Segment Anything to efficiently assign semantics. To benchmark progress, we propose a LiDAR-based model with a Spatial-Aware Transformer for global and local feature extraction and a Completion and Segmentation Cooperative Module for joint completion and segmentation. PointSSC provides a challenging testbed to drive advances in semantic point cloud completion for real-world navigation.
https://arxiv.org/abs/2309.12708
Enriching the robot's representation of the operational environment is a challenging task that aims at bridging the gap between low-level sensor readings and high-level semantic understanding. A rich representation often requires computationally demanding architectures and purely point-cloud-based detection systems that struggle with the everyday objects the robot has to handle. To overcome these issues, we propose a graph-based representation that addresses this gap by providing a semantic representation of robot environments from multiple sources. To acquire information from the environment, the framework combines classical computer vision tools with modern computer vision cloud services, ensuring computational feasibility on onboard hardware. By incorporating an ontology hierarchy with over 800 object classes, the framework achieves cross-domain adaptability, eliminating the need for environment-specific tools. The proposed approach also allows us to handle small objects and integrate them into the semantic representation of the environment. The approach is implemented in the Robot Operating System (ROS) using the RViz visualizer for environment representation. This work is a first step towards the development of a general-purpose framework to facilitate intuitive interaction and navigation across different domains.
https://arxiv.org/abs/2309.12692
We report on an effort that led to POLAR3D, a set of digital assets that enhance the POLAR dataset of stereo images generated by NASA to mimic lunar lighting conditions. Our contributions are twofold. First, we have annotated each photo in the POLAR dataset, providing approximately 23,000 labels for rocks and their shadows. Second, we digitized several lunar terrain scenarios available in the POLAR dataset. Specifically, by utilizing both the lunar photos and POLAR's LiDAR point clouds, we constructed detailed obj files for all identifiable assets. POLAR3D is the set of digital assets comprising rock/shadow labels and obj files associated with the digital twins of lunar terrain scenarios. This new dataset can be used for training perception algorithms for lunar exploration and for synthesizing photorealistic images beyond the original POLAR collection. Likewise, the obj assets can be integrated into simulation environments to facilitate realistic rover operations in a digital twin of a POLAR scenario. POLAR3D is publicly available at this https URL to aid perception algorithm development, camera simulation efforts, and lunar simulation exercises.
https://arxiv.org/abs/2309.12397
Reconstructing a surface from a point cloud is an underdetermined problem. We use a neural network to study and quantify this reconstruction uncertainty under a Poisson smoothness prior. Our algorithm addresses the main limitations of existing work and can be fully integrated into the 3D scanning pipeline, from obtaining an initial reconstruction to deciding on the next best sensor position and updating the reconstruction upon capturing more data.
https://arxiv.org/abs/2309.11993
Over the last decades, ample achievements have been made in Structure from Motion (SfM). However, the vast majority of them work in an offline manner, i.e., images are first captured and then fed together into an SfM pipeline to obtain poses and a sparse point cloud. In this work, on the contrary, we present on-the-fly SfM: running SfM online while images are being captured, so that each newly taken image is estimated online with its corresponding pose and points, i.e., what you capture is what you get. Specifically, our approach first employs a vocabulary tree, trained in an unsupervised manner on learning-based global features, for fast image retrieval of the newly fly-in image. Then, a robust feature matching mechanism with least squares (LSM) is presented to improve image registration performance. Finally, by investigating the influence of the newly fly-in image's connected neighboring images, an efficient hierarchical weighted local bundle adjustment (BA) is used for optimization. Extensive experimental results demonstrate that on-the-fly SfM can meet the goal of robustly registering images while capturing them in an online way.
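As a hedged illustration of the retrieval step only: the sketch below performs nearest-neighbour lookup of a newly captured image against already-registered images using cosine similarity over global descriptors in NumPy. The descriptor dimension and database size are illustrative assumptions, and the paper's unsupervised vocabulary tree is replaced here by brute-force search.

```python
import numpy as np

def retrieve_top_k(query_desc: np.ndarray, db_descs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k registered images most similar to the newly fly-in image."""
    q = query_desc / (np.linalg.norm(query_desc) + 1e-12)
    db = db_descs / (np.linalg.norm(db_descs, axis=1, keepdims=True) + 1e-12)
    sims = db @ q                        # cosine similarity against every registered image
    return np.argsort(-sims)[:k]

# Illustrative usage: 256-D global features for 1000 registered images (random stand-ins).
rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 256)).astype(np.float32)
new_image = rng.standard_normal(256).astype(np.float32)
print(retrieve_top_k(new_image, db, k=5))
```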
https://arxiv.org/abs/2309.11883
Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be improved by multimodal sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract features from a sequence of images, point clouds, and radar raw data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances over an abstraction space and produce encoded vectors for the next-level feature extraction. We train the model on a combination of different modalities with supervised learning. To make the model robust to imbalanced data, we utilize focal loss and an exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data produces the best distance-based accuracy of predicted beams at 78.44%, with effective generalization to unseen day scenarios near 73% and night scenarios over 84%. This outperforms using other modalities and arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained on large sequences of multimodal wireless data and fine-tuned for multiple downstream radio network tasks.
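Focal loss and exponential moving averages are standard components; the sketch below shows one common PyTorch formulation of each, with gamma, alpha, and the EMA decay as assumed hyper-parameters rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Multi-class focal loss: down-weights easy examples to counter class imbalance."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.999) -> None:
    """Keep an exponential moving average of the weights for evaluation."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```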
https://arxiv.org/abs/2309.11811
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. While most prevalent methods progressively downscale the 3D point clouds and camera images and then fuse the high-level features, the downscaled features inevitably lose low-level detailed information. In this paper, we propose Fine-Grained Lidar-Camera Fusion (FGFusion), which makes full use of multi-scale features of images and point clouds and fuses them in a fine-grained way. First, we design a dual-pathway hierarchy structure to extract both high-level semantic and low-level detailed features of the image. Second, an auxiliary network is introduced to guide point cloud features to better learn fine-grained spatial information. Finally, we propose multi-scale fusion (MSF) to fuse the last N feature maps of the image and point cloud. Extensive experiments on two popular autonomous driving benchmarks, i.e., KITTI and Waymo, demonstrate the effectiveness of our method.
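One plausible reading of the MSF step is a scale-by-scale fusion of the last N camera and point-cloud (BEV) feature maps; the minimal sketch below resizes each image map to its point-cloud counterpart, concatenates, and applies a 1x1 convolution. The channel counts, N = 3, and the fusion operator are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Fuse the last N image / point-cloud (BEV) feature maps scale by scale."""

    def __init__(self, img_channels, pc_channels, out_channels):
        super().__init__()
        self.fuse = nn.ModuleList(
            nn.Conv2d(ic + pc, out_channels, kernel_size=1)
            for ic, pc in zip(img_channels, pc_channels)
        )

    def forward(self, img_feats, pc_feats):
        fused = []
        for conv, fi, fp in zip(self.fuse, img_feats, pc_feats):
            fi = F.interpolate(fi, size=fp.shape[-2:], mode="bilinear", align_corners=False)
            fused.append(conv(torch.cat([fi, fp], dim=1)))
        return fused

# Illustrative usage with N = 3 scales.
msf = MultiScaleFusion(img_channels=[64, 128, 256], pc_channels=[64, 64, 64], out_channels=128)
imgs = [torch.randn(1, c, 64 // s, 64 // s) for c, s in zip([64, 128, 256], [1, 2, 4])]
bevs = [torch.randn(1, 64, 32 // s, 32 // s) for s in [1, 2, 4]]
print([o.shape for o in msf(imgs, bevs)])
```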
https://arxiv.org/abs/2309.11804
Recently, multi-modality models have been introduced to exploit the complementary information from different sensors such as LiDAR and cameras. They require paired data along with precise calibration for all modalities; the complicated calibration among modalities greatly increases the cost of collecting such high-quality datasets and hinders them from being applied to practical scenarios. Building on previous works, we not only fuse information from multiple modalities without the above issues but also exhaust the information in the RGB modality. We introduce 2D Detection Annotations Transmittable Aggregation (2DDATA), designing a data-specific branch, called the Local Object Branch, which aims to deal with the points inside a given bounding box, exploiting the ease of acquiring 2D bounding box annotations. We demonstrate that our simple design can transmit bounding box prior information to the 3D encoder model, proving the feasibility of large multi-modality models fused with modality-specific data.
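One way to picture the Local Object Branch is as a routine that gathers the LiDAR points falling inside a projected 2D detection box, so that the box label can act as a prior for those points. The sketch below shows only that selection step, assuming the points are already projected to pixel coordinates; the image size and box are hypothetical.

```python
import numpy as np

def points_in_box(uv: np.ndarray, box: tuple) -> np.ndarray:
    """Boolean mask of projected points (N, 2) lying inside a 2D box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)

# Illustrative usage: select the points covered by one 2D detection.
uv = np.random.rand(2048, 2) * [1242, 375]     # assumed pixel coordinates of projected points
mask = points_in_box(uv, (300, 100, 500, 250))
print(mask.sum(), "points routed to the local branch")
```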
https://arxiv.org/abs/2309.11755
In this paper, we study the problem of task-oriented grasp synthesis from partial point cloud data using an eye-in-hand camera configuration. In task-oriented grasp synthesis, a grasp has to be selected so that the object is not lost during manipulation, and it is also ensured that adequate force/moment can be applied to perform the task. We formalize the notion of a gross manipulation task as a constant screw motion (or a sequence of constant screw motions) to be applied to the object after grasping. Using this notion of task, and a corresponding grasp quality metric developed in our prior work, we use a neural network to approximate a function for predicting the grasp quality metric on a cuboid shape. We show that by using a bounding box obtained from the partial point cloud of an object, and the grasp quality metric mentioned above, we can generate a good grasping region on the bounding box that can be used to compute an antipodal grasp on the actual object. Our algorithm does not use any manually labeled data or grasping simulator, thus making it very efficient to implement and integrate with screw linear interpolation-based motion planners. We present simulation as well as experimental results that show the effectiveness of our approach.
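As a minimal illustration of the bounding-box step only: the sketch below fits an axis-aligned box to a partial point cloud with NumPy (whether the paper uses an axis-aligned or oriented box is not stated in the abstract, so this is an assumption); the resulting cuboid is what the learned grasp-quality function would reason about.

```python
import numpy as np

def aabb_from_points(points: np.ndarray):
    """Axis-aligned bounding box (min corner, max corner) of an (N, 3) partial point cloud."""
    return points.min(axis=0), points.max(axis=0)

# Illustrative usage on a synthetic partial scan of a small box-like object.
rng = np.random.default_rng(1)
partial_cloud = rng.uniform([-0.05, -0.03, 0.0], [0.05, 0.03, 0.12], size=(500, 3))
lo, hi = aabb_from_points(partial_cloud)
print("cuboid extents (m):", hi - lo)
```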
https://arxiv.org/abs/2309.11689
Creating accurate 3D models of tree topology is an important task for tree pruning. The 3D model is used to decide which branches to prune and then to execute the pruning cuts. Previous methods for creating 3D tree models have typically relied on point clouds, which are often computationally expensive to process and can suffer from data defects, especially with thin branches. In this paper, we propose a method for actively scanning along a primary tree branch, detecting secondary branches to be pruned, and reconstructing their 3D geometry using just an RGB camera mounted on a robot arm. We experimentally validate that our setup is able to produce primary branch models with 4-5 mm accuracy and secondary branch models with 15 degrees orientation accuracy with respect to the ground truth model. Our framework is real-time and can run up to 10 cm/s with no loss in model accuracy or ability to detect secondary branches.
https://arxiv.org/abs/2309.11580
Few-shot point cloud semantic segmentation aims to train a model to quickly adapt to new unseen classes with only a handful of support set samples. However, the noise-free assumption on the support set can be easily violated in many practical real-world settings. In this paper, we focus on improving the robustness of few-shot point cloud segmentation under the detrimental influence of noisy support sets at testing time. To this end, we first propose Component-level Clean Noise Separation (CCNS) representation learning to learn discriminative feature representations that separate the clean samples of the target classes from the noisy samples. Leveraging the well-separated clean and noisy support samples from our CCNS, we further propose a Multi-scale Degree-based Noise Suppression (MDNS) scheme to remove the noisy shots from the support set. We conduct extensive experiments under various noise settings on two benchmark datasets. Our results show that the combination of CCNS and MDNS significantly improves the performance. Our code is available at this https URL.
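A hedged, single-scale picture of degree-based suppression: build a similarity graph over the support shots and keep only those with sufficiently high degree. The sketch below uses cosine similarity on raw feature vectors with an assumed threshold; the paper's MDNS operates at multiple scales on the CCNS-learned features.

```python
import numpy as np

def suppress_noisy_shots(feats: np.ndarray, sim_thresh: float = 0.7) -> np.ndarray:
    """Keep support shots whose degree in a cosine-similarity graph reaches the median."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    sim = f @ f.T
    np.fill_diagonal(sim, 0.0)
    degree = (sim > sim_thresh).sum(axis=1)   # how many other shots each shot agrees with
    return degree >= np.median(degree)

# Illustrative usage: five clustered (clean) shots plus two outliers.
rng = np.random.default_rng(2)
clean = rng.normal(loc=1.0, scale=0.05, size=(5, 64))
noisy = rng.normal(loc=-1.0, scale=0.5, size=(2, 64))
print(suppress_noisy_shots(np.vstack([clean, noisy])))   # clean kept, outliers dropped
```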
https://arxiv.org/abs/2309.11228
Existing fully-supervised point cloud segmentation methods suffer in the dynamic testing environment with emerging new classes. Few-shot point cloud segmentation algorithms address this problem by learning to adapt to new classes at the sacrifice of segmentation accuracy for the base classes, which severely impedes its practicality. This largely motivates us to present the first attempt at a more practical paradigm of generalized few-shot point cloud segmentation, which requires the model to generalize to new categories with only a few support point clouds and simultaneously retain the capability to segment base classes. We propose the geometric words to represent geometric components shared between the base and novel classes, and incorporate them into a novel geometric-aware semantic representation to facilitate better generalization to the new classes without forgetting the old ones. Moreover, we introduce geometric prototypes to guide the segmentation with geometric prior knowledge. Extensive experiments on S3DIS and ScanNet consistently illustrate the superior performance of our method over baseline methods. Our code is available at: this https URL.
https://arxiv.org/abs/2309.11222
In the current deep learning paradigm, the amount and quality of training data are as critical as the network architecture and its training details. However, collecting, processing, and annotating real data at scale is difficult, expensive, and time-consuming, particularly for tasks such as 3D object registration. While synthetic datasets can be created, they require expertise to design and include a limited number of categories. In this paper, we introduce a new approach called AutoSynth, which automatically generates 3D training data for point cloud registration. Specifically, AutoSynth automatically curates an optimal dataset by exploring a search space encompassing millions of potential datasets with diverse 3D shapes at a low cost. To achieve this, we generate synthetic 3D datasets by assembling shape primitives, and develop a meta-learning strategy to search for the best training data for 3D registration on real point clouds. For this search to remain tractable, we replace the point cloud registration network with a much smaller surrogate network, leading to a 4056.43 times speedup. We demonstrate the generality of our approach by implementing it with two different point cloud registration networks, BPNet and IDAM. Our results on TUD-L, LINEMOD and Occluded-LINEMOD evidence that a neural network trained on our searched dataset yields consistently better performance than the same one trained on the widely used ModelNet40 dataset.
https://arxiv.org/abs/2309.11170
In this paper, we present a simultaneous exploration and object search framework for the application of autonomous trolley collection. For environment representation, a task-oriented environment partitioning algorithm is presented to extract diverse information for each sub-task. First, LiDAR data is classified as potential objects, walls, and obstacles after outlier removal. Segmented point clouds are then transformed into a hybrid map with the following functional components: object proposals to avoid missing trolleys during exploration; room layouts for semantic space segmentation; and polygonal obstacles containing geometry information for efficient motion planning. For exploration and simultaneous trolley collection, we propose an efficient exploration-based object search method. First, a traveling salesman problem with precedence constraints (TSP-PC) is formulated by grouping frontiers and object proposals. The next target is selected by prioritizing object search while avoiding excessive robot backtracking. Then, feasible trajectories with adequate obstacle clearance are generated by topological graph search. We validate the proposed framework through simulations and demonstrate the system with real-world autonomous trolley collection tasks.
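As a rough illustration of the target-selection idea only (not the paper's TSP-PC solver): frontiers and object proposals are candidate nodes, and object proposals are favoured through a cost discount so that object search takes priority without forcing long detours. The bonus value and cost model below are assumptions.

```python
import numpy as np

def next_target(robot_xy, frontiers, proposals, object_bonus: float = 5.0):
    """Greedy next-target choice: Euclidean cost, with object proposals discounted."""
    candidates = [(np.linalg.norm(p - robot_xy) - object_bonus, "object", p) for p in proposals]
    candidates += [(np.linalg.norm(f - robot_xy), "frontier", f) for f in frontiers]
    cost, kind, target = min(candidates, key=lambda c: c[0])
    return kind, target

robot = np.array([0.0, 0.0])
frontiers = [np.array([2.0, 1.0]), np.array([-4.0, 3.0])]
proposals = [np.array([6.0, 0.0])]                 # a suspected trolley
print(next_target(robot, frontiers, proposals))    # the trolley wins despite being farther
```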
https://arxiv.org/abs/2309.11107
Affordance detection presents intricate challenges and has a wide range of robotic applications. Previous works have faced limitations such as the complexity of 3D object shapes, the wide range of potential affordances on real-world objects, and the lack of open-vocabulary support for affordance understanding. In this paper, we introduce a new open-vocabulary affordance detection method in 3D point clouds, leveraging knowledge distillation and text-point correlation. Our approach employs pre-trained 3D models through knowledge distillation to enhance feature extraction and semantic understanding in 3D point clouds. We further introduce a new text-point correlation method to learn the semantic links between point cloud features and open-vocabulary labels. Intensive experiments show that our approach outperforms previous works and adapts to new affordance labels and unseen objects. Notably, our method achieves a 7.96% mIoU improvement over the baselines. Furthermore, it offers real-time inference, which is well suited for robotic manipulation applications.
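The open-vocabulary labelling can be pictured as scoring each point feature against text embeddings of the candidate affordance names; the sketch below does this with cosine similarity and a softmax in PyTorch. The shared embedding dimension and temperature are assumptions, and the text encoder itself is out of scope here.

```python
import torch
import torch.nn.functional as F

def label_points(point_feats: torch.Tensor, text_embeds: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Per-point affordance scores: cosine similarity between point features (N, D)
    and text embeddings of the label set (L, D), softmaxed over labels."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    return F.softmax(p @ t.t() / temperature, dim=-1)   # (N, L)

# Illustrative usage: 2048 points, 4 candidate affordance labels, 512-D shared space.
scores = label_points(torch.randn(2048, 512), torch.randn(4, 512))
print(scores.shape, scores.argmax(dim=-1).shape)        # per-point open-vocabulary label
```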
https://arxiv.org/abs/2309.10932
Affordance detection and pose estimation are of great importance in many robotic applications. Their combination helps the robot gain an enhanced manipulation capability, in which the generated pose can facilitate the corresponding affordance task. Previous methods for affordance-pose joint learning are limited to a predefined set of affordances, thus limiting the adaptability of robots in real-world environments. In this paper, we propose a new method for language-conditioned affordance-pose joint learning in 3D point clouds. Given a 3D point cloud object, our method detects the affordance region and generates appropriate 6-DoF poses for any unconstrained affordance label. Our method consists of an open-vocabulary affordance detection branch and a language-guided diffusion model that generates 6-DoF poses based on the affordance text. We also introduce a new high-quality dataset for the task of language-driven affordance-pose joint learning. Intensive experimental results demonstrate that our proposed method works effectively on a wide range of open-vocabulary affordances and outperforms other baselines by a large margin. In addition, we illustrate the usefulness of our method in real-world robotic applications. Our code and dataset are publicly available at this https URL.
https://arxiv.org/abs/2309.10911
This letter describes an incremental multimodal surface mapping methodology, which represents the environment as a continuous probabilistic model. This model enables high-resolution reconstruction while simultaneously compressing spatial and intensity point cloud data. The strategy employed in this work utilizes Gaussian mixture models (GMMs) to represent the environment. While prior GMM-based mapping works have developed methodologies to determine the number of mixture components using information-theoretic techniques, these approaches either operate on individual sensor observations, making them unsuitable for incremental mapping, or are not real-time viable, especially for applications where high-fidelity modeling is required. To bridge this gap, this letter introduces a spatial hash map for rapid GMM submap extraction combined with an approach to determine relevant and redundant data in a point cloud. These contributions increase computational speed by an order of magnitude compared to state-of-the-art incremental GMM-based mapping. In addition, the proposed approach yields a superior tradeoff in map accuracy and size when compared to state-of-the-art mapping methodologies (both GMM- and not GMM-based). Evaluations are conducted using both simulated and real-world data. The software is released open-source to benefit the robotics community.
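The spatial-hash idea can be sketched independently of the GMM machinery: hash each component (here, just its mean position) into a coarse voxel key, then pull the submap around a query pose by visiting a handful of neighbouring cells. The cell size and query radius below are illustrative assumptions.

```python
from collections import defaultdict
import numpy as np

class SpatialHash:
    """Coarse voxel hash: insert items by 3D position, query everything near a pose."""

    def __init__(self, cell_size: float = 2.0):
        self.cell = cell_size
        self.grid = defaultdict(list)

    def _key(self, xyz):
        return tuple(np.floor(np.asarray(xyz) / self.cell).astype(int))

    def insert(self, xyz, item):
        self.grid[self._key(xyz)].append(item)

    def query(self, xyz, radius_cells: int = 1):
        cx, cy, cz = self._key(xyz)
        out = []
        for dx in range(-radius_cells, radius_cells + 1):
            for dy in range(-radius_cells, radius_cells + 1):
                for dz in range(-radius_cells, radius_cells + 1):
                    out.extend(self.grid.get((cx + dx, cy + dy, cz + dz), []))
        return out

# Illustrative usage: index 500 GMM component means, then extract the local submap.
h = SpatialHash(cell_size=2.0)
rng = np.random.default_rng(3)
for i, mean in enumerate(rng.uniform(-20, 20, size=(500, 3))):
    h.insert(mean, i)
print(len(h.query([0.0, 0.0, 0.0])), "components in the local submap around the origin")
```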
https://arxiv.org/abs/2309.10900
This document presents PLVS: a real-time system that leverages sparse SLAM, volumetric mapping, and 3D unsupervised incremental segmentation. PLVS stands for Points, Lines, Volumetric mapping, and Segmentation. It supports RGB-D and Stereo cameras, which may be optionally equipped with IMUs. The SLAM module is keyframe-based, and extracts and tracks sparse points and line segments as features. Volumetric mapping runs in parallel with respect to the SLAM front-end and generates a 3D reconstruction of the explored environment by fusing point clouds backprojected from keyframes. Different volumetric mapping methods are supported and integrated in PLVS. We use a novel reprojection error to bundle-adjust line segments. This error exploits available depth information to stabilize the position estimates of line segment endpoints. An incremental and geometric-based segmentation method is implemented and integrated for RGB-D cameras in the PLVS framework. We present qualitative and quantitative evaluations of the PLVS framework on some publicly available datasets. The appendix details the adopted stereo line triangulation method and provides a derivation of the Jacobians we used for line error terms. The software is available as open-source.
https://arxiv.org/abs/2309.10896
Current state-of-the-art point cloud-based perception methods usually rely on large-scale labeled data, which requires expensive manual annotation. A natural option is to explore unsupervised methodologies for 3D perception tasks. However, such methods often face substantial performance drops. Fortunately, we found that there exist large amounts of image-based datasets, so an alternative can be proposed, i.e., transferring the knowledge in the 2D images to 3D point clouds. Specifically, we propose a novel approach for the challenging cross-modal and cross-domain adaptation task by fully exploring the relationship between images and point clouds and designing effective feature alignment strategies. Without any 3D labels, our method achieves state-of-the-art performance for 3D point cloud semantic segmentation on SemanticKITTI by using the knowledge of KITTI360 and GTA5, compared to existing unsupervised and weakly-supervised baselines.
https://arxiv.org/abs/2309.10649