In recent years, modern techniques in deep learning and large-scale datasets have led to impressive progress in 3D instance segmentation, grasp pose estimation, and robotics. This allows for accurate detection directly in 3D scenes, object- and environment-aware grasp prediction, and robust and repeatable robotic manipulation. This work aims to integrate these recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments. Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation, alongside grasp pose estimation, to demonstrate dynamic picking of objects and opening of drawers. We evaluate the performance and robustness of our framework in two sets of real-world experiments, dynamic object retrieval and drawer opening, reporting success rates of 51% and 82%, respectively. Code for our framework as well as videos are available at: this https URL.
https://arxiv.org/abs/2404.12440
With the rise of large-scale models trained on diverse datasets, in-context learning has emerged as a promising paradigm for multitasking, notably in natural language processing and image processing. However, its application to 3D point cloud tasks remains largely unexplored. In this work, we introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning. We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module and proposing a vanilla version of PIC called Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model for various 3D point cloud tasks, with inputs and outputs modeled as coordinates. In this paradigm, the challenging segmentation task is achieved by assigning each category a label point with fixed XYZ coordinates; the final prediction is then chosen based on the label point closest to the prediction. To break the limitation of the fixed label-coordinate assignment, which generalizes poorly to novel classes, we propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S) that targets improved dynamic context labeling and model training. By utilizing dynamic in-context labels and extra in-context pairs, PIC-S achieves enhanced performance and generalization capability in and across part segmentation datasets. PIC is a general framework, so other tasks or datasets can be seamlessly introduced through a unified data format. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks and segmenting across multiple datasets. Our PIC-S is capable of generalizing to unseen datasets and performing novel part segmentation by customizing prompts.
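To make the label-point mechanism concrete, below is a minimal numpy sketch of segmentation-as-coordinates: each category gets a fixed XYZ label point, and predicted points are classified by their nearest label point. The specific coordinates and function names are illustrative placeholders, not PIC-G's actual assignment.

```python
import numpy as np

# Hypothetical label points: each part category is assigned a fixed XYZ
# anchor in the output space (PIC-G's actual assignment may differ).
LABEL_POINTS = np.array([
    [0.0, 0.0, 1.0],   # category 0
    [0.0, 1.0, 0.0],   # category 1
    [1.0, 0.0, 0.0],   # category 2
])

def segment_from_predictions(pred_points: np.ndarray) -> np.ndarray:
    """Assign each predicted output point (N, 3) to the category whose
    label point is closest, mirroring PIC-G's segmentation head."""
    # Pairwise squared distances between predictions and label points: (N, C)
    d2 = ((pred_points[:, None, :] - LABEL_POINTS[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # (N,) predicted category per point

# Example: three predicted points, each snapped to its nearest label point.
preds = np.array([[0.1, -0.05, 0.9], [0.0, 0.8, 0.1], [0.9, 0.1, 0.0]])
print(segment_from_predictions(preds))  # -> [0 1 2]
```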
https://arxiv.org/abs/2404.12352
Recognizing places from an opposing viewpoint during a return trip is a common experience for human drivers. However, the analogous robotics capability, visual place recognition (VPR) with limited field of view cameras under 180 degree rotations, has proven to be challenging to achieve. To address this problem, this paper presents Same Place Opposing Trajectory (SPOT), a technique for opposing viewpoint VPR that relies exclusively on structure estimated through stereo visual odometry (VO). The method extends recent advances in lidar descriptors and utilizes a novel double (similar and opposing) distance matrix sequence matching method. We evaluate SPOT on a publicly available dataset with 6.7-7.6 km routes driven in similar and opposing directions under various lighting conditions. The proposed algorithm demonstrates remarkable improvement over the state-of-the-art, achieving up to 91.7% recall at 100% precision in opposing viewpoint cases, while requiring less storage than all baselines tested and running faster than all but one. Moreover, the proposed method assumes no a priori knowledge of whether the viewpoint is similar or opposing, and also demonstrates competitive performance in similar viewpoint cases.
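One plausible reading of the double (similar and opposing) distance-matrix idea is sketched below in numpy: build one descriptor distance matrix for the same-direction hypothesis and one for the reversed-traversal hypothesis, then keep the elementwise minimum before sequence matching. How SPOT actually derives opposing-viewpoint descriptors from stereo-VO structure, and how it combines the matrices, are not shown here; the reversal and min-combination are our assumptions.

```python
import numpy as np

def double_distance_matrix(query, ref, ref_opposing):
    """query: (M, D) descriptors along the current trajectory.
    ref: (N, D) descriptors from the prior traversal.
    ref_opposing: (N, D) the same places re-described for a 180-degree
    rotated viewpoint (hypothetical input; SPOT computes these from
    structure estimated by stereo visual odometry)."""
    def dists(a, b):
        # Euclidean distance between every descriptor pair: (M, N)
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    d_similar = dists(query, ref)
    # An opposing traversal sweeps the reference sequence in reverse order.
    d_opposing = dists(query, ref_opposing)[:, ::-1]
    # Keep whichever hypothesis explains each pair better; a SeqSLAM-style
    # sequence search then runs on the combined matrix, so no a priori
    # knowledge of similar vs. opposing direction is needed.
    return np.minimum(d_similar, d_opposing)
```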
https://arxiv.org/abs/2404.12339
Precise robot manipulations require rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which are sensitive to camera view changes. Policies utilizing 3D point clouds usually predict keyframes rather than continuous actions, posing difficulty in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning, which predicts continuous actions directly from single-view point clouds. It compresses the point cloud to tokens with a sparse 3D encoder. After adding sparse positional encoding, the tokens are featurized using a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses currently representative 2D and 3D policies by a large margin, showcasing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and robust to environmental change compared with previous baselines. Project website: this http URL.
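The pipeline reads as a compact encoder-decoder stack; below is a structural sketch in PyTorch of the stages named in the abstract (point cloud to tokens, sparse positional encoding, transformer, action head). The real model uses a sparse 3D encoder and a diffusion action head; both are replaced here by dense stand-ins, and all sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class RISESketch(nn.Module):
    """Stage-by-stage stand-in for the RISE pipeline described above."""
    def __init__(self, d_model=128, n_tokens=64, action_dim=7):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                       nn.Linear(d_model, d_model))
        self.pos_enc = nn.Linear(3, d_model)  # stand-in for sparse positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, action_dim)  # diffusion head stand-in
        self.n_tokens = n_tokens

    def forward(self, points):                    # points: (B, N, 3)
        # Token compression: the paper uses a sparse 3D encoder; random
        # subsampling keeps this sketch short.
        idx = torch.randperm(points.shape[1])[: self.n_tokens]
        centers = points[:, idx, :]               # (B, T, 3)
        tokens = self.point_mlp(centers) + self.pos_enc(centers)
        feats = self.transformer(tokens)          # (B, T, d_model)
        return self.action_head(feats.mean(dim=1))  # (B, action_dim)

print(RISESketch()(torch.randn(2, 1024, 3)).shape)  # torch.Size([2, 7])
```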
https://arxiv.org/abs/2404.12281
Knowledge of tree species distribution is fundamental to managing forests. New deep learning approaches promise significant accuracy gains for forest mapping and are becoming a critical tool for mapping multiple tree species at scale. To advance the field, deep learning researchers need large benchmark datasets with high-quality annotations. To this end, we present the PureForest dataset: a large-scale, open, multimodal dataset designed for tree species classification from both Aerial Lidar Scanning (ALS) point clouds and Very High Resolution (VHR) aerial images. Most current public Lidar datasets for tree species classification have low diversity, as they only span a small area of a few dozen annotated hectares at most. In contrast, PureForest has 18 tree species grouped into 13 semantic classes and spans 339 km$^2$ across 449 distinct monospecific forests, making it to date the largest and most comprehensive Lidar dataset for the identification of tree species. By making PureForest publicly available, we hope to provide a challenging benchmark dataset to support the development of deep learning approaches for tree species identification from Lidar and/or aerial imagery. In this data paper, we describe the annotation workflow, the dataset, and the recommended evaluation methodology, and we establish baseline performance from both 3D and 2D modalities.
https://arxiv.org/abs/2404.12064
LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.
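The feature-invariance objective is close in spirit to a supervised contrastive loss computed across source domains. The sketch below implements the generic SupCon formulation (Khosla et al.) as one way to pull same-class object features together and push different classes apart; CLIX$^\text{3D}$'s exact loss and batching may differ.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(feats, labels, temperature=0.1):
    """feats: (N, D) object features pooled from samples of several source
    domains; labels: (N,) int64 class ids. Same-class pairs (regardless of
    domain) attract, different-class pairs repel."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature                    # (N, N)
    off_diag = ~torch.eye(len(feats), dtype=torch.bool)      # drop self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & off_diag
    sim = sim.masked_fill(~off_diag, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Mean log-probability of the positives for each anchor that has any.
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0).sum(1) / pos_counts)
    return loss[pos_mask.sum(1) > 0].mean()
```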
https://arxiv.org/abs/2404.11764
Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition have limitations due to sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLMs into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach aligns the 4D encoder's representation with a VLM to learn a shared visual and text space from training on large-scale image-text pairs. By transferring the knowledge of the VLM to the 4D encoder and combining the 4D encoder with the VLM, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120 dataset. Code is available at \url{this https URL}.
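Aligning the 4D encoder with the VLM's shared visual-text space is, at its core, a CLIP-style contrastive objective; a minimal sketch follows. The symmetric InfoNCE form and the assumption that row i of each batch is a matched pair are ours; the paper's exact alignment loss may differ.

```python
import torch
import torch.nn.functional as F

def vlm_alignment_loss(point_feats, text_feats, temperature=0.07):
    """point_feats: (B, D) projected 4D encoder outputs; text_feats: (B, D)
    VLM embeddings of the matching action labels (row i pairs with row i)."""
    p = F.normalize(point_feats, dim=1)
    t = F.normalize(text_feats, dim=1)
    logits = p @ t.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(p))          # matched pairs lie on the diagonal
    # Symmetric InfoNCE: point-to-text and text-to-point.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```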
https://arxiv.org/abs/2404.11605
Modern agricultural applications rely more and more on deep learning solutions. However, training well-performing deep networks requires a large amount of annotated data that may not be available, and in the case of 3D annotation may not even be feasible for human annotators. In this work, we develop a deep learning approach to segment mushrooms and estimate their pose on 3D data, in the form of point clouds acquired by depth sensors. To circumvent the annotation problem, we create a synthetic dataset of mushroom scenes in which full 3D information, such as the pose of each mushroom, is known by construction. The proposed network has a fully convolutional backbone that parses sparse 3D data and predicts pose information that implicitly defines both the instance segmentation and pose estimation tasks. We validate the effectiveness of the proposed implicit-based approach on a synthetic test set and provide qualitative results for a small set of real point clouds acquired with depth sensors. Code is publicly available at this https URL.
https://arxiv.org/abs/2404.12144
This paper presents a vision and perception research dataset collected in Rome, featuring RGB data, 3D point clouds, IMU, and GPS data. We introduce a new benchmark targeting visual odometry and SLAM to advance research in autonomous robotics and computer vision. This work complements existing datasets by simultaneously addressing several issues, such as environment diversity, motion patterns, and sensor frequency. It uses up-to-date devices and presents effective procedures to accurately calibrate the intrinsics and extrinsics of the sensors while addressing temporal synchronization. During recording, we cover multi-floor buildings, gardens, and urban and highway scenarios. Combining handheld and car-based data collections, our setup can simulate any robot (quadrupeds, quadrotors, autonomous vehicles). The dataset includes an accurate 6-DoF ground truth based on a novel methodology that refines the RTK-GPS estimate with LiDAR point clouds through Bundle Adjustment. All sequences, divided into training and testing sets, are accessible through our website.
https://arxiv.org/abs/2404.11322
Crop biomass, a critical indicator of plant growth, health, and productivity, is invaluable for crop breeding programs and agronomic research. However, the accurate and scalable quantification of crop biomass remains inaccessible due to limitations in existing measurement methods. One of the obstacles impeding the advancement of current crop biomass prediction methodologies is the scarcity of publicly available datasets. Addressing this gap, we introduce a new dataset in this domain, i.e., the Multi-modality dataset for crop biomass estimation (MMCBE). Comprising 216 sets of multi-view drone images coupled with LiDAR point clouds and hand-labelled ground truth, MMCBE represents the first multi-modality dataset in the field. This dataset aims to establish benchmark methods for crop biomass quantification and foster the development of vision-based approaches. We have rigorously evaluated state-of-the-art crop biomass estimation methods using MMCBE and ventured into additional potential applications, such as 3D crop reconstruction from drone imagery and novel-view rendering. With this publication, we are making our comprehensive dataset available to the broader community.
https://arxiv.org/abs/2404.11256
Establishing accurate 3D correspondences between shapes stands as a pivotal challenge with profound implications for computer vision and robotics. However, existing self-supervised methods for this problem assume perfect input shape alignment, restricting their real-world applicability. In this work, we introduce a novel self-supervised Rotation-Invariant 3D correspondence learner with Local Shape Transform, dubbed RIST, that learns to establish dense correspondences between shapes even under challenging intra-class variations and arbitrary orientations. Specifically, RIST learns to dynamically formulate an SO(3)-invariant local shape transform for each point, which maps the SO(3)-equivariant global shape descriptor of the input shape to a local shape descriptor. These local shape descriptors are provided as inputs to our decoder to facilitate point cloud self- and cross-reconstruction. Our proposed self-supervised training pipeline encourages semantically corresponding points from different shapes to be mapped to similar local shape descriptors, enabling RIST to establish dense point-wise correspondences. RIST demonstrates state-of-the-art performance on 3D part label transfer and semantic keypoint transfer given arbitrarily rotated point cloud pairs, outperforming existing methods by significant margins.
https://arxiv.org/abs/2404.11156
The integration of Light Detection and Ranging (LiDAR) and Internet of Things (IoT) technologies offers transformative opportunities for public health informatics in urban safety and pedestrian well-being. This paper proposes a novel framework utilizing these technologies for enhanced 3D object detection and activity classification in urban traffic scenarios. By employing elevated LiDAR, we obtain detailed 3D point cloud data, enabling precise pedestrian activity monitoring. To overcome urban data scarcity, we create a specialized dataset through simulated traffic environments in Blender, facilitating targeted model training. Our approach employs a modified Point Voxel-Region-based Convolutional Neural Network (PV-RCNN) for robust 3D detection and PointNet for classifying pedestrian activities. This dual-model approach not only enhances urban traffic management but also contributes significantly to public health by providing insights into pedestrian behavior and promoting safer urban environments.
https://arxiv.org/abs/2404.10978
Today's software stacks for autonomous vehicles rely on HD maps to enable sufficient localization, accurate path planning, and reliable motion prediction. Recent developments have resulted in pipelines for the automated generation of HD maps that reduce the manual effort of creating and updating them. We present FlexMap Fusion, a methodology to automatically update and enhance existing HD vector maps using OpenStreetMap. Our approach is designed to enable the use of HD maps created from LiDAR and camera data within Autoware. The pipeline provides several functionalities: it can georeference both the point cloud map and the vector map using an RTK-corrected GNSS signal, and missing semantic attributes can be conflated from OpenStreetMap into the vector map. Differences between the HD map and OpenStreetMap are visualized for manual refinement by the user. In general, our findings indicate that our approach reduces human labor during HD map generation, increases the scalability of the mapping pipeline, and improves the completeness and usability of the maps. Our methodological choices have limitations that arise especially at complex street structures, e.g., traffic islands; more research is therefore necessary on efficient preprocessing algorithms and on the dynamic adjustment of matching parameters. To enable others to build upon our work, our source code is available at this https URL.
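Georeferencing a locally built map against RTK-corrected GNSS positions is typically a least-squares rigid alignment; below is a standard Kabsch/Umeyama solver in numpy as a sketch of that step. Whether FlexMap Fusion uses exactly this solver (or also estimates scale) is an assumption on our part.

```python
import numpy as np

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) mapping src to dst, both (N, 3),
    e.g., trajectory points in the map frame vs. RTK-GNSS positions."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard keeps det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t                                  # apply as: R @ p + t
```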
https://arxiv.org/abs/2404.10879
We introduce ECLAIR (Extended Classification of Lidar for AI Recognition), a new outdoor large-scale aerial LiDAR dataset designed specifically for advancing research in point cloud semantic segmentation. As the most extensive and diverse collection of its kind to date, the dataset covers a total area of 10 km$^2$ with close to 600 million points and features eleven distinct object categories. To guarantee the dataset's quality and utility, we have thoroughly curated the point labels through an internal team of experts, ensuring accuracy and consistency in semantic labeling. The dataset is engineered to advance the fields of 3D urban modeling, scene understanding, and utility infrastructure management by presenting new challenges and potential applications. As a benchmark, we report qualitative and quantitative analysis of a voxel-based point cloud segmentation approach based on the Minkowski Engine.
https://arxiv.org/abs/2404.10699
Occlusions hinder point cloud frame alignment in LiDAR data, a challenge inadequately addressed by scene flow models tested mainly on occlusion-free datasets. Attempts to integrate occlusion handling within networks often suffer accuracy issues due to two main limitations: a) the inadequate use of occlusion information, often merging it with flow estimation without an effective integration strategy, and b) reliance on distance-weighted upsampling that falls short in correcting occlusion-related errors. To address these challenges, we introduce the Correlation Matrix Upsampling Flownet (CMU-Flownet), incorporating an occlusion estimation module within its cost volume layer, alongside an Occlusion-aware Cost Volume (OCV) mechanism. Specifically, we propose an enhanced upsampling approach that expands the sensory field of the sampling process and integrates a Correlation Matrix designed to evaluate point-level similarity. Meanwhile, our model robustly integrates occlusion data within the context of scene flow, deploying this information strategically during the refinement phase of the flow estimation. The efficacy of this approach is demonstrated through subsequent experimental validation. Empirical assessments reveal that CMU-Flownet establishes state-of-the-art performance on the occluded FlyingThings3D and KITTI datasets, surpassing previous methodologies across a majority of evaluated metrics.
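A distance-only upsampler interpolates coarse flow purely by proximity; the abstract's fix is to let point-level feature similarity steer the interpolation over a wider neighborhood. The sketch below combines a correlation term with a distance penalty inside a softmax; the exact weighting, neighborhood size, and feature spaces in CMU-Flownet are assumptions here.

```python
import torch

def correlation_upsample(dense_xyz, sparse_xyz, sparse_flow,
                         dense_feat, sparse_feat, k=8, temperature=1.0):
    """dense_xyz: (N, 3), sparse_xyz: (M, 3), sparse_flow: (M, 3),
    dense_feat: (N, C), sparse_feat: (M, C). Returns (N, 3) upsampled flow."""
    # k nearest sparse neighbours per dense point (widened sensory field).
    d2 = torch.cdist(dense_xyz, sparse_xyz) ** 2          # (N, M)
    knn_d2, idx = d2.topk(k, dim=1, largest=False)        # both (N, k)
    # Correlation between each dense point and its sparse neighbours.
    corr = (dense_feat[:, None, :] * sparse_feat[idx]).sum(-1)   # (N, k)
    # Similarity raises a neighbour's weight; distance lowers it.
    w = torch.softmax(corr / temperature - knn_d2, dim=1)        # (N, k)
    return (w[..., None] * sparse_flow[idx]).sum(dim=1)          # (N, 3)
```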
https://arxiv.org/abs/2404.10571
Gait is a behavioral biometric modality that can be used to recognize individuals by the way they walk from a far distance. Most existing gait recognition approaches rely on either silhouettes or skeletons, while their joint use is underexplored. Features from silhouettes and skeletons can provide complementary information for more robust recognition against appearance changes or pose estimation errors. To exploit the benefits of both silhouette and skeleton features, we propose a new gait recognition network, referred to as GaitPoint+. Our approach models skeleton key points as a 3D point cloud and employs a computational-complexity-conscious 3D point processing approach to extract skeleton features, which are then combined with silhouette features for improved accuracy. Since silhouette- or CNN-based methods already require a considerable amount of computational resources, it is preferable that the key point learning module be fast and lightweight. We present a detailed analysis of the utilization of every human key point after traditional max-pooling, and show that while elbow and ankle points are used most commonly, many useful points are discarded by max-pooling. Thus, we present a Recycling Max-Pooling module that recycles some of the discarded points during the processing of skeleton point clouds, achieving a further performance improvement. We provide a comprehensive set of experimental results showing that (i) incorporating skeleton features obtained by a point-based 3D point cloud processing approach boosts the performance of three different state-of-the-art silhouette- and CNN-based baselines; (ii) recycling the discarded points increases the accuracy further. Ablation studies are also provided to show the effectiveness and contribution of different components of our approach.
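The recycling idea is easy to state in code: max-pool once, find the points whose features were never the channel-wise winners, and pool them a second time. A minimal PyTorch sketch follows; fusing the two pooled vectors by concatenation is our assumption, and the module in the paper may recycle differently.

```python
import torch

def recycling_max_pool(x: torch.Tensor) -> torch.Tensor:
    """x: (N, C) per-point features of one skeleton point cloud.
    Returns a (2C,) descriptor from two pooling passes."""
    pooled, argmax = x.max(dim=0)                 # (C,), winners per channel
    used = torch.zeros(x.shape[0], dtype=torch.bool)
    used[argmax] = True                           # points pass 1 consumed
    discarded = x[~used]                          # points max-pooling dropped
    if discarded.numel() == 0:                    # degenerate: every point won
        recycled = torch.zeros_like(pooled)
    else:
        recycled, _ = discarded.max(dim=0)        # second pooling pass
    return torch.cat([pooled, recycled])
```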
https://arxiv.org/abs/2404.10213
Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the performance of such 3D zero-shot models remains sub-optimal for real-world adaptation. In this work, we propose an optimization framework, Cross-MoST: Cross-Modal Self-Training, to improve the label-free classification performance of a zero-shot 3D vision model by simply leveraging unlabeled 3D data and their accompanying 2D views. We propose a student-teacher framework to simultaneously process 2D views and 3D point clouds and generate joint pseudo labels to train a classifier and guide cross-modal feature alignment. We thereby demonstrate that 2D vision-language models such as CLIP can be used to complement 3D representation learning and improve classification performance without the need for expensive class annotations. Using synthetic and real-world 3D datasets, we further demonstrate that Cross-MoST enables efficient cross-modal knowledge exchange, with the image and point cloud modalities learning from each other's rich representations.
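The joint pseudo-labeling step can be pictured as averaging the two modalities' teacher predictions and keeping only confident samples; a sketch follows. The averaging and the confidence threshold are assumptions, since the abstract only states that joint pseudo-labels supervise the classifier and guide the alignment.

```python
import torch
import torch.nn.functional as F

def joint_pseudo_labels(logits_2d, logits_3d, threshold=0.7):
    """logits_2d, logits_3d: (B, K) teacher outputs for the 2D views and the
    3D point clouds of the same B unlabeled samples."""
    probs = 0.5 * (F.softmax(logits_2d, dim=1) + F.softmax(logits_3d, dim=1))
    conf, labels = probs.max(dim=1)
    keep = conf > threshold          # mask of confident joint pseudo-labels
    return labels, keep

# The kept labels would then supervise both student branches, e.g.:
# loss = F.cross_entropy(student_2d_logits[keep], labels[keep]) \
#      + F.cross_entropy(student_3d_logits[keep], labels[keep])
```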
https://arxiv.org/abs/2404.10146
Recent advances have demonstrated that Language Vision Models (LVMs) surpass the existing State-of-the-Art (SOTA) in two-dimensional (2D) computer vision tasks, motivating attempts to apply LVMs to three-dimensional (3D) data. While LVMs are efficient and effective in addressing various downstream 2D vision tasks without training, they face significant challenges with point clouds, a representative format for 3D data. Extracting features from 3D data is more difficult, and the large size of 3D data together with the cost of collection and labelling results in notably limited dataset availability. Moreover, constructing LVMs for point clouds is even more challenging due to the amounts of data and training time required. To address these issues, our research aims to 1) apply Grounded SAM through spherical projection to transfer 3D data to 2D, and 2) experiment with synthetic data to evaluate its effectiveness in bridging the gap between synthetic and real-world data domains. Our approach exhibited high performance, with an accuracy of 0.96, an IoU of 0.85, a precision of 0.92, a recall of 0.91, and an F1 score of 0.92, confirming its potential. However, challenges such as occlusion problems and pixel-level overlaps of multi-label points during spherical image generation remain to be addressed in future studies.
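The 3D-to-2D transfer rests on a standard spherical (range-image) projection; a numpy sketch is below. The image resolution and the vertical field of view are arbitrary choices here, and note how the last line exhibits exactly the pixel-level overlap of multiple points that the abstract flags as an open issue.

```python
import numpy as np

def spherical_projection(points: np.ndarray, h: int = 64, w: int = 1024):
    """Project a point cloud (N, 3) to an (h, w) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / r, -1.0, 1.0))  # elevation
    fov_up, fov_down = np.radians(15.0), np.radians(-25.0)  # assumed FOV
    u = ((1 - (yaw + np.pi) / (2 * np.pi)) * w).astype(int) % w
    v = ((fov_up - pitch) / (fov_up - fov_down) * h).clip(0, h - 1).astype(int)
    image = np.zeros((h, w), dtype=np.float32)
    image[v, u] = r        # points landing on the same pixel overwrite
    return image
```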
https://arxiv.org/abs/2404.09931
Large garages are ubiquitous yet intricate scenes in our daily lives, posing challenges characterized by monotonous colors, repetitive patterns, reflective surfaces, and transparent vehicle glass. Conventional Structure from Motion (SfM) methods for camera pose estimation and 3D reconstruction fail in these environments due to poor correspondence construction. To address these challenges, this paper introduces LetsGo, a LiDAR-assisted Gaussian splatting approach for large-scale garage modeling and rendering. We develop a handheld scanner, Polar, equipped with IMU, LiDAR, and a fisheye camera, to facilitate accurate LiDAR and image data scanning. With this Polar device, we present GarageWorld, a dataset consisting of five expansive garage scenes with diverse geometric structures, which we will release to the community for further research. We demonstrate that the LiDAR point clouds collected by the Polar device enhance a suite of 3D Gaussian splatting algorithms for garage scene modeling and rendering. We also propose a novel depth regularizer for training 3D Gaussian splatting algorithms, effectively eliminating floating artifacts in rendered images, and a lightweight Level of Detail (LOD) Gaussian renderer for real-time viewing on web-based devices. Additionally, we explore a hybrid representation that combines the advantages of traditional meshes in depicting simple geometry and colors (e.g., walls and the ground) with modern 3D Gaussian representations capturing complex details and high-frequency textures. This strategy achieves an optimal balance between memory footprint and rendering quality. Experimental results on our dataset, along with ScanNet++ and KITTI-360, demonstrate the superiority of our method in rendering quality and resource efficiency.
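One plausible form of the depth regularizer is an L1 penalty tying rendered Gaussian-splatting depth to projected LiDAR depth wherever returns exist, which directly punishes floaters hovering in front of observed surfaces; the formulation below is our assumption, not the paper's exact loss.

```python
import torch

def lidar_depth_regularizer(rendered_depth, lidar_depth, weight=0.1):
    """rendered_depth, lidar_depth: (H, W) for one view; lidar_depth is 0
    where no LiDAR return was projected."""
    valid = lidar_depth > 0
    if valid.sum() == 0:
        return rendered_depth.new_zeros(())   # no supervision in this view
    return weight * (rendered_depth[valid] - lidar_depth[valid]).abs().mean()
```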
https://arxiv.org/abs/2404.09748
While 3D Gaussian Splatting has recently become popular for neural rendering, current methods rely on carefully engineered cloning and splitting strategies for placing Gaussians, which do not always generalize and may lead to poor-quality renderings. In addition, for real-world scenes, they rely on a good initial point cloud to perform well. In this work, we rethink 3D Gaussians as random samples drawn from an underlying probability distribution describing the physical representation of the scene -- in other words, Markov Chain Monte Carlo (MCMC) samples. Under this view, we show that the 3D Gaussian updates are strikingly similar to a Stochastic Gradient Langevin Dynamics (SGLD) update. As with MCMC, samples are nothing but past visit locations, so adding new Gaussians under our framework can be realized without heuristics, simply by placing Gaussians at existing Gaussian locations. To encourage using fewer Gaussians for efficiency, we introduce an L1-regularizer on the Gaussians. On various standard evaluation scenes, we show that our method provides improved rendering quality, easy control over the number of Gaussians, and robustness to initialization.
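For reference, the SGLD update the abstract appeals to is x <- x - lr * grad + sqrt(2 * lr) * noise; the one-line sketch below makes the analogy explicit. Applying it to Gaussian parameters this directly is our simplification of the paper's scheme.

```python
import torch

def sgld_step(params, grad, lr=1e-4):
    """One Stochastic Gradient Langevin Dynamics step. Under the MCMC view,
    the usual 3D Gaussian update looks like this, and 'densification' is
    re-sampling: new Gaussians are spawned at existing Gaussian locations
    rather than via heuristic clone/split rules."""
    noise = torch.randn_like(params)
    return params - lr * grad + (2.0 * lr) ** 0.5 * noise
```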
https://arxiv.org/abs/2404.09591