Recent advances in Vision and Language Models (VLMs) have improved open-world 3D representation, facilitating 3D zero-shot capability in unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and corresponding texts. However, the limited color and texture variations in CAD images can compromise the alignment robustness. Furthermore, the volume discrepancy between the pre-training datasets of the 3D encoder and the VLM leads to sub-optimal 2D-to-3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes depth map projection and integrates depth-specific text prompts, improving the adaptation of 2D VLM knowledge to 3D learning through efficient fine-tuning. Experimental results show that OpenDlign significantly outperforms existing methods on zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with just 6 million tuned parameters. Moreover, integrating the generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.
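As a concrete illustration of the first stage, here is a minimal sketch (not the authors' code) of projecting a point cloud into a single-view depth map that an image generator could then condition on to synthesize a depth-aligned RGB image; the resolution and normalization choices are assumptions.

```python
import numpy as np

def project_depth_map(points: np.ndarray, resolution: int = 224) -> np.ndarray:
    """Orthographically project an (N, 3) point cloud onto the XY plane,
    keeping the nearest depth (smallest z) per pixel."""
    pts = points - points.min(axis=0)
    pts = pts / pts.max()                      # normalize into the unit cube
    u = np.clip((pts[:, 0] * (resolution - 1)).astype(int), 0, resolution - 1)
    v = np.clip((pts[:, 1] * (resolution - 1)).astype(int), 0, resolution - 1)
    depth = np.full((resolution, resolution), np.inf, dtype=np.float32)
    for x, y, z in zip(u, v, pts[:, 2]):
        depth[y, x] = min(depth[y, x], z)      # keep the closest point per pixel
    depth[np.isinf(depth)] = 0.0               # empty pixels become background
    return depth

depth = project_depth_map(np.random.rand(2048, 3))  # 224x224 depth image
```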
https://arxiv.org/abs/2404.16538
Recent advancements in self-supervised learning in the point cloud domain have demonstrated significant potential. However, these methods often suffer from drawbacks, including lengthy pre-training time, the necessity of reconstruction in the input space, or the necessity of additional modalities. To address these issues, we introduce Point-JEPA, a joint embedding predictive architecture designed specifically for point cloud data. To this end, we introduce a sequencer that orders point cloud tokens so that token proximity can be computed and exploited efficiently, based on token indices, during target and context selection. The sequencer also allows the token-proximity computation to be shared between context and target selection, further improving efficiency. Experimentally, our method achieves results competitive with state-of-the-art methods while avoiding reconstruction in the input space and the need for additional modalities.
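One simple way to realize such a sequencer (an illustrative sketch, not necessarily the paper's exact algorithm) is a greedy nearest-neighbor ordering of token centers, so that nearby indices correspond to spatially close tokens:

```python
import numpy as np

def sequence_tokens(centers: np.ndarray) -> np.ndarray:
    """Greedy nearest-neighbor ordering of (N, 3) token centers."""
    n = len(centers)
    remaining = set(range(n))
    order = [0]                      # start from an arbitrary token
    remaining.remove(0)
    while remaining:
        last = centers[order[-1]]
        idx = min(remaining, key=lambda i: np.sum((centers[i] - last) ** 2))
        order.append(idx)
        remaining.remove(idx)
    return np.array(order)

# With such an ordering, contiguous index ranges form spatially coherent
# context/target blocks, and the ordering is computed once and reused
# for both context and target selection.
order = sequence_tokens(np.random.rand(64, 3))
```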
https://arxiv.org/abs/2404.16432
This paper presents a robust fine-tuning method designed for pre-trained 3D point cloud models, to enhance feature robustness in downstream fine-tuned models. We highlight the limitations of current fine-tuning methods and the challenges of learning robust models. The proposed method, named Weight-Space Ensembles for Fine-Tuning then Linear Probing (WiSE-FT-LP), integrates the original pre-trained and fine-tuned models through weight-space interpolation, followed by linear probing. This approach significantly enhances the performance of downstream fine-tuned models under distribution shifts, improving feature robustness while maintaining high performance on the target distribution. We apply this robust fine-tuning method to mainstream 3D point cloud pre-trained models and evaluate the quality of the model parameters and the degradation of downstream task performance. Experimental results demonstrate the effectiveness of WiSE-FT-LP in enhancing model robustness, effectively balancing downstream task performance and feature robustness without altering the model structure.
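The weight-space ensembling step reduces to a linear interpolation of parameters; a minimal sketch, assuming both checkpoints share one architecture (the mixing coefficient alpha is illustrative):

```python
import torch

def wise_ft(pretrained: dict, finetuned: dict, alpha: float = 0.5) -> dict:
    """Interpolate two state dicts: theta = (1 - alpha) * pre + alpha * ft."""
    return {k: (1 - alpha) * pretrained[k] + alpha * finetuned[k]
            for k in pretrained}

# Usage sketch:
#   model.load_state_dict(wise_ft(pre_sd, ft_sd, alpha=0.5))
# then freeze the backbone and fit only a linear classification head
# on the downstream task (the "LP" stage).
```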
https://arxiv.org/abs/2404.16422
LiDAR-based 3D object detection has become an essential part of automated driving due to its ability to localize and classify objects precisely in 3D. However, object detectors face a critical challenge when dealing with unknown foreground objects, particularly those that were not present in their original training data. These out-of-distribution (OOD) objects can lead to misclassifications, posing a significant risk to the safety and reliability of automated vehicles. Currently, LiDAR-based OOD object detection has not been well studied. We address this problem by generating synthetic training data for OOD objects through perturbation of known object categories. Our idea is that these synthetic OOD objects produce different responses in the feature map of an object detector than in-distribution (ID) objects do. We then extract features using a pre-trained, fixed object detector and train a simple multilayer perceptron (MLP) to classify each detection as either ID or OOD. In addition, we propose a new evaluation protocol that allows the use of existing datasets without modifying the point cloud, ensuring a more authentic evaluation of real-world scenarios. The effectiveness of our method is validated through experiments on the newly proposed nuScenes OOD benchmark. The source code is available at this https URL.
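The final classification stage is a small binary MLP over frozen detector features; a minimal sketch with synthetic stand-in features (in the paper, features come from the frozen LiDAR detector; here the feature dimension and blob statistics are placeholders):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
id_feats = rng.normal(0.0, 1.0, size=(500, 128))   # features of ID detections
ood_feats = rng.normal(2.0, 1.5, size=(500, 128))  # features of perturbed/OOD objects
X = np.vstack([id_feats, ood_feats])
y = np.array([0] * 500 + [1] * 500)                # 0 = ID, 1 = OOD

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X, y)
# At inference, each detection's feature vector is scored as ID vs. OOD.
print(clf.predict(X[:3]))
```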
https://arxiv.org/abs/2404.15879
In this work, we explore a novel task of generating human grasps based on single-view scene point clouds, which more accurately mirrors the typical real-world situation of observing objects from a single viewpoint. Due to the incompleteness of object point clouds and the presence of numerous scene points, the generated hand is prone to penetrating the invisible parts of the object, and the model is easily affected by scene points. Thus, we introduce S2HGrasp, a framework composed of two key modules: the Global Perception module, which globally perceives partial object point clouds, and the DiffuGrasp module, designed to generate high-quality human grasps from complex inputs that include scene points. Additionally, we introduce the S2HGD dataset, which comprises approximately 99,000 single-object single-view scene point clouds of 1,668 unique objects, each annotated with one human grasp. Our extensive experiments demonstrate that S2HGrasp can not only generate natural human grasps regardless of scene points, but also effectively prevent penetration between the hand and invisible parts of the object. Moreover, our model showcases strong generalization capability when applied to unseen objects. Our code and dataset are available at this https URL.
https://arxiv.org/abs/2404.15815
Face Recognition Systems (FRS) are widely used in commercial environments, such as e-commerce and e-banking, owing to their high accuracy in real-world conditions. However, these systems are vulnerable to facial morphing attacks, which are generated by blending face color images of different subjects. This paper presents a new method for generating 3D face morphs from two bona fide point clouds. The proposed method first selects bona fide point clouds with neutral expressions. The two input point clouds are then registered using Bayesian Coherent Point Drift (BCPD) without optimization, and the geometry and color of the registered point clouds are averaged to generate a face-morphing point cloud. The proposed method generates 388 face-morphing point clouds from 200 bona fide subjects. The effectiveness of the method is demonstrated through extensive vulnerability experiments, achieving a Generalized Morphing Attack Potential (G-MAP) of 97.93%, which is superior to the existing state-of-the-art (SOTA) with a G-MAP of 81.61%.
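A minimal sketch of the morph-generation step after BCPD registration, assuming the two clouds are already in one-to-one point correspondence (which registration provides); the equal weighting is illustrative:

```python
import numpy as np

def morph_point_clouds(xyz_a, rgb_a, xyz_b, rgb_b, w: float = 0.5):
    """Average geometry and color of two registered (N, 3) clouds."""
    xyz_m = w * xyz_a + (1 - w) * xyz_b   # averaged geometry
    rgb_m = w * rgb_a + (1 - w) * rgb_b   # averaged color
    return xyz_m, rgb_m

xyz, rgb = morph_point_clouds(np.random.rand(100, 3), np.random.rand(100, 3),
                              np.random.rand(100, 3), np.random.rand(100, 3))
```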
https://arxiv.org/abs/2404.15765
With the rapid advancement of 3D sensing technologies, obtaining 3D shape information of objects has become increasingly convenient. LiDAR technology, with its capability to accurately capture the 3D information of objects at long distances, has been widely applied in the collection of 3D data in urban scenes. However, the collected point cloud data often exhibit incompleteness due to factors such as occlusion, signal absorption, and specular reflection. This paper explores the application of point cloud completion technologies to processing these incomplete data and establishes a new real-world benchmark, the Building-PCC dataset, to evaluate the performance of existing deep learning methods on the task of urban building point cloud completion. Through a comprehensive evaluation of different methods, we analyze the key challenges faced in building point cloud completion, aiming to promote innovation in the field of 3D geoinformation applications. Our source code is available at this https URL.
https://arxiv.org/abs/2404.15644
Numerous prior studies predominantly emphasize constructing relation vectors for individual neighborhood points, generating dynamic kernels for each vector, and embedding these into high-dimensional spaces to capture implicit local structures. However, we contend that such implicit high-dimensional structure modeling approaches inadequately represent the local geometric structure of point clouds due to the absence of explicit structural information. Hence, we introduce X-3D, an explicit 3D structure modeling approach. X-3D functions by capturing the explicit local structural information within the input 3D space and employing it to produce dynamic kernels with shared weights for all neighborhood points within the current local region. This modeling approach introduces an effective geometric prior and significantly diminishes the disparity between the local structure of the embedding space and the original input point cloud, thereby improving the extraction of local features. Experiments show that our method can be applied to a variety of methods and achieves state-of-the-art performance with lower extra computational cost: \textbf{90.7\%} on ScanObjectNN for classification; \textbf{79.2\%} on S3DIS 6-fold and \textbf{74.3\%} on S3DIS Area 5 for segmentation; \textbf{76.3\%} on ScanNetV2 for segmentation; and, for detection, \textbf{64.5\%} and \textbf{46.9\%} mAP on SUN RGB-D and \textbf{69.0\%} and \textbf{51.1\%} mAP on ScanNetV2. Our code is available at \href{this https URL}{this https URL}.
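An illustrative reading of the core idea (a sketch under stated assumptions, not the released code): summarize the explicit local structure of a neighborhood, generate one dynamic kernel from that summary, and apply the kernel with shared weights to every neighbor in the region. The structure summary and layer sizes here are assumptions.

```python
import torch
import torch.nn as nn

class SharedDynamicKernel(nn.Module):
    def __init__(self, c_in: int = 3, c_out: int = 32):
        super().__init__()
        # maps a summary of explicit local structure to kernel weights
        self.gen = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, c_in * c_out))
        self.c_in, self.c_out = c_in, c_out

    def forward(self, neighbors: torch.Tensor) -> torch.Tensor:
        # neighbors: (B, K, 3) offsets of K neighbors from each center point
        struct = neighbors.mean(dim=1)                              # (B, 3) structure summary
        kernel = self.gen(struct).view(-1, self.c_in, self.c_out)   # one kernel per region
        feats = torch.bmm(neighbors, kernel)                        # shared across all K neighbors
        return feats.max(dim=1).values                              # (B, c_out) pooled local feature

feats = SharedDynamicKernel()(torch.randn(8, 16, 3))
```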
https://arxiv.org/abs/2404.15010
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSMs), outperforms the Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning that enhances local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token-forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA results, including an overall accuracy of 92.6% (trained from scratch) on ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, all with only linear complexity.
https://arxiv.org/abs/2404.14966
Recently, X-ray microscopy (XRM) and light-sheet fluorescence microscopy (LSFM) have emerged as two pivotal imaging tools in preclinical research on bone remodeling diseases, offering micrometer-level resolution. Integrating these complementary modalities provides a holistic view of bone microstructures, facilitating function-oriented volume analysis across different disease cycles. However, registering such independently acquired large-scale volumes is extremely challenging in real, reference-free scenarios. This paper presents a fast two-stage pipeline for volume registration of XRM and LSFM. The first stage extracts surface features and employs two successive point-cloud-based methods for coarse alignment. The second stage fine-tunes the initial alignment using a modified cross-correlation method, ensuring precise volumetric registration. Moreover, we propose residual similarity as a novel metric to assess the alignment of two complementary modalities. The results show robust, gradual improvement across the stages. In the end, all correlating microstructures, particularly lacunae in XRM and bone cells in LSFM, are precisely matched, enabling new insights into bone diseases like osteoporosis, which are a substantial burden in aging societies.
https://arxiv.org/abs/2404.14807
The fusion of multimodal sensor data streams such as camera images and lidar point clouds plays an important role in the operation of autonomous vehicles (AVs). Robust perception across a range of adverse weather and lighting conditions is specifically required for AVs to be deployed widely. While multi-sensor fusion networks have previously been developed for perception in sunny and clear weather conditions, these methods show a significant degradation in performance under night-time and poor weather conditions. In this paper, we propose a simple yet effective technique called ContextualFusion to incorporate domain knowledge about how cameras and lidars behave differently across lighting and weather variations into 3D object detection models. Specifically, we design a Gated Convolutional Fusion (GatedConv) approach for the fusion of sensor streams based on the operational context. To aid in our evaluation, we use the open-source simulator CARLA to create a multimodal adverse-condition dataset called AdverseOp3D to address the shortcomings of existing datasets, which are biased towards daytime and good-weather conditions. Our ContextualFusion approach yields an mAP improvement of 6.2% over state-of-the-art methods on our context-balanced synthetic dataset. Finally, our method enhances state-of-the-art 3D object detection performance at night on the real-world NuScenes dataset with a significant mAP improvement of 11.7%.
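A hedged sketch of context-gated fusion in the spirit of GatedConv: a context vector (e.g., night/rain flags) modulates a learned gate that blends camera and lidar feature maps. The shapes and the exact gating form are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class ContextGatedFusion(nn.Module):
    def __init__(self, channels: int = 64, ctx_dim: int = 2):
        super().__init__()
        self.gate_conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.ctx_proj = nn.Linear(ctx_dim, channels)  # lifts context to channel space

    def forward(self, cam, lidar, ctx):
        # cam, lidar: (B, C, H, W) features; ctx: (B, ctx_dim), e.g., [is_night, is_rain]
        bias = self.ctx_proj(ctx)[:, :, None, None]           # broadcast over H, W
        gate = torch.sigmoid(self.gate_conv(torch.cat([cam, lidar], dim=1)) + bias)
        return gate * cam + (1.0 - gate) * lidar              # context-aware blend

fused = ContextGatedFusion()(torch.randn(2, 64, 32, 32),
                             torch.randn(2, 64, 32, 32),
                             torch.tensor([[1.0, 0.0], [0.0, 1.0]]))
```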
https://arxiv.org/abs/2404.14780
Evaluating the performance of Multi-modal Large Language Models (MLLMs) that integrate both point clouds and language presents significant challenges. The lack of a comprehensive assessment makes it difficult to determine whether these models truly represent advancements, thereby impeding further progress in the field. Current evaluations rely heavily on classification and captioning tasks, falling short of providing a thorough assessment of MLLMs. There is a pressing need for a more sophisticated evaluation method capable of thoroughly analyzing the spatial understanding and expressive capabilities of these models. To address these issues, we introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset, known as 3DBench, providing an extensible platform for comprehensive evaluation of MLLMs. Specifically, we establish a benchmark spanning a wide range of spatial and semantic scales, from object level to scene level, addressing both perception and planning tasks. Furthermore, we present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total. Thorough experiments evaluating trending MLLMs, comparisons against existing datasets, and variations of training protocols demonstrate the superiority of 3DBench, offering valuable insights into current limitations and potential research directions.
https://arxiv.org/abs/2404.14678
Lane detection enables highly functional autonomous driving systems to understand driving scenes even in complex environments. In this paper, we work towards developing a generalized computer vision system able to detect lanes without using any annotation. We make the following contributions: (i) We illustrate how to perform unsupervised 3D lane segmentation by leveraging the distinctive intensity of lanes on the LiDAR point cloud frames, and then obtain noisy lane labels in the 2D plane by projecting the 3D points; (ii) We propose a novel self-supervised training scheme, dubbed LaneCorrect, that automatically corrects the lane labels by learning geometric consistency and instance awareness from adversarial augmentations; (iii) With the self-supervised pre-trained model, we distill it to train a student network for arbitrary target lane (e.g., TuSimple) detection without any human labels; (iv) We thoroughly evaluate our self-supervised method on four major lane detection benchmarks (including TuSimple, CULane, CurveLanes, and LLAMAS) and demonstrate excellent performance compared with existing supervised counterparts, while showing more effective results in alleviating the domain gap, i.e., training on CULane and testing on TuSimple.
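A minimal sketch of step (i): lane paint retro-reflects strongly, so high-intensity ground points can be taken as noisy 3D lane candidates and projected to the image plane. The threshold and camera model here are illustrative assumptions.

```python
import numpy as np

def noisy_lane_labels(points, intensity, K, T, thresh=0.8):
    """points: (N, 3) LiDAR xyz; intensity: (N,) normalized intensities;
    K: (3, 3) camera intrinsics; T: (4, 4) lidar-to-camera extrinsics."""
    lane = points[intensity > thresh]                 # bright (painted) points
    homo = np.hstack([lane, np.ones((len(lane), 1))])
    cam = (T @ homo.T).T[:, :3]                       # into the camera frame
    cam = cam[cam[:, 2] > 0]                          # keep points in front
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]                     # noisy 2D lane labels (pixels)

uv = noisy_lane_labels(np.random.rand(1000, 3) * 10, np.random.rand(1000),
                       np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]]),
                       np.eye(4))
```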
https://arxiv.org/abs/2404.14671
We have developed the world's first canopy height map of the distribution area of world-level giant trees. This mapping is crucial for discovering more individual and community world-level giant trees and for analyzing and quantifying the effectiveness of biodiversity conservation measures in the Yarlung Tsangpo Grand Canyon (YTGC) National Nature Reserve. We propose a method to map the canopy height of the primeval forest within the world-level giant tree distribution area using deep learning driven by fused spaceborne LiDAR and satellite imagery (Global Ecosystem Dynamics Investigation (GEDI), ICESat-2, and Sentinel-2). We customized a pyramid receptive-field depthwise-separable CNN (PRFXception), an architecture tailored to mapping primeval forest canopy height, which infers canopy height at the footprint level of GEDI and ICESat-2 from Sentinel-2 optical imagery with a 10-meter spatial resolution. We conducted a field survey of 227 permanent plots using a stratified sampling method and measured several giant trees using UAV-LS. The predicted canopy height was compared with ICESat-2 and GEDI validation data (RMSE = 7.56 m, MAE = 6.07 m, ME = -0.98 m, R^2 = 0.58), UAV-LS point clouds (RMSE = 5.75 m, MAE = 3.72 m, ME = 0.82 m, R^2 = 0.65), and ground survey data (RMSE = 6.75 m, MAE = 5.56 m, ME = 2.14 m, R^2 = 0.60). We mapped the potential distribution of world-level giant trees and discovered two previously undetected giant tree communities with an 89% probability of containing trees 80-100 m tall, potentially taller than Asia's tallest known tree. This paper provides scientific evidence confirming southeastern Tibet--northwestern Yunnan as the fourth global distribution center of world-level giant trees, and it promotes the inclusion of the YTGC giant tree distribution area within the scope of China's national park conservation initiatives.
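The validation statistics quoted above are standard regression metrics; for reference, a minimal sketch of how they are computed from predicted and reference canopy heights (note that R^2 is unitless):

```python
import numpy as np

def height_metrics(pred: np.ndarray, ref: np.ndarray) -> dict:
    err = pred - ref
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((ref - ref.mean()) ** 2)
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),  # spread of errors (m)
        "MAE": float(np.mean(np.abs(err))),         # average error magnitude (m)
        "ME": float(np.mean(err)),                  # bias: positive = overestimate (m)
        "R2": float(1.0 - ss_res / ss_tot),         # unitless goodness of fit
    }

print(height_metrics(np.array([25.0, 30.0, 80.0]), np.array([24.0, 33.0, 76.0])))
```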
https://arxiv.org/abs/2404.14661
Apple trees, being deciduous, shed their leaves each year, preceded by a change in leaf color from green to yellow (known as senescence) during the fall season. The rate and timing of this color change are affected by a number of factors, including nitrogen (N) deficiency. The green color of leaves is highly dependent on chlorophyll content, which in turn depends on the nitrogen concentration in the leaves. Assessing leaf color can therefore give vital information on the nutrient status of the tree, and a machine-vision-based system to capture and quantify these timings and changes in leaf color can be a great tool for that purpose. This study is based on data collected during the fall of 2021 and 2023 at a commercial orchard using a ground-based stereo-vision sensor over five weeks. The point cloud obtained from the sensor was segmented to isolate the tree in the foreground. The study involved segmenting trees against a natural background using point cloud data and quantifying their color with a custom-defined metric, the \textit{yellowness index}, varying from $-1$ to $+1$ ($-1$ being completely green and $+1$ being completely yellow), which reflects the proportion of yellow leaves on a tree. The performance of a K-means-based algorithm and a gradient boosting algorithm were compared for \textit{yellowness index} calculation. The segmentation method proposed in the study estimated the \textit{yellowness index} on the trees with $R^2 = 0.72$. The results showed that the metric was able to capture the gradual color transition from green to yellow over the study duration. Trees with lower nitrogen showed the color transition to yellow earlier than trees with higher nitrogen. In both years, the onset of the color transition aligned with the $29^{th}$ week post-full bloom.
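A hedged sketch of the \textit{yellowness index}: once leaf points are classified as green or yellow (the paper compares K-means and gradient boosting for this), an index in $[-1, +1]$ follows from the class proportions. The exact formula below is an assumption consistent with the stated range, not the paper's definition.

```python
import numpy as np

def yellowness_index(labels: np.ndarray) -> float:
    """labels: per-point array with 1 = yellow leaf, 0 = green leaf.
    Returns -1 (completely green) .. +1 (completely yellow)."""
    p_yellow = labels.mean()          # proportion of yellow leaf points
    return float(2.0 * p_yellow - 1.0)

print(yellowness_index(np.array([1, 1, 0, 0, 0, 0])))  # -> -0.33 (mostly green)
```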
https://arxiv.org/abs/2404.14653
We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Unlike other learning-based reconstruction methods, we require neither pose priors nor sequential inputs, and we optimize efficiently over thousands of images. Our method, ACE0 (ACE Zero), estimates camera poses to an accuracy comparable to feature-based SfM, as demonstrated by novel view synthesis. Project page: this https URL
https://arxiv.org/abs/2404.14351
This project conducts research on robot path planning based on Visual SLAM. The main work is as follows: (1) Construction of a Visual SLAM system. We study the basic architecture of Visual SLAM and develop a Visual SLAM system based on the ORB-SLAM3 system that can perform dense point-cloud mapping. (2) Obtaining a map suitable for two-dimensional path planning through map conversion. This part converts the dense point-cloud map obtained by the Visual SLAM system into an octomap and then applies a projection transformation to obtain a grid map. The conversion turns the dense point-cloud map, which contains a large amount of redundant information, into an extremely lightweight grid map suitable for path planning. (3) Research on path-planning algorithms based on reinforcement learning. We experimentally compare the Q-learning, DQN, and SARSA algorithms and find that DQN converges fastest and performs best in high-dimensional complex environments. The Visual SLAM system is experimentally verified in a simulation environment; results on an open-source dataset and a self-made dataset prove the feasibility and effectiveness of the designed system. The three reinforcement learning algorithms are also compared under the same experimental conditions to determine the optimal algorithm for those conditions.
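For reference, a minimal tabular Q-learning sketch on a toy occupancy grid of the kind produced by the octomap-to-grid conversion; the grid, rewards, and hyperparameters are illustrative, and DQN replaces the Q-table with a neural network.

```python
import numpy as np

grid = np.zeros((5, 5)); grid[2, 1:4] = 1           # 1 = obstacle cell
goal, actions = (4, 4), [(-1, 0), (1, 0), (0, -1), (0, 1)]
Q = np.zeros((5, 5, 4))                              # Q-value per (row, col, action)
rng = np.random.default_rng(0)

for _ in range(2000):                                # training episodes
    s = (0, 0)
    for _ in range(50):
        a = rng.integers(4) if rng.random() < 0.1 else int(Q[s].argmax())
        ns = (s[0] + actions[a][0], s[1] + actions[a][1])
        if not (0 <= ns[0] < 5 and 0 <= ns[1] < 5) or grid[ns]:
            ns, r = s, -1.0                          # blocked move: stay, penalize
        else:
            r = 10.0 if ns == goal else -0.1         # step cost encourages short paths
        Q[s][a] += 0.1 * (r + 0.9 * Q[ns].max() - Q[s][a])
        s = ns
        if s == goal:
            break
```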
https://arxiv.org/abs/2404.14077
The increasing adoption of 3D point cloud data in various applications, such as autonomous vehicles, robotics, and virtual reality, has brought about significant advancements in object recognition and scene understanding. However, this progress is accompanied by new security challenges, particularly in the form of backdoor attacks. These attacks involve inserting malicious information into the training data of machine learning models, potentially compromising the model's behavior. In this paper, we propose CloudFort, a novel defense mechanism designed to enhance the robustness of 3D point cloud classifiers against backdoor attacks. CloudFort leverages spatial partitioning and ensemble prediction techniques to effectively mitigate the impact of backdoor triggers while preserving the model's performance on clean data. We evaluate the effectiveness of CloudFort through extensive experiments, demonstrating its strong resilience against the Point Cloud Backdoor Attack (PCBA). Our results show that CloudFort significantly enhances the security of 3D point cloud classification models without compromising their accuracy on benign samples. Furthermore, we explore the limitations of CloudFort and discuss potential avenues for future research in the field of 3D point cloud security. The proposed defense mechanism represents a significant step towards ensuring the trustworthiness and reliability of point-cloud-based systems in real-world applications.
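A hedged sketch of the spatial-partitioning ensemble idea: classify the full cloud and sub-clouds with one octant removed, then take a majority vote, so a spatially localized backdoor trigger is excluded from some of the votes. The partitioning scheme and voting rule here are assumptions, not the paper's exact design.

```python
import numpy as np
from collections import Counter

def cloudfort_predict(points: np.ndarray, classify) -> int:
    """points: (N, 3); classify: callable mapping a point cloud to a label."""
    center = points.mean(axis=0)
    octant = ((points > center) * [1, 2, 4]).sum(axis=1)  # octant id 0..7 per point
    votes = [classify(points)]
    for o in range(8):
        subset = points[octant != o]                      # drop one octant
        if len(subset) > 0:
            votes.append(classify(subset))
    return Counter(votes).most_common(1)[0][0]            # majority vote

# Example with a trivial stand-in classifier:
label = cloudfort_predict(np.random.rand(1024, 3), classify=lambda pc: 0)
```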
https://arxiv.org/abs/2404.14042
Point cloud registration is a fundamental technique in 3-D computer vision with applications in graphics, autonomous driving, and robotics. However, registration can be difficult in challenging conditions where noise or perturbations are prevalent. We propose a robust point cloud registration approach that leverages graph neural partial differential equations (PDEs) and heat kernel signatures. Our method first uses graph neural PDE modules to extract high-dimensional features from point clouds by aggregating information from the 3-D point neighborhood, thereby enhancing the robustness of the feature representations. Then, we incorporate heat kernel signatures into an attention mechanism to efficiently obtain corresponding keypoints. Finally, a singular value decomposition (SVD) module with learnable weights is used to predict the transformation between two point clouds. Empirical experiments on a 3-D point cloud dataset demonstrate that our approach not only achieves state-of-the-art performance for point cloud registration but also exhibits better robustness to additive noise and 3-D shape perturbations.
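The closing SVD step is the classical weighted Kabsch/Procrustes solution; a minimal sketch, where in the paper the per-correspondence weights would come from the learned attention stage:

```python
import numpy as np

def svd_transform(src, dst, w=None):
    """src, dst: (N, 3) corresponding keypoints; w: optional (N,) weights.
    Returns rotation R and translation t with dst ~= (R @ src.T).T + t."""
    w = np.ones(len(src)) if w is None else w / w.sum()
    mu_s, mu_d = w @ src, w @ dst                        # weighted centroids
    H = (src - mu_s).T @ np.diag(w) @ (dst - mu_d)       # weighted covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflections
    R = Vt.T @ S @ U.T
    t = mu_d - R @ mu_s
    return R, t
```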
https://arxiv.org/abs/2404.14034
Over the years, scene understanding has attracted growing interest in computer vision, providing the semantic and physical scene information necessary for robots to complete particular tasks autonomously. In 3D scenes, rich spatial geometric and topological information is often ignored by RGB-based approaches to scene understanding. In this study, we develop a bottom-up approach for scene understanding that infers support relations between objects from a point cloud. Our approach utilizes the spatial topology information of plane pairs in the scene and consists of three major steps: 1) detection of pairwise spatial configurations, dividing primitive pairs into local support connections and local inner connections; 2) primitive classification, applying a combinatorial optimization method to classify primitives; and 3) support-relation inference and hierarchy-graph construction, inferring support relations bottom-up and building a scene hierarchy graph spanning the primitive and object levels. Through experiments, we demonstrate that the algorithm achieves excellent performance in primitive classification and support-relation inference. Additionally, we show that the scene hierarchy graph contains rich geometric and topological information about objects and possesses great scalability for scene understanding.
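A hedged sketch of one pairwise check from step 1: two roughly horizontal planar primitives form a candidate local support connection when one rests on top of the other with overlapping horizontal footprints. The thresholds are illustrative, and the paper's actual classification uses combinatorial optimization rather than this heuristic alone.

```python
import numpy as np

def is_support_connection(lower: np.ndarray, upper: np.ndarray,
                          z_tol: float = 0.05) -> bool:
    """lower, upper: (N, 3) points of two planar primitives (z is up)."""
    gap = upper[:, 2].min() - lower[:, 2].max()          # vertical clearance
    if not (0.0 <= gap <= z_tol):                        # touching, not floating
        return False
    lo_min, lo_max = lower[:, :2].min(0), lower[:, :2].max(0)
    up_min, up_max = upper[:, :2].min(0), upper[:, :2].max(0)
    overlap = np.all(up_max >= lo_min) and np.all(lo_max >= up_min)
    return bool(overlap)                                 # horizontal footprints intersect
```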
https://arxiv.org/abs/2404.13842