DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers and thereby limiting performance improvement. Recently, State Space Models (SSMs) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point clouds and SSMs: the serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM, the inter-state attention mechanism models the relationships between state points, and the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on the ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new state of the art (SOTA) on the ScanNet V2 and SUN RGB-D datasets.
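To make the core idea concrete, here is a minimal PyTorch sketch of an interactive state-space update in which query vectors act as system states and serialized scene points as system inputs; the gating and readout parameterization below is our illustrative assumption, not DEST's exact design.

```python
import torch
import torch.nn as nn

class ToyInteractiveSSM(nn.Module):
    """Illustrative sketch only: M query vectors act as SSM states that every
    scene-point input updates, and each scene point is in turn refreshed by a
    readout from the current states: a single linear-time recurrent pass."""
    def __init__(self, d, m):
        super().__init__()
        self.A = nn.Linear(d, m)  # input-dependent forget gate per query state
        self.B = nn.Linear(d, m)  # input-dependent write strength per query state
        self.C = nn.Linear(d, m)  # input-dependent read weights over query states

    def forward(self, points, states):         # points: (N, d), states: (M, d)
        refreshed = []
        for x in points:                        # O(N) scan over serialized points
            a = torch.sigmoid(self.A(x))        # (M,) how much each query keeps
            b = self.B(x)                       # (M,) how strongly x is written in
            states = a.unsqueeze(-1) * states + b.unsqueeze(-1) * x
            y = torch.softmax(self.C(x), dim=0) @ states  # updated point feature
            refreshed.append(y)
        return torch.stack(refreshed), states   # both points and queries updated

ssm = ToyInteractiveSSM(d=32, m=8)
pts, queries = ssm(torch.randn(100, 32), torch.zeros(8, 32))
```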
https://arxiv.org/abs/2503.14493
Obstacle avoidance for unmanned aerial vehicles such as quadrotors is a popular research topic. Most existing research focuses only on static environments, and obstacle avoidance in environments with multiple dynamic obstacles remains challenging. This paper proposes a novel deep reinforcement learning-based approach for quadrotors to navigate through highly dynamic environments. We propose a lidar data encoder to extract obstacle information from the lidar's massive point cloud data. Multiple frames of historical scans are compressed into a two-dimensional obstacle map while preserving the required obstacle features. An end-to-end deep neural network is trained to extract the kinematics of dynamic and static obstacles from the obstacle map and to generate acceleration commands that steer the quadrotor away from these obstacles. Our approach combines perception and navigation in a single neural network, which can transition from a navigating state to a hovering state without mode switching. We also present simulations and real-world experiments that show the effectiveness of our approach when navigating in highly dynamic, cluttered environments.
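As a rough illustration of the encoder's output format (the resolution, map size, and occupancy encoding are our assumptions), a few historical scans can be rasterized into a stacked 2D obstacle map like this:

```python
import numpy as np

def scans_to_obstacle_map(scans, res=0.1, size=128):
    """Stack k historical scans into a k-channel 2D occupancy grid centered on
    the robot, so a CNN can infer obstacle kinematics from the temporal stack."""
    grid = np.zeros((len(scans), size, size), dtype=np.float32)
    for t, pts in enumerate(scans):             # pts: (N, 3) in the body frame
        ij = np.clip((pts[:, :2] / res).astype(int) + size // 2, 0, size - 1)
        grid[t, ij[:, 1], ij[:, 0]] = 1.0       # mark occupied cells
    return grid

# usage on synthetic scans
scans = [np.random.uniform(-5, 5, (1000, 3)) for _ in range(4)]
obstacle_map = scans_to_obstacle_map(scans)     # shape (4, 128, 128)
```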
https://arxiv.org/abs/2503.14352
In sawmills, it is essential to accurately measure the raw material, i.e., wooden logs, to optimise the sawing process. Earlier studies have shown that accurate predictions of the inner structure of the logs can be obtained using just surface point clouds produced by a laser scanner. This provides a cost-efficient and fast alternative to X-ray CT-based measurement devices. The essential step in analysing log point clouds is segmentation, as it forms the basis for finding the fine surface details that provide cues about the inner structure of the log. We propose a novel Point Transformer-based point cloud segmentation technique that learns to find the points belonging to the log surface in an unsupervised manner. This is achieved using a loss function that utilises the geometric properties of a cylinder while taking into account the shape variation common in timber logs. We demonstrate the accuracy of the method on wooden logs, but the approach could also be utilised on other cylindrical objects.
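A minimal sketch of such a geometry-driven objective, assuming a known axis and radius and omitting the paper's handling of log shape variation:

```python
import torch

def cylinder_surface_loss(points, axis_point, axis_dir, radius):
    """Penalize the squared deviation of each point's radial distance from the
    cylinder axis against the target radius; points on the surface cost zero.
    (A simplified stand-in: the paper's loss also models shape variation.)"""
    d = axis_dir / axis_dir.norm()                         # unit axis direction
    v = points - axis_point                                # (N, 3) offsets
    radial = (v - (v @ d).unsqueeze(-1) * d).norm(dim=-1)  # distance to axis
    return ((radial - radius) ** 2).mean()

# usage: a noisy ring around the z-axis should yield a small loss
theta = torch.rand(256) * 6.2832
pts = torch.randn(256, 3) * 0.01
pts += torch.stack([theta.cos(), theta.sin(), torch.rand(256)], dim=-1)
loss = cylinder_surface_loss(pts, torch.zeros(3), torch.tensor([0., 0., 1.]), 1.0)
```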
https://arxiv.org/abs/2503.14244
One of the main challenges in point cloud compression (PCC) is how to evaluate the perceived distortion so that the codec can be optimized for perceptual quality. Current standard practices in PCC highlight a primary issue: while single-feature metrics are widely used to assess compression distortion, the classic method of searching point-to-point nearest neighbors frequently fails to build precise correspondences between point clouds, resulting in an ineffective capture of human perceptual features. To overcome these limitations, we propose a novel assessment method called RBFIM, which utilizes radial basis function (RBF) interpolation to convert discrete point features into a continuous feature function for the distorted point cloud. By substituting the geometry coordinates of the original point cloud into this feature function, we obtain bijective sets of point features. This enables the establishment of precise feature correspondences between distorted and original point clouds, significantly improves the accuracy of quality assessments, and avoids the complexity caused by bidirectional searches. Extensive experiments on multiple subjective quality datasets of compressed point clouds demonstrate that our RBFIM excels in addressing human perception tasks, thereby providing robust support for PCC optimization efforts.
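A small sketch of the central step using SciPy's RBFInterpolator (toy data; the feature choice, kernel, and neighbor count are our assumptions):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
distorted_xyz = rng.random((500, 3))      # geometry of the distorted cloud
distorted_feat = rng.random((500, 1))     # a per-point feature, e.g. luminance
original_xyz = rng.random((400, 3))       # geometry of the original cloud

# Fit a continuous feature function over the distorted cloud, then evaluate it
# at the original geometry: each original point gets exactly one corresponding
# feature, with no bidirectional nearest-neighbor search.
feature_fn = RBFInterpolator(distorted_xyz, distorted_feat,
                             neighbors=32, kernel='thin_plate_spline')
matched_feat = feature_fn(original_xyz)   # (400, 1), ready for quality comparison
```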
https://arxiv.org/abs/2503.14154
Accurate body dimension and weight measurements are critical for optimizing poultry management, health assessment, and economic efficiency. This study introduces an innovative deep learning-based model leveraging multimodal data (2D RGB images from different views, depth images, and 3D point clouds) for the non-invasive estimation of duck body dimensions and weight. A dataset of 1,023 Linwu ducks, comprising over 5,000 samples with diverse postures and conditions, was collected to support model training. The proposed method innovatively employs PointNet++ to extract key feature points from point clouds, extracts and computes the corresponding 3D geometric features, and fuses them with multi-view convolutional 2D features. A Transformer encoder is then utilized to capture long-range dependencies and refine feature interactions, thereby enhancing prediction robustness. The model achieved a mean absolute percentage error (MAPE) of 6.33% and an R^2 of 0.953 across eight morphometric parameters, demonstrating strong predictive capability. Unlike conventional manual measurements, the proposed model enables high-precision estimation while eliminating the need for physical handling, thereby reducing animal stress and broadening its application scope. This study marks the first application of deep learning techniques to poultry body dimension and weight estimation, providing a valuable reference for the intelligent and precise management of the livestock industry, with far-reaching practical significance.
https://arxiv.org/abs/2503.14001
To address the underutilization of image information by existing frustum-based methods in road three-dimensional object detection, as well as the lack of research on agricultural scenes, we constructed an object detection dataset in a complex tractor road scene using an 80-line Light Detection and Ranging (LiDAR) sensor and a camera, and propose a new network called FrustumFusionNets (FFNets). Initially, we utilize the results of image-based two-dimensional object detection to narrow down the search region in the three-dimensional space of the point cloud. Next, we introduce a Gaussian mask to enhance the point cloud information. Then, we extract features from the frustum point cloud and the cropped image using the point cloud feature extraction pipeline and the image feature extraction pipeline, respectively. Finally, we concatenate and fuse the features from both modalities to achieve three-dimensional object detection. Experiments demonstrate that on the constructed test set of tractor road data, the FrustumFusionNetv2 achieves 82.28% and 95.68% accuracy in the three-dimensional detection of the two main road objects, cars and people, respectively, which is 1.83% and 2.33% better than the original model. It offers a hybrid fusion-based multi-object, high-precision, real-time three-dimensional object detection technique for unmanned agricultural machines in tractor road scenarios. On the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) Benchmark Suite validation set, the FrustumFusionNetv2 also demonstrates significant superiority in detecting road pedestrian objects compared with other frustum-based three-dimensional object detection methods.
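A sketch of the frustum-cropping step and one plausible form of the Gaussian mask (the exact mask definition and all parameters are our assumptions):

```python
import numpy as np

def frustum_points(points, box2d, K):
    """Keep points whose camera projection lands inside the 2D detection box:
    the image detector narrows the 3D search space to a frustum."""
    uvw = (K @ points.T).T                      # project with intrinsics K (3x3)
    uv = uvw[:, :2] / uvw[:, 2:3]
    u1, v1, u2, v2 = box2d
    inside = ((uv[:, 0] >= u1) & (uv[:, 0] <= u2) &
              (uv[:, 1] >= v1) & (uv[:, 1] <= v2) & (uvw[:, 2] > 0))
    return points[inside]

def gaussian_mask(points, box2d, K, sigma=0.5):
    """Weight each frustum point by a Gaussian on its normalized offset from the
    box center, emphasizing points likely to lie on the object."""
    uvw = (K @ points.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    u1, v1, u2, v2 = box2d
    center = np.array([(u1 + u2) / 2, (v1 + v2) / 2])
    half = np.array([(u2 - u1) / 2, (v2 - v1) / 2])
    d2 = (((uv - center) / half) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))       # per-point weights in (0, 1]

K = np.array([[700., 0., 320.], [0., 700., 240.], [0., 0., 1.]])
cloud = np.random.rand(5000, 3) * [20, 10, 40] - [10, 5, 0]
crop = frustum_points(cloud, (200, 150, 440, 330), K)
weights = gaussian_mask(crop, (200, 150, 440, 330), K)
```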
https://arxiv.org/abs/2503.13951
Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we innovatively leverage the sphere as a shared proxy shape of objects to learn shape-independent transformations via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. First, we endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformations. Second, the spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a comprehensive perspective, thus mitigating the interference of noise and incomplete point clouds. Finally, a hyperbolic correspondence loss function is designed to capture subtle distinctions, which improves the precision of correspondence prediction. Experimental results on the CAMERA25, REAL275, and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of the spherical representations and architectural innovations.
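For intuition, spherical anchors can be generated near-uniformly with a Fibonacci lattice; this is a common construction, and the paper's actual sampling scheme is not specified in the abstract.

```python
import numpy as np

def fibonacci_sphere(n):
    """Near-uniform anchor points on the unit sphere via the Fibonacci lattice."""
    i = np.arange(n) + 0.5
    phi = np.arccos(1.0 - 2.0 * i / n)          # polar angle from +z
    theta = np.pi * (3.0 - 5 ** 0.5) * i        # golden-angle azimuth increments
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)

anchors = fibonacci_sphere(256)                 # (256, 3) shared proxy shape
```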
https://arxiv.org/abs/2503.13926
Self-supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry-sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. We propose PSA-SSL, a novel extension to point cloud SSL that learns object pose and size-aware (PSA) features. Our approach defines a self-supervised bounding box regression pretext task, which retains object pose and size information. Furthermore, we incorporate LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor-agnostic features. Our experiments demonstrate that with a single pretrained model, our light-weight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels across popular autonomous driving datasets (Waymo, nuScenes, SemanticKITTI). Moreover, our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation (using up to 10 times fewer labels), as well as on 3D object detection. Our code will be released at this https URL.
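A simple sketch of what LiDAR beam pattern augmentation can look like: binning points by elevation and keeping an evenly spaced subset of beams. The exact augmentation recipe is our assumption.

```python
import numpy as np

def beam_pattern_augment(points, n_beams=64, n_keep=32):
    """Bin points into elevation 'beams' and keep an evenly spaced subset,
    emulating a sparser sensor (e.g. a 64-beam lidar downsampled to 32 beams)."""
    r = np.linalg.norm(points[:, :3], axis=1) + 1e-9
    elev = np.arcsin(points[:, 2] / r)                       # elevation angle
    edges = np.linspace(elev.min(), elev.max() + 1e-6, n_beams + 1)
    beam_id = np.digitize(elev, edges) - 1                   # 0 .. n_beams-1
    kept = np.linspace(0, n_beams - 1, n_keep).astype(int)
    return points[np.isin(beam_id, kept)]

cloud = np.random.randn(10000, 3)
sparse = beam_pattern_augment(cloud)    # same scene, different beam pattern
```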
https://arxiv.org/abs/2503.13914
3D perception plays a crucial role in real-world applications such as autonomous driving, robotics, and AR/VR. In practical scenarios, 3D perception models must continuously adapt to new data and emerging object categories, but retraining from scratch incurs prohibitive costs. Therefore, adopting class-incremental learning (CIL) becomes particularly essential. However, real-world 3D point cloud data often include corrupted samples, which poses significant challenges for existing CIL methods and leads to more severe forgetting on corrupted data. To address these challenges, we consider the scenario in which a CIL model can be updated using point clouds with unknown corruption to better simulate real-world conditions. Inspired by Farthest Point Sampling, we propose a novel exemplar selection strategy that effectively preserves intra-class diversity when selecting replay exemplars, mitigating forgetting induced by data corruption. Furthermore, we introduce a point cloud downsampling-based replay method to utilize the limited replay buffer memory more efficiently, thereby further enhancing the model's continual learning ability. Extensive experiments demonstrate that our method improves the performance of replay-based CIL baselines by 2% to 11%, proving its effectiveness and promising potential for real-world 3D applications.
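The exemplar selection step can be sketched as farthest-point sampling in feature space; the embedding choice and seeding below are our assumptions.

```python
import numpy as np

def fps_exemplars(features, k, seed=0):
    """Greedy farthest-point sampling in feature space: each pick maximizes the
    minimum distance to already-chosen exemplars, preserving intra-class
    diversity (random sampling tends to over-represent dense, clean modes)."""
    chosen = [seed]
    dist = np.linalg.norm(features - features[seed], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())                # farthest from the current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return np.array(chosen)

class_feats = np.random.randn(1000, 256)  # embeddings of one class's samples
replay_ids = fps_exemplars(class_feats, k=20)
```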
https://arxiv.org/abs/2503.13869
Accurate estimation of total leaf area (TLA) is crucial for evaluating plant growth, photosynthetic activity, and transpiration. However, it remains challenging for bushy plants like dwarf tomatoes due to their complex canopies. Traditional methods are often labor-intensive, damaging to plants, or limited in capturing canopy complexity. This study evaluated a non-destructive method combining sequential 3D reconstructions from RGB images and machine learning to estimate TLA for three dwarf tomato cultivars (Mohamed, Hahms Gelbe Topftomate, and Red Robin) grown under controlled greenhouse conditions. Two experiments (spring-summer and autumn-winter) included 73 plants, yielding 418 TLA measurements via an "onion" approach. High-resolution videos were recorded, and 500 frames per plant were used for 3D reconstruction. Point clouds were meshed using four algorithms (Alpha Shape, Marching Cubes, Poisson surface reconstruction, and Ball Pivoting), and the resulting meshes were evaluated with seven regression models: Multivariable Linear Regression, Lasso Regression, Ridge Regression, Elastic Net Regression, Random Forest, Extreme Gradient Boosting, and Multilayer Perceptron. The Alpha Shape reconstruction ($\alpha = 3$) with Extreme Gradient Boosting achieved the best performance ($R^2 = 0.80$, $MAE = 489 cm^2$). Cross-experiment validation showed robust results ($R^2 = 0.56$, $MAE = 579 cm^2$). Feature importance analysis identified height, width, and surface area as key predictors. This scalable, automated TLA estimation method is suited for urban farming and precision agriculture, offering applications in automated pruning, resource efficiency, and sustainable food production. The approach demonstrated robustness across variable environmental conditions and canopy structures.
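A toy sketch of the best-performing pipeline, alpha-shape meshing (Open3D) followed by an XGBoost regressor; the point counts, units, and placeholder labels are our assumptions, and real training pairs would come from the "onion" TLA measurements.

```python
import numpy as np
import open3d as o3d
from xgboost import XGBRegressor

def plant_features(points, alpha=3.0):
    """Mesh the cloud with an alpha shape and extract the predictors the study
    found most important: height, width, and surface area."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_alpha_shape(pcd, alpha)
    return [np.ptp(points[:, 2]),            # height (cm, assumed units)
            np.ptp(points[:, 0]),            # width
            mesh.get_surface_area()]         # mesh surface area

# toy stand-in data in place of reconstructed plant clouds and measured TLA
clouds = [np.random.rand(2000, 3) * [30, 30, 40] for _ in range(20)]
tla = np.random.rand(20) * 3000              # cm^2, placeholder labels
X = np.array([plant_features(c) for c in clouds])
model = XGBRegressor(n_estimators=200).fit(X, tla)
pred_cm2 = model.predict(X[:1])
```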
https://arxiv.org/abs/2503.13778
Remote sensing novel view synthesis (NVS) offers significant potential for 3D interpretation of remote sensing scenes, with important applications in urban planning and environmental monitoring. However, remote sensing scenes frequently lack sufficient multi-view images due to acquisition constraints. Existing NVS methods tend to overfit when processing limited input views, while advanced few-shot NVS methods are computationally intensive and perform sub-optimally in remote sensing scenes. This paper presents TriDF, an efficient hybrid 3D representation for fast remote sensing NVS from as few as 3 input views. Our approach decouples color and volume density information, modeling them independently to reduce the computational burden on implicit radiance fields and accelerate reconstruction. We explore the potential of the triplane representation in few-shot NVS tasks by mapping high-frequency color information onto this compact structure, and the direct optimization of feature planes significantly speeds up convergence. Volume density is modeled as continuous density fields, incorporating reference features from neighboring views through image-based rendering to compensate for limited input data. Additionally, we introduce depth-guided optimization based on point clouds, which effectively mitigates the overfitting problem in few-shot NVS. Comprehensive experiments across multiple remote sensing scenes demonstrate that our hybrid representation achieves a 30x speed increase compared to NeRF-based methods, while simultaneously improving rendering quality metrics over advanced few-shot methods (7.4% increase in PSNR, 12.2% in SSIM, and 18.7% in LPIPS). The code is publicly available at this https URL
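The triplane lookup at the heart of such a representation is standard and compact enough to sketch; TriDF's exact feature dimensions and fusion are not shown here.

```python
import torch
import torch.nn.functional as F

def triplane_query(planes, xyz):
    """Standard triplane lookup: project a 3D point onto the XY, XZ, and YZ
    feature planes, bilinearly sample each plane, and sum the three samples."""
    # planes: (3, C, H, W); xyz: (N, 3) with coordinates in [-1, 1]
    grids = torch.stack([xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]])  # (3, N, 2)
    feats = F.grid_sample(planes, grids.unsqueeze(2), align_corners=True)  # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).t()                                # (N, C)

planes = torch.randn(3, 16, 64, 64, requires_grad=True)  # directly optimized
colors = triplane_query(planes, torch.rand(1024, 3) * 2 - 1)
```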
https://arxiv.org/abs/2503.13347
Panoptic segmentation of LiDAR point clouds is fundamental to outdoor scene understanding, with autonomous driving being a primary application. While state-of-the-art approaches typically rely on end-to-end deep learning architectures and extensive manual annotations of instances, the significant cost and time investment required for labeling large-scale point cloud datasets remains a major bottleneck in this field. In this work, we demonstrate that competitive panoptic segmentation can be achieved using only semantic labels, with instances predicted without any training or annotations. Our method achieves performance comparable to current state-of-the-art supervised methods on standard benchmarks including SemanticKITTI and nuScenes, and outperforms every publicly available method on SemanticKITTI as a drop-in instance head replacement, while running in real-time on a single-threaded CPU and requiring no instance labels. Our method is fully explainable, and requires no learning or parameter tuning. Code is available at this https URL
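The abstract does not detail the instance head, but as a generic illustration of training-free instance prediction from semantic labels (a stand-in, not the paper's algorithm, which is parameter-free), per-class spatial clustering works like this:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def instances_from_semantics(points, labels, thing_classes, eps=0.7):
    """Cluster the points of each 'thing' class spatially; every cluster
    becomes one instance. No training and no instance labels required."""
    inst = np.zeros(len(points), dtype=int)     # 0 = stuff / no instance
    next_id = 1
    for c in thing_classes:
        mask = labels == c
        if not mask.any():
            continue
        ids = DBSCAN(eps=eps, min_samples=5).fit_predict(points[mask])
        inst[mask] = np.where(ids >= 0, ids + next_id, 0)  # -1 noise -> none
        next_id = max(next_id, inst.max() + 1)
    return inst

pts = np.random.rand(3000, 3) * 50
sem = np.random.randint(0, 3, 3000)
panoptic_ids = instances_from_semantics(pts, sem, thing_classes=[1, 2])
```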
https://arxiv.org/abs/2503.13203
Fully decentralized, safe, and deadlock-free multi-robot navigation in dynamic, cluttered environments is a critical challenge in robotics. Current methods require exact state measurements in order to enforce safety and liveness, e.g., via control barrier functions (CBFs), which is challenging to achieve directly from onboard sensors like lidars and cameras. This work introduces LIVEPOINT, a decentralized control framework that synthesizes universal CBFs over point clouds to enable safe, deadlock-free, real-time multi-robot navigation in dynamic, cluttered environments. Further, LIVEPOINT ensures minimally invasive deadlock avoidance behavior by dynamically adjusting agents' speeds based on a novel symmetric interaction metric. We validate our approach in simulation experiments across highly constrained multi-robot scenarios like doorways and intersections. Results demonstrate that LIVEPOINT achieves zero collisions or deadlocks and a 100% success rate in challenging settings, whereas optimization-based baselines such as MPC and ORCA and neural methods such as MPNet fail in such environments. Despite prioritizing safety and liveness, LIVEPOINT is 35% smoother than baselines in the doorway environment, and maintains agility in constrained environments while still being safe and deadlock-free.
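The flavor of a point-cloud CBF can be shown with the textbook single-integrator case; LIVEPOINT's universal CBF synthesis and symmetric interaction metric are beyond this sketch.

```python
import numpy as np

def safe_velocity(x, u_des, cloud, r=0.5, alpha=1.0):
    """Closed-form CBF filter for a single-integrator robot: with
    h(x) = ||x - p||^2 - r^2 for the nearest observed point p, enforce
    dh/dt >= -alpha * h by minimally correcting the desired velocity."""
    p = cloud[np.argmin(np.linalg.norm(cloud - x, axis=1))]  # closest point
    h = (x - p) @ (x - p) - r ** 2
    g = 2.0 * (x - p)                      # gradient of h
    slack = g @ u_des + alpha * h
    if slack >= 0:
        return u_des                       # desired command is already safe
    return u_des - slack * g / (g @ g)     # project onto the constraint boundary

x = np.array([0.0, 0.0])
u = safe_velocity(x, np.array([1.0, 0.0]), cloud=np.array([[0.6, 0.0]]))
```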
https://arxiv.org/abs/2503.13098
Surgical domain models improve workflow optimization through automated predictions of each staff member's surgical role. However, mounting evidence indicates that team familiarity and individuality impact surgical outcomes. We present a novel staff-centric modeling approach that characterizes individual team members through their distinctive movement patterns and physical characteristics, enabling long-term tracking and analysis of surgical personnel across multiple procedures. To address the challenge of inter-clinic variability, we develop a generalizable re-identification framework that encodes sequences of 3D point clouds to capture shape and articulated motion patterns unique to each individual. Our method achieves 86.19% accuracy on realistic clinical data while maintaining 75.27% accuracy when transferring between different environments - a 12% improvement over existing methods. When used to augment markerless personnel tracking, our approach improves accuracy by over 50%. Through extensive validation across three datasets and the introduction of a novel workflow visualization technique, we demonstrate how our framework can reveal novel insights into surgical team dynamics and space utilization patterns, advancing methods to analyze surgical workflows and team coordination.
https://arxiv.org/abs/2503.13028
Diffusion models currently demonstrate impressive performance over various generative tasks. Recent work on image diffusion highlights the strong capabilities of Mamba (state space models) due to its efficient handling of long-range dependencies and sequential data modeling. Unfortunately, joint consideration of state space models with 3D point cloud generation remains limited. To harness the powerful capabilities of the Mamba model for 3D point cloud generation, we propose a novel diffusion framework containing a dual latent Mamba block (DM-Block) and a time-variant frequency encoder (TF-Encoder). The DM-Block applies a space-filling curve to reorder points into sequences suitable for Mamba state-space modeling, while operating in a latent space to mitigate the computational overhead that arises from direct 3D data processing. Meanwhile, the TF-Encoder takes advantage of the diffusion model's ability to refine fine details in later recovery stages by prioritizing key points within the U-Net architecture. This frequency-based mechanism ensures enhanced detail quality in the final stages of generation. Experimental results on the ShapeNet-v2 dataset demonstrate that our method achieves state-of-the-art performance (ShapeNet-v2: 0.14\% on 1-NNA-Abs50 EMD and 57.90\% on COV EMD) on certain metrics for specific categories, while reducing computational parameters and inference time by up to 10$\times$ and 9$\times$, respectively. Source code is available in the Supplementary Materials and will be released upon acceptance.
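Point serialization with a space-filling curve can be illustrated with a Morton (Z-order) encoding; whether the DM-Block uses Morton, Hilbert, or another curve is not specified in the abstract.

```python
import numpy as np

def _part1by2(x):
    """Spread the 10 low bits of x so two zero bits separate each original bit."""
    x &= np.uint32(0x000003FF)
    x = (x ^ (x << 16)) & np.uint32(0xFF0000FF)
    x = (x ^ (x << 8)) & np.uint32(0x0300F00F)
    x = (x ^ (x << 4)) & np.uint32(0x030C30C3)
    x = (x ^ (x << 2)) & np.uint32(0x09249249)
    return x

def morton_order(points):
    """Quantize to a 10-bit grid and sort by the interleaved Morton code, so
    spatially nearby points become nearby in the 1D sequence."""
    p = points - points.min(axis=0)
    p = (p / (p.max() + 1e-9) * 1023).astype(np.uint32)
    code = (_part1by2(p[:, 2]) << 2) | (_part1by2(p[:, 1]) << 1) | _part1by2(p[:, 0])
    return np.argsort(code)

cloud = np.random.rand(2048, 3)
serialized = cloud[morton_order(cloud)]   # Mamba-ready 1D sequence
```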
https://arxiv.org/abs/2503.13004
Multimodal 3D object detectors leverage the strengths of both geometry-aware LiDAR point clouds and semantically rich RGB images to enhance detection performance. However, the inherent heterogeneity between these modalities, including unbalanced convergence and modal misalignment, poses significant challenges. Meanwhile, the large size of detection-oriented features also prevents existing fusion strategies from capturing long-range dependencies for 3D detection tasks. In this work, we introduce a fast yet effective multimodal 3D object detector, incorporating our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained cross-modal consistency. Meanwhile, CLFM presents an efficient and scalable fusion strategy that enhances cross-modal global interactions within sizable multimodal BEV features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks demonstrate the effectiveness of our methods. Notably, our 3D object detector outperforms state-of-the-art (SOTA) methods while achieving superior efficiency. The implementation of our method has been released as open-source at: this https URL.
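CLFM's exact formulation is not given in the abstract, but standard kernelized linear attention conveys why cross-modal fusion over sizable BEV maps becomes affordable: cost grows linearly, not quadratically, in token count.

```python
import torch
import torch.nn.functional as F

def linear_cross_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention sketch with phi = elu + 1 feature maps:
    summarize keys/values once, then read out per query in O(N) overall."""
    phi_q = F.elu(q) + 1                              # (B, Nq, d)
    phi_k = F.elu(k) + 1                              # (B, Nk, d)
    kv = torch.einsum('bnd,bne->bde', phi_k, v)       # key-value summary
    z = 1.0 / (torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', phi_q, kv, z)

img_feats = torch.randn(2, 4096, 64)   # flattened image BEV tokens
pts_feats = torch.randn(2, 4096, 64)   # flattened LiDAR BEV tokens
fused = linear_cross_attention(pts_feats, img_feats, img_feats)
```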
https://arxiv.org/abs/2503.12914
Large-scale scene point cloud registration with limited overlap is a challenging task due to computational load and constrained data acquisition. To tackle these issues, we propose a point cloud registration method, MT-PCR, based on modality transformation. MT-PCR leverages a bird's-eye-view (BEV) representation that captures the maximal overlap information to improve accuracy, and utilizes images to provide complementary spatial features. Specifically, MT-PCR converts 3D point clouds to BEV images and estimates correspondences via 2D image keypoint extraction and matching. The 2D correspondence estimates are then transformed back to 3D point clouds using inverse mapping. We have applied MT-PCR to Terrestrial Laser Scanning and Aerial Laser Scanning point cloud registration on the GrAco dataset, involving 8 low-overlap, square-kilometer-scale registration scenarios. Experiments and comparisons with commonly used methods demonstrate that MT-PCR achieves superior accuracy and robustness in large-scale scenes with limited overlap.
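A minimal sketch of the modality transformation: height-encoded BEV rasterization plus off-the-shelf 2D keypoint matching. ORB is our stand-in for the paper's unspecified keypoint method, and the grid parameters are assumptions.

```python
import numpy as np
import cv2

def cloud_to_bev(points, res=1.0, size=512):
    """Rasterize a point cloud into a top-down height image: cell intensity
    encodes the normalized maximum z, giving 2D structure for keypoint matching."""
    img = np.zeros((size, size), dtype=np.uint8)
    ij = ((points[:, :2] - points[:, :2].min(axis=0)) / res).astype(int)
    keep = (ij < size).all(axis=1)
    ij, z = ij[keep], points[keep, 2]
    z = (255 * (z - z.min()) / (np.ptp(z) + 1e-9)).astype(np.uint8)
    np.maximum.at(img, (ij[:, 1], ij[:, 0]), z)
    return img

bev_a = cloud_to_bev(np.random.rand(5000, 3) * [800, 800, 30])
bev_b = cloud_to_bev(np.random.rand(5000, 3) * [800, 800, 30])
orb = cv2.ORB_create()
k1, d1 = orb.detectAndCompute(bev_a, None)
k2, d2 = orb.detectAndCompute(bev_b, None)
if d1 is not None and d2 is not None:      # random rasters may lack keypoints
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    # each 2D match maps back to 3D through the known cell -> (x, y) grid transform
```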
https://arxiv.org/abs/2503.12833
Accurately modeling sound propagation in complex real-world environments is essential for Novel View Acoustic Synthesis (NVAS). While previous studies have leveraged visual perception to estimate spatial acoustics, the combined use of surface normals and structural details from 3D representations in acoustic modeling has been underexplored. Given their direct impact on sound wave reflections and propagation, surface normals should be jointly modeled with structural details to achieve accurate spatial acoustics. In this paper, we propose a surface-enhanced, geometry-aware approach for NVAS to improve spatial acoustic modeling. To achieve this, we exploit geometric priors such as images, depth maps, surface normals, and point clouds obtained using a 3D Gaussian Splatting (3DGS) based framework. We introduce a dual cross-attention-based transformer that integrates geometrical constraints into the frequency query to understand the surroundings of the emitter. Additionally, we design a ConvNeXt-based spectral feature processing network, called the Spectral Refinement Network (SRN), to synthesize realistic binaural audio. Experimental results on the RWAVS and SoundSpace datasets highlight the effectiveness of our approach, as it surpasses existing methods in novel view acoustic synthesis.
https://arxiv.org/abs/2503.12806
We present ProtoDepth, a novel prototype-based approach for continual learning of unsupervised depth completion, the multimodal 3D reconstruction task of predicting dense depth maps from RGB images and sparse point clouds. The unsupervised learning paradigm is well-suited for continual learning, as ground truth is not needed. However, when trained on new non-stationary distributions, depth completion models catastrophically forget previously learned information. We address forgetting by learning prototype sets that adapt the latent features of a frozen pretrained model to new domains. Since the original weights are not modified, ProtoDepth does not forget when the test-time domain identity is known. To extend ProtoDepth to the challenging setting where the test-time domain identity is withheld, we propose to learn domain descriptors that enable the model to select the appropriate prototype set for inference. We evaluate ProtoDepth on benchmark dataset sequences, where we reduce forgetting compared to baselines by 52.2% for indoor and 53.2% for outdoor, achieving the state of the art.
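One way to realize the idea (our reading; the attention-based adaptation and hyperparameters are assumptions, not the paper's exact design) is a per-domain prototype set that residually adapts frozen features:

```python
import torch
import torch.nn as nn

class PrototypeAdapter(nn.Module):
    """One learned prototype set per domain: frozen backbone features attend to
    the prototypes and receive a residual adaptation. The backbone weights are
    never updated, so earlier domains cannot be overwritten."""
    def __init__(self, dim, n_protos=16, n_heads=4):
        super().__init__()
        self.protos = nn.Parameter(torch.randn(n_protos, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, frozen_feats):                  # (B, N, dim)
        p = self.protos.unsqueeze(0).expand(frozen_feats.size(0), -1, -1)
        delta, _ = self.attn(frozen_feats, p, p)      # features query prototypes
        return frozen_feats + delta                   # domain-adapted features

adapters = {"indoor": PrototypeAdapter(256), "outdoor": PrototypeAdapter(256)}
feats = torch.randn(2, 1024, 256)                     # from the frozen encoder
adapted = adapters["outdoor"](feats)                  # set chosen by (inferred) domain
```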
https://arxiv.org/abs/2503.12745
Autonomous driving is a safety-critical application, and it is therefore a top priority that the accompanying assistance systems are able to provide precise information about the surrounding environment of the vehicle. Tasks such as 3D Object Detection deliver an insufficiently detailed understanding of the surrounding scene because they only predict a bounding box for foreground objects. In contrast, 3D Semantic Segmentation provides richer and denser information about the environment by assigning a label to each individual point, which is of paramount importance for autonomous driving tasks, such as navigation or lane changes. To inspire future research, in this review paper, we provide a comprehensive overview of the current state-of-the-art methods in the field of Point Cloud Semantic Segmentation for autonomous driving. We categorize the approaches into projection-based, 3D-based and hybrid methods. Moreover, we discuss the most important and commonly used datasets for this task and also emphasize the importance of synthetic data to support research when real-world data is limited. We further present the results of the different methods and compare them with respect to their segmentation accuracy and efficiency.
https://arxiv.org/abs/2503.12595