3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
https://arxiv.org/abs/2602.10101
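The global similarity alignment can be made concrete with a classical closed-form baseline. As a hedged sketch (Robo3R learns this transform end to end; the Umeyama solution below is only a stand-in), the scale, rotation, and translation mapping a predicted point set onto corresponding points in the robot frame can be recovered as:

```python
import numpy as np

def similarity_transform(src, dst):
    """Closed-form scale/rotation/translation (Umeyama) mapping src -> dst.
    src, dst: (N, 3) arrays of corresponding points."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance of the two sets
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))         # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

With noisy correspondences the same closed form gives the least-squares similarity transform, which is also a common initialization for PnP-style refinement.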
3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, \textit{VIDA}, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on \textit{VIDA}, we propose a strong baseline: VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a \textit{spatial-aware} loss function to enable VideoAfford to obtain comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.
https://arxiv.org/abs/2602.09638
Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$), but degrades accuracy in higher-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a cold-start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20 ft, 40 ft, and 53 ft containers) entirely without costly training or fine-tuning, significantly reducing the initial manual labeling burden and making the method practical for ITS applications.
https://arxiv.org/abs/2602.09425
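As a hedged sketch of the depth-encoded visual proxy idea (the axis conventions, resolution, and intensity mapping below are assumptions; the actual pipeline adds denoising, registration, rectification, morphology, and smoothing):

```python
import numpy as np

def depth_proxy(points, res=0.1, shape=(64, 128)):
    """Rasterize a LiDAR point cloud (x: along-vehicle, y: lateral, z: up)
    into a side-view image whose intensity encodes lateral depth.
    Toy sketch only: axis layout and 0.1 m resolution are assumptions."""
    H, W = shape
    img = np.zeros(shape, dtype=np.float32)
    cols = (points[:, 0] / res).astype(int)            # along-vehicle axis
    rows = H - 1 - (points[:, 2] / res).astype(int)    # height, image row 0 on top
    ok = (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
    y = points[ok, 1]
    # map lateral depth to [0.5, 1]: nearer points render brighter
    depth = 1.0 - 0.5 * (y - y.min()) / (np.ptp(y) + 1e-9)
    for r, c, d in zip(rows[ok], cols[ok], depth):
        img[r, c] = max(img[r, c], d)   # keep the brightest point per pixel
    return img
```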
This work presents a finite-time stable pose estimator (FTS-PE) for rigid bodies undergoing rotational and translational motion in three dimensions, using measurements from onboard sensors that provide position vectors to inertially-fixed points and body velocities. The FTS-PE is a full-state observer for the pose (position and orientation) and velocities and is obtained through a Lyapunov analysis that shows its stability in finite time and its robustness to bounded measurement noise. Further, this observer is designed directly on the state space, the tangent bundle of the Lie group of rigid body motions, SE(3), without using local coordinates or (dual) quaternion representations. Therefore, it can estimate arbitrary rigid body motions without encountering singularities or the unwinding phenomenon and be readily applied to autonomous vehicles. A version of this observer that does not need translational velocity measurements and uses only point clouds and angular velocity measurements from rate gyros, is also obtained. It is discretized using the framework of geometric mechanics for numerical and experimental implementations. The numerical simulations compare the FTS-PE with a dual-quaternion extended Kalman filter and our previously developed variational pose estimator (VPE). The experimental results are obtained using point cloud images and rate gyro measurements obtained from a Zed 2i stereo depth camera sensor. These results validate the stability and robustness of the FTS-PE.
https://arxiv.org/abs/2602.09414
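The geometric-mechanics discretization matters because naive Euler integration of R&#775; = R hat(&#969;) drifts off SO(3). A minimal sketch of the underlying idea (singularity-free attitude propagation via the exponential map; this illustrates the discretization principle, not the FTS-PE observer itself):

```python
import numpy as np

def hat(w):
    """Map R^3 to so(3): hat(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Matrix exponential of hat(w) via the Rodrigues formula."""
    th = np.linalg.norm(w)
    K = hat(w)
    if th < 1e-12:
        return np.eye(3) + K   # first-order fallback near zero rotation
    return np.eye(3) + np.sin(th) / th * K + (1 - np.cos(th)) / th**2 * (K @ K)

def integrate_attitude(R0, omegas, dt):
    """Propagate R_{k+1} = R_k expm(hat(omega_k) * dt); the estimate stays
    on SO(3) to machine precision, with no local coordinates or unwinding."""
    R = R0.copy()
    for w in omegas:
        R = R @ so3_exp(np.asarray(w) * dt)
    return R
```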
A 3D understanding of anatomy is central to diagnosis and treatment planning, yet volumetric imaging remains costly with long wait times. Image-to-3D foundation models could address this by reconstructing 3D data from 2D modalities. Current foundation models are trained on natural image distributions to reconstruct naturalistic objects from a single image by leveraging geometric priors across pixels. However, it is unclear whether these learned geometric priors transfer to medical data. In this study, we present a controlled zero-shot benchmark of single-slice medical image-to-3D reconstruction across five state-of-the-art image-to-3D models: SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG. These are evaluated across six medical datasets spanning anatomical and pathological structures and two natural datasets, using voxel-based metrics and point cloud distance metrics. Across medical datasets, voxel-based overlap remains moderate for all models, consistent with a depth reconstruction failure mode when inferring volume from a single slice. In contrast, global distance metrics show more separation between methods: SAM3D achieves the strongest overall topological similarity to ground-truth medical 3D data, while alternative models are more prone to over-simplification of the reconstruction. Our results quantify the limits of single-slice medical reconstruction and highlight depth ambiguity caused by the planar nature of 2D medical data, motivating multi-view image-to-3D reconstruction to enable reliable medical 3D inference.
https://arxiv.org/abs/2602.09407
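As a concrete illustration of the point cloud distance metrics used in such benchmarks, a minimal symmetric chamfer distance (a brute-force sketch; benchmark implementations typically use KD-trees for large clouds):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```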
We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers fundamental density-stratification. Dense regions exhibit geometry-correlated parameters amenable to render-free prediction, while sparse regions show systematic failure across architectures. We formalize this through variance decomposition, demonstrating that visibility heterogeneity creates covariance-dominated coupling between geometric and appearance parameters in sparse regions. This reveals the dual character of RORs: geometric primitives where point clouds suffice, and view synthesis primitives where multi-view constraints are essential. We provide density-aware strategies that improve training robustness and discuss architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.
https://arxiv.org/abs/2602.08909
Point cloud is a prevalent 3D data representation format with significant application value in immersive media, autonomous driving, digital heritage protection, etc. However, the large data size of point clouds poses challenges to transmission and storage, which hinders their wide deployment. Therefore, point cloud compression plays a crucial role in practical applications for both human and machine perception optimization. To this end, the Moving Picture Experts Group (MPEG) has established two standards for point cloud compression: Geometry-based Point Cloud Compression (G-PCC) and Video-based Point Cloud Compression (V-PCC). In the meantime, the Audio Video coding Standard (AVS) Workgroup of China has also launched and completed the development of its first-generation point cloud compression standard, namely AVS PCC. This new standardization effort has adopted many new coding tools and techniques that differ from the counterpart standards. This paper reviews the AVS PCC standard from two perspectives: the related technologies and performance comparisons.
https://arxiv.org/abs/2602.08613
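As a toy illustration of the geometry-coding idea shared by these standards (this is not any standard's actual syntax), octree-style coding starts by quantizing points to a grid of 2^depth cells per axis and keeping one code per occupied voxel:

```python
import numpy as np

def voxelize(points, depth, bbox_min, bbox_max):
    """Quantize points to an octree grid of 2**depth cells per axis and
    deduplicate occupied cells -- the core lossy step of geometry coding.
    Returns the occupied-cell indices and the decoded voxel centers."""
    bbox_min = np.asarray(bbox_min, dtype=float)
    bbox_max = np.asarray(bbox_max, dtype=float)
    n = 2 ** depth
    scale = (bbox_max - bbox_min) / n
    idx = np.floor((points - bbox_min) / scale).astype(int)
    idx = np.clip(idx, 0, n - 1)
    occupied = np.unique(idx, axis=0)             # one code per occupied voxel
    recon = bbox_min + (occupied + 0.5) * scale   # decode to voxel centers
    return occupied, recon
```

Deeper trees shrink the voxels and the reconstruction error, at the cost of more occupied cells to encode.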
Object-level segmentation in dynamic 4D Gaussian scenes remains challenging due to complex motion, occlusions, and ambiguous boundaries. In this paper, we present an efficient learning-free 4D Gaussian segmentation framework that lifts video segmentation masks to 4D spaces, whose core is a two-stage iterative boundary refinement, TIBR4D. The first stage is an Iterative Gaussian Instance Tracing (IGIT) at the temporal segment level. It progressively refines Gaussian-to-instance probabilities through iterative tracing, and extracts corresponding Gaussian point clouds that better handle occlusions and preserve completeness of object structures compared to existing one-shot threshold-based methods. The second stage is a frame-wise Gaussian Rendering Range Control (RCC) via suppressing highly uncertain Gaussians near object boundaries while retaining their core contributions for more accurate boundaries. Furthermore, a temporal segmentation merging strategy is proposed for IGIT to balance identity consistency and dynamic awareness. Longer segments enforce stronger multi-frame constraints for stable identities, while shorter segments allow identity changes to be captured promptly. Experiments on HyperNeRF and Neu3D demonstrate that our method produces accurate object Gaussian point clouds with clearer boundaries and higher efficiency compared to SOTA methods.
https://arxiv.org/abs/2602.08540
Event cameras have been widely adopted in safety-critical domains such as autonomous driving, robotics, and human-computer interaction. A pressing challenge arises from the vulnerability of deep neural networks to adversarial examples, which poses a significant threat to the reliability of event-based systems. Nevertheless, research into adversarial attacks on events is scarce. This is primarily due to the non-differentiable nature of mainstream event representations, which hinders the extension of gradient-based attack methods. In this paper, we propose MA-ADV, a novel \textbf{M}otion-\textbf{A}ware \textbf{Adv}ersarial framework. To the best of our knowledge, this is the first work to generate adversarial events by leveraging point cloud representations. MA-ADV accounts for high-frequency noise in events and employs a diffusion-based approach to smooth perturbations, while fully leveraging the spatial and temporal relationships among events. Finally, MA-ADV identifies the minimal-cost perturbation through a combination of sample-wise Adam optimization, iterative refinement, and binary search. Extensive experimental results validate that MA-ADV ensures a 100\% attack success rate with minimal perturbation cost, and also demonstrate enhanced robustness against defenses, underscoring the critical security challenges facing future event-based perception systems.
https://arxiv.org/abs/2602.08230
Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
https://arxiv.org/abs/2602.07891
The accuracy of 3D models created from medical scans depends on imaging hardware, segmentation methods, mesh processing techniques, etc. The effects of geometry type, class imbalance, and voxel and point cloud alignment on accuracy remain to be thoroughly explored. This work evaluates the errors across the reconstruction pipeline and explores the use of voxel- and surface-based accuracy metrics for different segmentation algorithms and geometry types. A sphere, a facemask, and an abdominal aortic aneurysm (AAA) model were printed using the SLA technique and scanned using a micro-CT machine. Segmentation was performed using GMM-, Otsu-, and RG-based methods. Segmented and reference models, aligned using the KU algorithm, were quantitatively compared using metrics such as the Dice score, Jaccard index, and precision. Surface meshes were registered to reference meshes using an ICP-based alignment process, and metrics such as the chamfer distance and average Hausdorff distance were evaluated. The Otsu method was found to be the most suitable for all the geometries. The AAA yielded low overlap scores due to its small wall thickness and misalignment. The effect of class imbalance on specificity was most pronounced for the AAA. Surface-based accuracy metrics differed from the voxel-based trends: the RG method performed best for the sphere, while GMM and Otsu performed better for the AAA. The facemask surface was most error-prone, possibly due to misalignment during the ICP process. Segmentation accuracy is a cumulative sum of errors across the different stages of the reconstruction process. High voxel-based accuracy metrics may be misleading in cases of high class imbalance and sensitivity to alignment. The Jaccard index is found to be more stringent than the Dice score and more suitable for accuracy assessment of thin-walled structures. Voxel and point cloud alignment should be ensured before making any reliable assessment of the reconstruction pipeline.
https://arxiv.org/abs/2602.07658
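The Dice/Jaccard relationship underlying the "more stringent" finding is J = D / (2 - D): the Jaccard index is always the smaller of the two, and the gap widens as overlap degrades, which is exactly what happens when a thin wall is shifted by a voxel. A minimal sketch:

```python
import numpy as np

def dice_jaccard(pred, ref):
    """Voxel-overlap metrics for boolean masks; J = D / (2 - D) always holds."""
    pred, ref = np.asarray(pred, bool), np.asarray(ref, bool)
    inter = np.logical_and(pred, ref).sum()
    dice = 2.0 * inter / (pred.sum() + ref.sum())
    jaccard = inter / np.logical_or(pred, ref).sum()
    return dice, jaccard
```

For a thin wall shifted by one voxel (`pred = [1,1,0,0]` vs `ref = [0,1,1,0]`), Dice is 0.5 while Jaccard drops to 1/3.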
Large-scale, high-resolution forest canopy height mapping plays a crucial role in understanding regional and global carbon and water cycles. Spaceborne LiDAR missions, including the Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2) and the Global Ecosystem Dynamics Investigation (GEDI), provide global observations of forest structure but are spatially sparse and subject to inherent uncertainties. In contrast, near-surface LiDAR platforms, such as airborne and unmanned aerial vehicle (UAV) LiDAR systems, offer much finer measurements of forest canopy structure, and a growing number of countries have made these datasets openly available. In this study, a state-of-the-art monocular depth estimation model, Depth Anything V2, was trained using approximately 16,000 km2 of canopy height models (CHMs) derived from publicly available airborne LiDAR point clouds and related products across multiple countries, together with 3 m resolution PlanetScope and airborne RGB imagery. The trained model, referred to as Depth2CHM, enables the estimation of spatially continuous CHMs directly from PlanetScope RGB imagery. Independent validation was conducted at sites in China (approximately 1 km2) and the United States (approximately 116 km2). The results showed that Depth2CHM could accurately estimate canopy height, with biases of 0.59 m and 0.41 m and root mean square errors (RMSEs) of 2.54 m and 5.75 m for these two sites, respectively. Compared with an existing global meter-resolution CHM product, the mean absolute error is reduced by approximately 1.5 m and the RMSE by approximately 2 m. These results demonstrated that monocular depth estimation networks trained with large-scale airborne LiDAR-derived canopy height data provide a promising and scalable pathway for high-resolution, spatially continuous forest canopy height estimation from satellite RGB imagery.
https://arxiv.org/abs/2602.06503
Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.
https://arxiv.org/abs/2602.06419
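The asymmetric fusion, where geometric features query semantic content, can be sketched as single-head cross-attention (shapes, weights, and token counts below are illustrative, not the paper's architecture):

```python
import numpy as np

def cross_attention(geo, sem, Wq, Wk, Wv):
    """Single-head cross-attention: geometric tokens (N, d) query semantic
    tokens (M, d). The output has one row per geometric token."""
    Q, K, V = geo @ Wq, sem @ Wk, sem @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)        # row-wise softmax over semantic tokens
    return w @ V
```

Because the queries come from geometry, bottom-up distinctiveness decides *where* to look, while the retrieved values carry top-down semantic content.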
Task-oriented handovers (TOH) are fundamental to effective human-robot collaboration, requiring robots to present objects in a way that supports the human's intended post-handover use. Existing approaches are typically based on object- or task-specific affordances, but their ability to generalize to novel scenarios is limited. To address this gap, we present AFT-Handover, a framework that integrates large language model (LLM)-driven affordance reasoning with efficient texture-based affordance transfer to achieve zero-shot, generalizable TOH. Given a novel object-task pair, the method retrieves a proxy exemplar from a database, establishes part-level correspondences via LLM reasoning, and texturizes affordances for feature-based point cloud transfer. We evaluate AFT-Handover across diverse task-object pairs, showing improved handover success rates and stronger generalization compared to baselines. In a comparative user study, our framework is significantly preferred over the current state-of-the-art, effectively reducing human regrasping before tool use. Finally, we demonstrate TOH on legged manipulators, highlighting the potential of our framework for real-world robot-human handovers.
https://arxiv.org/abs/2602.05760
We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper's opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments. Code available at this https URL.
https://arxiv.org/abs/2602.05557
Robust 3D point cloud classification is often pursued by scaling up backbones or relying on specialized data augmentation. We instead ask whether structural abstraction alone can improve robustness, and study a simple topology-inspired decomposition based on the Mapper algorithm. We propose Mapper-GIN, a lightweight pipeline that partitions a point cloud into overlapping regions using Mapper (a PCA lens and cubical cover, followed by density-based clustering), constructs a region graph from their overlaps, and performs graph classification with a Graph Isomorphism Network. On the corruption benchmark ModelNet40-C, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters. In contrast to prior approaches that require heavier architectures or additional mechanisms to gain robustness, Mapper-GIN attains strong corruption robustness through simple region-level graph abstraction and GIN message passing. Overall, our results suggest that region-graph structure offers an efficient and interpretable source of robustness for 3D visual recognition.
https://arxiv.org/abs/2602.05522
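A minimal, dependency-free sketch of the Mapper decomposition described above (1-D PCA lens, overlapping interval cover, per-interval clustering, and a region graph from overlaps). Single-linkage clustering stands in for the density-based clustering used in the paper, and all parameter values are illustrative:

```python
import numpy as np
from itertools import combinations

def mapper_region_graph(points, n_intervals=5, overlap=0.3, link_dist=0.5):
    """Toy Mapper: project onto the first PCA axis, cover the lens range with
    overlapping intervals, cluster inside each interval, and connect clusters
    that share points. Returns (list of point-index sets, set of edges)."""
    X = points - points.mean(axis=0)
    lens = X @ np.linalg.svd(X, full_matrices=False)[2][0]   # 1-D PCA lens
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_intervals
    node_members = []
    for i in range(n_intervals):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((lens >= a) & (lens <= b))[0]
        node_members.extend(_single_linkage(points[idx], idx, link_dist))
    edges = {(i, j) for i, j in combinations(range(len(node_members)), 2)
             if node_members[i] & node_members[j]}
    return node_members, edges

def _single_linkage(pts, idx, eps):
    """Connected components of the eps-neighborhood graph, as index sets."""
    if len(idx) == 0:
        return []
    adj = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1) <= eps
    seen, comps = set(), []
    for s in range(len(idx)):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(np.where(adj[u])[0])
        seen |= comp
        comps.append({int(idx[u]) for u in comp})
    return comps
```

The resulting region graph is what a GIN-style classifier would consume in place of the raw points.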
LiDAR-based 3D object detectors often struggle to detect far-field objects due to the sparsity of point clouds at long ranges, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points leads to increased computational cost and introduces challenges in effectively fusing real and virtual information. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. Experiments on the KITTI benchmark report 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.
https://arxiv.org/abs/2602.06406
With the advancement of 3D scanning technologies, point clouds have become fundamental for representing 3D spatial data, with applications spanning various scientific and technological fields. Practical analysis of this data depends crucially on neighborhood descriptors that accurately characterize the local geometry of the point cloud. This paper introduces LitS, a novel neighborhood descriptor for 2D and 3D point clouds. A LitS is a piecewise constant function on the unit circle that allows a point to keep track of its surroundings: each element of its domain represents a direction with respect to a local reference system, and, once constructed, evaluating the LitS at any given direction returns the number of neighbors in a cone-like region centered around that direction. LitS thus conveys rich information about the local neighborhood of a point, which can be leveraged to gain global structural understanding by analyzing how LitS changes between nearby points. In addition, LitS comes in two versions ('regular' and 'cumulative') and has two parameters, allowing it to adapt to various contexts and types of point clouds. Overall, LitS is a versatile neighborhood descriptor, capable of capturing the nuances of local point arrangements and resilient to common point cloud data issues such as variable density and noise.
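To make the construction concrete, the following NumPy sketch builds a LitS-like descriptor for a 2D point: directions to neighbors are binned into angular sectors, and evaluating the descriptor at a direction returns the neighbor count of the sector containing it. The function names, the uniform sector discretization, and the exposed parameters (`n_sectors`, `cumulative`) are illustrative stand-ins for the paper's exact definition.

```python
import numpy as np

def lits_descriptor(center, neighbors, n_sectors=8, cumulative=False):
    """Piecewise-constant function on the unit circle: sector s counts
    the neighbors whose direction from `center` falls in that bin."""
    d = neighbors - center
    angles = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2 * np.pi)
    sector = (angles / (2 * np.pi / n_sectors)).astype(int) % n_sectors
    hist = np.bincount(sector, minlength=n_sectors).astype(float)
    if cumulative:
        hist = np.cumsum(hist)  # 'cumulative' variant
    return hist

def evaluate(descriptor, direction_angle, n_sectors=8):
    """Evaluate at a direction: neighbor count of the cone-like sector
    containing that direction."""
    s = int(np.mod(direction_angle, 2 * np.pi) / (2 * np.pi / n_sectors))
    return descriptor[s % n_sectors]
```

Comparing such histograms between nearby points is one way the descriptor could support the global structural analysis mentioned above.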
https://arxiv.org/abs/2602.04838
Following crop growth through the vegetative cycle allows farmers to predict fruit set and yield at early stages, but it is a laborious and non-scalable task when performed by a human who must manually measure fruit sizes with a caliper or dendrometers. In recent years, computer vision has been used to automate several tasks in precision agriculture, such as detecting and counting fruits and estimating their size. However, the fundamental problem of matching the exact same fruits between a video collected on one date and a video collected on a later date, which is needed to track fruit growth through time, remains unsolved. The few existing attempts either assume that the camera always starts from the same known position and that there are sufficiently distinct features to match, or rely on additional data sources such as GPS. Here we propose a new paradigm to tackle this problem, based on constellations of 3D centroids, and introduce a descriptor for very sparse 3D point clouds that can be used to match fruits across videos. Matching constellations instead of individual fruits is key to dealing with non-rigidity, occlusions, and challenging imagery with few distinct visual features to track. The results show that the proposed method can successfully match fruits across videos and through time, and can also build an orchard map and later use it to locate the camera pose in 6DoF, thus providing a basis for autonomous robot navigation in the orchard and for selective fruit picking, for example.
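A minimal sketch of the idea of matching constellations of 3D centroids rather than individual fruits: each centroid is described by the sorted distances to its k nearest fellow centroids (invariant to rigid motion of the whole constellation), and centroids are matched across acquisitions by nearest signature. The descriptor and the greedy matcher below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def constellation_signature(points, k=3):
    """Per-centroid signature: sorted distances to its k nearest
    fellow centroids (rotation/translation invariant)."""
    sig = []
    for p in points:
        d = np.linalg.norm(points - p, axis=1)
        sig.append(np.sort(d)[1:k + 1])  # drop the zero self-distance
    return np.array(sig)

def match_constellations(points_a, points_b, k=3):
    """Greedily match each centroid in A to the centroid in B with the
    nearest signature (Euclidean distance in signature space)."""
    sa = constellation_signature(points_a, k)
    sb = constellation_signature(points_b, k)
    return [int(np.argmin(np.linalg.norm(sb - s, axis=1))) for s in sa]
```

Because the signature depends only on inter-centroid distances, it survives camera motion between videos; robustness to occlusion and fruit growth would require the richer descriptor proposed in the paper.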
https://arxiv.org/abs/2602.04722
Recent progress in radar sensing technology, namely the miniaturization of sensor packages and the increase in measurement precision, has drawn the interest of the robotics research community. Indeed, a crucial task enabling autonomy in robotics is to precisely determine the pose of the robot in space. To fulfill this task, sensor fusion algorithms are often used, in which data from one or several exteroceptive sensors, such as LiDAR, camera, laser ranging sensor, or GNSS, are fused with Inertial Measurement Unit (IMU) measurements to obtain an estimate of the navigation states of the robot. Nonetheless, owing to their particular sensing principles, some exteroceptive sensors fail in extreme environmental conditions, such as extreme illumination or the presence of fine particles like smoke or fog. Radars are largely immune to the aforementioned factors thanks to the characteristics of the electromagnetic waves they use. In this thesis, we present Radar-Inertial Odometry (RIO) algorithms that fuse information from an IMU and a radar in order to estimate the navigation states of an Uncrewed Aerial Vehicle (UAV), running in real time on a portable, resource-constrained embedded computer and using inexpensive, consumer-grade sensors. We present novel RIO approaches relying on a multi-state tightly-coupled Extended Kalman Filter (EKF) and on Factor Graphs (FG), fusing the instantaneous velocities of, and distances to, 3D points delivered by a lightweight, low-cost, off-the-shelf Frequency Modulated Continuous Wave (FMCW) radar with IMU readings. We also show a novel way to exploit advances in deep learning to retrieve 3D point correspondences in sparse and noisy radar point clouds.
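One standard building block of such RIO pipelines is estimating the sensor's ego-velocity from a single FMCW radar scan: for a static scene, each return's measured radial speed equals minus the projection of the ego-velocity onto the unit direction to the target, which gives a linear least-squares problem. The NumPy sketch below illustrates this step; the function name and interface are assumptions, not the thesis's implementation.

```python
import numpy as np

def ego_velocity_from_doppler(directions, radial_speeds):
    """Least-squares ego-velocity from one radar scan: for a static
    scene, each target satisfies r_i = -d_i . v, where d_i is the unit
    direction to the target and r_i its measured radial speed."""
    D = np.asarray(directions, dtype=float)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)  # ensure unit rows
    r = np.asarray(radial_speeds, dtype=float)
    v, *_ = np.linalg.lstsq(-D, r, rcond=None)  # solve -D v ~= r
    return v
```

In a full EKF or factor-graph formulation, this velocity (or the per-point Doppler residuals directly) would enter as a measurement update alongside the IMU propagation, typically with outlier rejection to discard moving targets.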
https://arxiv.org/abs/2602.04631