Reliable surface completion from sparse point clouds underpins many applications spanning content creation and robotics. While 3D diffusion transformers attain state-of-the-art results on this task, we uncover that they exhibit a catastrophic mode of failure: arbitrarily small on-surface perturbations to the input point cloud can fracture the output into multiple disconnected pieces -- a phenomenon we call Meltdown. Using activation-patching from mechanistic interpretability, we localize Meltdown to a single early denoising cross-attention activation. We find that the singular-value spectrum of this activation provides a scalar proxy: its spectral entropy rises when fragmentation occurs and returns to baseline when patched. Interpreted through diffusion dynamics, we show that this proxy tracks a symmetry-breaking bifurcation of the reverse process. Guided by this insight, we introduce PowerRemap, a test-time control that stabilizes sparse point-cloud conditioning. We demonstrate that Meltdown persists across state-of-the-art architectures (WaLa, Make-a-Shape), datasets (GSO, SimJEB) and denoising strategies (DDPM, DDIM), and that PowerRemap effectively counters this failure with stabilization rates of up to 98.3%. Overall, this work is a case study on how diffusion model behavior can be understood and guided based on mechanistic analysis, linking a circuit-level cross-attention mechanism to diffusion-dynamics accounts of trajectory bifurcations.
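The spectral-entropy proxy described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: the function name `spectral_entropy` and the convention of normalizing the raw singular values into a probability distribution are assumptions (some definitions normalize the squared singular values instead).

```python
import numpy as np

def spectral_entropy(activation: np.ndarray) -> float:
    """Shannon entropy of the normalized singular-value spectrum."""
    s = np.linalg.svd(activation, compute_uv=False)
    p = s / s.sum()          # normalize singular values to a distribution
    p = p[p > 0]             # guard against log(0)
    return float(-(p * np.log(p)).sum())

# A near-rank-1 (coherent) activation has low entropy; an isotropic
# full-rank one approaches log(min(m, n)).
rng = np.random.default_rng(0)
coherent = np.outer(rng.normal(size=64), rng.normal(size=32))
isotropic = rng.normal(size=(64, 32))
assert spectral_entropy(coherent) < spectral_entropy(isotropic)
```

Under this reading, a rise in entropy means attention mass spreading over many directions at once, which is consistent with the fragmentation the paper reports.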
https://arxiv.org/abs/2602.11130
Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from a query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer that derives the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to geometrically coherent regions and a spatial dual-path SSM block that captures underlying dependencies within the query set by integrating associated coordinate information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 of the FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on the ScanNet, ScanNet200, S3DIS, and ScanNet++ V1 benchmarks at lower computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at this https URL.
https://arxiv.org/abs/2602.11007
Semantic segmentation of 3D point clouds is important for many applications, such as autonomous driving. Training semantic segmentation models requires labeled point cloud segmentation datasets, yet point cloud labeling is time-consuming for annotators, typically involving tuning the camera viewpoint and selecting points by lasso. To reduce this labeling time cost, we propose a viewpoint recommendation approach. We adapt Fitts' law to model the time cost of lasso selection in point clouds, and recommend to the annotator the viewpoint that minimizes this modeled cost. We build a data labeling system for semantic segmentation of 3D point clouds that integrates our viewpoint recommendation approach; the system enables users to navigate to recommended viewpoints for efficient annotation. Through an ablation study, we observed that our approach effectively reduced the data labeling time cost. We also qualitatively compare our approach with previous viewpoint selection approaches on different datasets.
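The Fitts'-law time model lends itself to a small sketch. The constants `a` and `b`, and the decomposition of a lasso stroke into (distance, tolerance-width) segments, are hypothetical placeholders for illustration, not the paper's fitted model:

```python
import math

def fitts_time(distance: float, width: float,
               a: float = 0.1, b: float = 0.15) -> float:
    """Fitts' law, Shannon formulation: time to acquire a target of
    size `width` at `distance`; a and b are empirically fitted."""
    return a + b * math.log2(distance / width + 1.0)

def lasso_cost(segments):
    """Total lasso-selection cost, modeled as the sum of Fitts-law
    times over the stroke's (distance, tolerance-width) segments."""
    return sum(fitts_time(d, w) for d, w in segments)

# A viewpoint that lets the region be enclosed with fewer, wider
# strokes is predicted to be cheaper to annotate:
tight = lasso_cost([(200.0, 4.0)] * 6)    # many narrow strokes
loose = lasso_cost([(200.0, 40.0)] * 3)   # fewer, wider strokes
assert loose < tight
```

A viewpoint recommender in this spirit would score candidate viewpoints by the modeled `lasso_cost` of the projected selection and pick the minimizer.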
https://arxiv.org/abs/2602.10871
Cross-category anomaly detection for 3D point clouds aims to determine whether an unseen object belongs to a target category using only a few normal examples. Most existing methods rely on category-specific training, which limits their flexibility in few-shot scenarios. In this paper, we propose DMP-3DAD, a training-free framework for cross-category 3D anomaly detection based on multi-view realistic depth map projection. Specifically, by converting point clouds into a fixed set of realistic depth images, our method leverages a frozen CLIP visual encoder to extract multi-view representations and performs anomaly detection via weighted feature similarity, requiring no fine-tuning or category-dependent adaptation. Extensive experiments on the ShapeNetPart dataset demonstrate that DMP-3DAD achieves state-of-the-art performance under the few-shot setting. The results show that the proposed approach provides a simple yet effective solution for practical cross-category 3D anomaly detection.
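The weighted feature-similarity scoring can be sketched as follows, with the depth rendering and CLIP encoding abstracted away into precomputed embeddings. The function name `anomaly_score`, the memory-bank layout, and the per-view weights are illustrative assumptions, not the DMP-3DAD implementation:

```python
import numpy as np

def anomaly_score(query_views, normal_bank, view_weights):
    """Weighted multi-view anomaly score from precomputed embeddings.

    query_views : (V, D) one embedding per rendered depth view
    normal_bank : (N, V, D) embeddings of the N normal examples
    view_weights: (V,) per-view weights summing to 1
    """
    q = query_views / np.linalg.norm(query_views, axis=-1, keepdims=True)
    b = normal_bank / np.linalg.norm(normal_bank, axis=-1, keepdims=True)
    # cosine similarity to every normal example, best match per view
    sims = np.einsum('vd,nvd->nv', q, b).max(axis=0)      # (V,)
    return float(1.0 - (view_weights * sims).sum())       # high sim -> low score

rng = np.random.default_rng(1)
bank = rng.normal(size=(3, 4, 16))       # 3 normal shapes, 4 views, 16-dim
weights = np.full(4, 0.25)
seen = anomaly_score(bank[0], bank, weights)               # a known-normal object
novel = anomaly_score(rng.normal(size=(4, 16)), bank, weights)
assert seen < 1e-6 and seen < novel
```

Because everything downstream of the frozen encoder is a similarity lookup, the whole detector stays training-free, which is the point of the framework.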
https://arxiv.org/abs/2602.10806
Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires large amounts of demonstrations, creating significant collection costs. Prior work has investigated using flow as an intermediate representation so that human videos can substitute for robot demonstrations, thereby reducing the number required. However, most prior work computes flow either on the object or on specific points of the robot/hand, which cannot describe the motion of the interaction itself. Meanwhile, relying on flow alone to generalize to scenarios observed only in human videos remains limited, as flow cannot capture precise motion details. Furthermore, conditioning on scene observations to produce precise actions may cause a flow-conditioned policy to overfit to training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which comprises a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts the trajectories of arbitrary points. FCrP follows the general flow motion and adjusts the action based on observations for precision tasks. Our method outperforms SOTA baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.
https://arxiv.org/abs/2602.10594
LiDAR sensors are a key modality for 3D perception, yet they are typically designed independently of downstream tasks such as point cloud registration. Conventional registration operates on pre-acquired datasets with fixed LiDAR configurations, leading to suboptimal data collection and significant computational overhead for sampling, noise filtering, and parameter tuning. In this work, we propose an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters, jointly optimizing LiDAR acquisition and registration hyperparameters. By integrating registration feedback into the sensing loop, our approach optimally balances point density, noise, and sparsity, improving registration accuracy and efficiency. Evaluations in the CARLA simulation demonstrate that our method outperforms fixed-parameter baselines while retaining generalization abilities, highlighting the potential of adaptive LiDAR for autonomous perception and robotic applications.
https://arxiv.org/abs/2602.10492
3D Gaussian Splatting (3DGS) has rapidly become a standard for high-fidelity 3D reconstruction, yet its adoption in several critical domains is hindered by the lack of interpretability of both the generative models and the classification of the splats. While explainability methods exist for other 3D representations, such as point clouds, they typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives. We introduce XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3DGS classification. Our approach leverages a voxel-aggregated PointNet backbone and a novel, invertible orthogonal transformation that disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, enabling intuitive "this looks like that" reasoning without any degradation in classification performance. A rigorous user study (N=51) demonstrates a decisive preference for our approach: participants selected XSPLAIN explanations as the best 48.4% of the time, significantly outperforming baselines (p < 0.001), showing that XSPLAIN provides transparency and earns user trust. The source code for this work is available at: this https URL
https://arxiv.org/abs/2602.10239
3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
https://arxiv.org/abs/2602.10101
3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline, VideoAfford, which equips multimodal large language models with additional affordance segmentation capabilities, enabling both world-knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function that enables VideoAfford to acquire comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.
https://arxiv.org/abs/2602.09638
Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$) but degrades accuracy in higher-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a cold-start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20 ft, 40 ft, and 53 ft containers) entirely without costly training or fine-tuning, significantly reducing the intensive demands of initial manual labeling and yielding a method of practical use in ITS applications.
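The core idea of a depth-encoded 2D visual proxy can be sketched with a bare orthographic z-buffer projection. This is a minimal stand-in, assuming a simple grid rasterization; the paper's actual pipeline adds noise removal, registration, rectification, morphology, and anisotropic smoothing on top:

```python
import numpy as np

def depth_proxy(points: np.ndarray, res: int = 64) -> np.ndarray:
    """Orthographically project an (N, 3) point cloud onto the XY plane,
    encoding per-pixel nearest depth as an 8-bit grayscale image."""
    xy, z = points[:, :2], points[:, 2]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    ij = ((xy - lo) / (hi - lo + 1e-9) * (res - 1)).astype(int)
    img = np.full((res, res), np.inf)
    for (i, j), depth in zip(ij, z):
        img[i, j] = min(img[i, j], depth)   # z-buffer: keep closest point
    mask = np.isfinite(img)
    out = np.zeros((res, res), dtype=np.uint8)
    if mask.any():
        zmin, zmax = img[mask].min(), img[mask].max()
        # closer points rendered brighter; empty pixels stay 0
        out[mask] = (255 * (1 - (img[mask] - zmin)
                            / (zmax - zmin + 1e-9))).astype(np.uint8)
    return out

proxy = depth_proxy(np.random.default_rng(0).normal(size=(500, 3)))
```

The resulting grayscale image is what a frozen VLM image encoder would consume in place of the raw sparse scan.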
https://arxiv.org/abs/2602.09425
This work presents a finite-time stable pose estimator (FTS-PE) for rigid bodies undergoing rotational and translational motion in three dimensions, using measurements from onboard sensors that provide position vectors to inertially-fixed points and body velocities. The FTS-PE is a full-state observer for the pose (position and orientation) and velocities and is obtained through a Lyapunov analysis that shows its stability in finite time and its robustness to bounded measurement noise. Further, this observer is designed directly on the state space, the tangent bundle of the Lie group of rigid body motions, SE(3), without using local coordinates or (dual) quaternion representations. Therefore, it can estimate arbitrary rigid body motions without encountering singularities or the unwinding phenomenon and be readily applied to autonomous vehicles. A version of this observer that does not need translational velocity measurements and uses only point clouds and angular velocity measurements from rate gyros, is also obtained. It is discretized using the framework of geometric mechanics for numerical and experimental implementations. The numerical simulations compare the FTS-PE with a dual-quaternion extended Kalman filter and our previously developed variational pose estimator (VPE). The experimental results are obtained using point cloud images and rate gyro measurements obtained from a Zed 2i stereo depth camera sensor. These results validate the stability and robustness of the FTS-PE.
https://arxiv.org/abs/2602.09414
A 3D understanding of anatomy is central to diagnosis and treatment planning, yet volumetric imaging remains costly with long wait times. Image-to-3D foundation models could address this by reconstructing 3D data from 2D modalities. Current foundation models are trained on natural image distributions to reconstruct naturalistic objects from a single image by leveraging geometric priors across pixels. However, it is unclear whether these learned geometric priors transfer to medical data. In this study, we present a controlled zero-shot benchmark of single-slice medical image-to-3D reconstruction across five state-of-the-art image-to-3D models: SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG. These are evaluated across six medical datasets spanning anatomical and pathological structures, plus two natural datasets, using voxel-based metrics and point cloud distance metrics. Across the medical datasets, voxel-based overlap remains moderate for all models, consistent with a depth reconstruction failure mode when inferring volume from a single slice. In contrast, global distance metrics show more separation between methods: SAM3D achieves the strongest overall topological similarity to ground-truth medical 3D data, while alternative models are more prone to over-simplification of the reconstruction. Our results quantify the limits of single-slice medical reconstruction and highlight the depth ambiguity caused by the planar nature of 2D medical data, motivating multi-view image-to-3D reconstruction to enable reliable medical 3D inference.
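Of the two metric families mentioned, the point cloud distance side can be illustrated with a brute-force Chamfer distance. This uses one common convention (unsquared nearest-neighbor distances, summed over both directions); the benchmark's exact variant may differ:

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds p (M, 3), q (N, 3):
    mean nearest-neighbor distance in each direction, summed."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)   # (M, N)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

pts = np.random.default_rng(0).normal(size=(100, 3))
assert chamfer_distance(pts, pts) == 0.0          # identical clouds
shifted = pts + np.array([0.5, 0.0, 0.0])
cd = chamfer_distance(pts, shifted)
assert 0.0 < cd <= 1.0   # each point's twin is only 0.5 away
```

Unlike voxel overlap, this global distance penalizes gross shape errors even when thin structures make voxel-wise agreement uninformative, which is why the two metric families separate the models differently.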
https://arxiv.org/abs/2602.09407
We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers fundamental density-stratification. Dense regions exhibit geometry-correlated parameters amenable to render-free prediction, while sparse regions show systematic failure across architectures. We formalize this through variance decomposition, demonstrating that visibility heterogeneity creates covariance-dominated coupling between geometric and appearance parameters in sparse regions. This reveals the dual character of RORs: geometric primitives where point clouds suffice, and view synthesis primitives where multi-view constraints are essential. We provide density-aware strategies that improve training robustness and discuss architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.
https://arxiv.org/abs/2602.08909
Point cloud is a prevalent 3D data representation format with significant application value in immersive media, autonomous driving, digital heritage protection, etc. However, the large data size of point clouds poses challenges to transmission and storage, which hinders wide deployment. Therefore, point cloud compression plays a crucial role in practical applications, optimizing for both human and machine perception. To this end, the Moving Picture Experts Group (MPEG) has established two standards for point cloud compression: Geometry-based Point Cloud Compression (G-PCC) and Video-based Point Cloud Compression (V-PCC). In the meantime, the Audio Video coding Standard (AVS) Workgroup of China has also launched and completed the development of its first-generation point cloud compression standard, AVS PCC. This new standardization effort adopts many new coding tools and techniques that differ from those of the counterpart standards. This paper reviews the AVS PCC standard from two perspectives: the related technologies and performance comparisons.
https://arxiv.org/abs/2602.08613
Object-level segmentation in dynamic 4D Gaussian scenes remains challenging due to complex motion, occlusions, and ambiguous boundaries. In this paper, we present an efficient learning-free 4D Gaussian segmentation framework that lifts video segmentation masks into 4D space, whose core is a two-stage iterative boundary refinement, TIBR4D. The first stage is Iterative Gaussian Instance Tracing (IGIT) at the temporal-segment level: it progressively refines Gaussian-to-instance probabilities through iterative tracing and extracts the corresponding Gaussian point clouds, which handle occlusions and preserve the completeness of object structures better than existing one-shot threshold-based methods. The second stage is a frame-wise Gaussian Rendering Range Control (RCC) that suppresses highly uncertain Gaussians near object boundaries while retaining their core contributions, yielding more accurate boundaries. Furthermore, a temporal segmentation merging strategy is proposed for IGIT to balance identity consistency and dynamic awareness: longer segments enforce stronger multi-frame constraints for stable identities, while shorter segments allow identity changes to be captured promptly. Experiments on HyperNeRF and Neu3D demonstrate that our method produces accurate object Gaussian point clouds with clearer boundaries and higher efficiency than SOTA methods.
https://arxiv.org/abs/2602.08540
Event cameras have been widely adopted in safety-critical domains such as autonomous driving, robotics, and human-computer interaction. A pressing challenge arises from the vulnerability of deep neural networks to adversarial examples, which poses a significant threat to the reliability of event-based systems. Nevertheless, research into adversarial attacks on events is scarce. This is primarily due to the non-differentiable nature of mainstream event representations, which hinders the extension of gradient-based attack methods. In this paper, we propose MA-ADV, a novel Motion-Aware Adversarial framework. To the best of our knowledge, this is the first work to generate adversarial events by leveraging point cloud representations. MA-ADV accounts for high-frequency noise in events and employs a diffusion-based approach to smooth perturbations, while fully leveraging the spatial and temporal relationships among events. Finally, MA-ADV identifies the minimal-cost perturbation through a combination of sample-wise Adam optimization, iterative refinement, and binary search. Extensive experimental results validate that MA-ADV ensures a 100% attack success rate with minimal perturbation cost, and also demonstrate enhanced robustness against defenses, underscoring the critical security challenges facing future event-based perception systems.
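Of the three components in the minimal-cost search, the binary-search step alone can be sketched as below, with a stand-in monotone success predicate in place of a real attack; the function name `minimal_scale` and the toy threshold are illustrative assumptions:

```python
def minimal_scale(succeeds, lo: float = 0.0, hi: float = 1.0,
                  iters: int = 30) -> float:
    """Binary search for the smallest perturbation scale in [lo, hi]
    at which the monotone predicate `succeeds(scale)` becomes True
    (succeeds(hi) is assumed True and succeeds(lo) False)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if succeeds(mid):
            hi = mid     # attack still succeeds: shrink the budget
        else:
            lo = mid     # attack fails: grow the budget
    return hi

# toy stand-in: the "attack" succeeds once the scale reaches 0.37
scale = minimal_scale(lambda s: s >= 0.37)
assert abs(scale - 0.37) < 1e-6
```

In the real attack, the predicate would run the Adam-optimized perturbation at the given scale and check misclassification; the search brackets the minimal cost to arbitrary precision in logarithmically many queries.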
https://arxiv.org/abs/2602.08230
Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
https://arxiv.org/abs/2602.07891
The accuracy of 3D models created from medical scans depends on imaging hardware, segmentation methods, and mesh processing techniques, among other factors. The effects of geometry type, class imbalance, and voxel and point cloud alignment on accuracy remain to be thoroughly explored. This work evaluates the errors across the reconstruction pipeline and explores the use of voxel- and surface-based accuracy metrics for different segmentation algorithms and geometry types. A sphere, a facemask, and an abdominal aortic aneurysm (AAA) were printed using the SLA technique and scanned using a micro-CT machine. Segmentation was performed using GMM-, Otsu-, and RG-based methods. Segmented and reference models, aligned using the KU algorithm, were quantitatively compared using metrics such as the Dice score, Jaccard index, and precision. Surface meshes were registered with reference meshes using an ICP-based alignment process, and metrics such as the Chamfer distance and average Hausdorff distance were evaluated. The Otsu method was found to be the most suitable for all geometries. AAA yielded low overlap scores due to its small wall thickness and misalignment. The effect of class imbalance on specificity was most pronounced for AAA. Surface-based accuracy metrics differed from the voxel-based trends: the RG method performed best for the sphere, while GMM and Otsu performed better for AAA. The facemask surface was the most error-prone, possibly due to misalignment during the ICP process. Segmentation accuracy is a cumulative sum of errors across different stages of the reconstruction process. High voxel-based accuracy metrics may be misleading in cases of high class imbalance and sensitivity to alignment. The Jaccard index is more stringent than the Dice score and more suitable for accuracy assessment of thin-walled structures. Voxel and point cloud alignment should be ensured before making any reliable assessment of the reconstruction pipeline.
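The claim that the Jaccard index is stricter than Dice follows from the identity J = D / (2 - D), which is below D for any D in (0, 1). A small sketch on a shifted thin wall makes this concrete (the toy masks are hypothetical, not the study's data):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    return float(2.0 * inter / (a.sum() + b.sum()))

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    return float(inter / np.logical_or(a, b).sum())

# a 1-voxel-thick wall, misaligned by 3 voxels along its length
wall = np.zeros((1, 20), dtype=bool)
wall[0, 5:15] = True
moved = np.roll(wall, 3, axis=1)
d, j = dice(wall, moved), jaccard(wall, moved)
assert abs(j - d / (2.0 - d)) < 1e-9   # identity J = D / (2 - D)
assert j < d                            # Jaccard is the stricter score
```

For thin-walled structures like the AAA, a small misalignment shifts a large fraction of the foreground, so the gap between the two scores widens exactly where the study reports Jaccard being more informative.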
https://arxiv.org/abs/2602.07658
Large-scale, high-resolution forest canopy height mapping plays a crucial role in understanding regional and global carbon and water cycles. Spaceborne LiDAR missions, including the Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2) and the Global Ecosystem Dynamics Investigation (GEDI), provide global observations of forest structure but are spatially sparse and subject to inherent uncertainties. In contrast, near-surface LiDAR platforms, such as airborne and unmanned aerial vehicle (UAV) LiDAR systems, offer much finer measurements of forest canopy structure, and a growing number of countries have made these datasets openly available. In this study, a state-of-the-art monocular depth estimation model, Depth Anything V2, was trained using approximately 16,000 km2 of canopy height models (CHMs) derived from publicly available airborne LiDAR point clouds and related products across multiple countries, together with 3 m resolution PlanetScope and airborne RGB imagery. The trained model, referred to as Depth2CHM, enables the estimation of spatially continuous CHMs directly from PlanetScope RGB imagery. Independent validation was conducted at sites in China (approximately 1 km2) and the United States (approximately 116 km2). The results showed that Depth2CHM could accurately estimate canopy height, with biases of 0.59 m and 0.41 m and root mean square errors (RMSEs) of 2.54 m and 5.75 m for these two sites, respectively. Compared with an existing global meter-resolution CHM product, the mean absolute error is reduced by approximately 1.5 m and the RMSE by approximately 2 m. These results demonstrated that monocular depth estimation networks trained with large-scale airborne LiDAR-derived canopy height data provide a promising and scalable pathway for high-resolution, spatially continuous forest canopy height estimation from satellite RGB imagery.
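The validation statistics reported for Depth2CHM (bias and RMSE against reference canopy heights) are standard per-pixel error measures; a minimal sketch with made-up height values, not the study's data:

```python
import math

def bias_rmse(pred, ref):
    """Mean error (bias) and root mean square error between predicted
    and reference canopy heights, both in meters."""
    errors = [p - r for p, r in zip(pred, ref)]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse

# Hypothetical per-pixel canopy heights (m): model predictions vs an
# airborne-LiDAR-derived CHM used as reference.
pred = [12.1, 18.4, 25.0, 7.9, 15.6]
ref = [11.5, 19.0, 24.2, 8.3, 15.0]
bias, rmse = bias_rmse(pred, ref)
print(f"bias={bias:.2f} m  RMSE={rmse:.2f} m")
```

Bias captures systematic over- or under-estimation of height (0.41-0.59 m in the study), while RMSE additionally penalizes pixel-wise scatter, which is why the two validation sites can have similar bias but quite different RMSE.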
https://arxiv.org/abs/2602.06503
Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.
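The asymmetric fusion described above (geometric features acting as queries over semantic keys and values) is the standard cross-attention pattern. A dependency-free, single-head sketch with toy feature vectors; the shapes and values are illustrative assumptions, not the paper's architecture:

```python
import math

def cross_attention(queries, keys, values):
    """Single-head cross-attention: each geometric query attends over
    semantic keys and returns a softmax-weighted mix of the values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                       # stabilize the softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy setup: 2 geometric queries, 3 semantic tokens with 2-D features.
geom_q = [[1.0, 0.0], [0.0, 1.0]]
sem_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
sem_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_attention(geom_q, sem_k, sem_v)
print(fused)  # each query pulls toward its best-matching semantic token
```

The asymmetry is the point: geometry supplies only the queries, so bottom-up distinctive regions decide *where* to look, while the retrieved content is entirely semantic.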
https://arxiv.org/abs/2602.06419