Abstract
Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and augmented reality (AR). While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity objectives to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
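To make the CMAT objective more concrete, below is a minimal sketch of what a joint reconstruction / affinity / diversity loss could look like. The abstract does not give the actual formulation, so every name and term here (`chamfer_distance`, `cmat_loss`, the loss weights, and the specific choices of Chamfer reconstruction, cosine alignment to lifted 2D features, and a feature-decorrelation diversity penalty) is an illustrative assumption rather than the paper's method.

```python
# Hypothetical sketch of a CMAT-style joint pre-training objective.
# The loss definitions and weights are illustrative assumptions, not the
# paper's actual formulation.
import torch
import torch.nn.functional as F

def chamfer_distance(pred, target):
    # pred, target: (N, 3) and (M, 3) point sets; symmetric nearest-neighbor distance.
    d = torch.cdist(pred, target)             # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def cmat_loss(pred_points, gt_points, feats_3d, feats_2d,
              w_rec=1.0, w_aff=1.0, w_div=0.1):
    # Reconstruction: keep the 3D encoder geometry-aware.
    l_rec = chamfer_distance(pred_points, gt_points)

    # Affinity: pull each 3D point feature toward its lifted 2D VFM feature
    # via cosine similarity (feats_3d, feats_2d: (N, C), assumed pre-paired).
    f3 = F.normalize(feats_3d, dim=-1)
    f2 = F.normalize(feats_2d, dim=-1)
    l_aff = (1.0 - (f3 * f2).sum(dim=-1)).mean()

    # Diversity: discourage all point features from collapsing to one vector
    # by penalizing off-diagonal feature similarity.
    sim = f3 @ f3.t()                          # (N, N) cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim))
    l_div = off_diag.abs().mean()

    return w_rec * l_rec + w_aff * l_aff + w_div * l_div
```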
URL
https://arxiv.org/abs/2510.08316