
Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

2025-10-09 15:01:26
Yu Huang, Zelin Peng, Changsong Wen, Xiaokang Yang, Wei Shen

Abstract

Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
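
The abstract describes CMAT as a joint objective over reconstruction, affinity, and diversity terms that aligns a 3D encoder with lifted 2D features. The sketch below illustrates one plausible form of such an objective; the function names, loss formulations (MSE reconstruction, cosine-affinity matching, an off-diagonal diversity penalty), and weights are assumptions for illustration, not the paper's actual definitions.

# Hypothetical sketch of a CMAT-style joint objective. All names and
# exact loss forms are assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def affinity_matrix(feats: torch.Tensor) -> torch.Tensor:
    # Pairwise cosine-similarity (affinity) matrix over N features.
    feats = F.normalize(feats, dim=-1)   # (N, D)
    return feats @ feats.t()             # (N, N)

def cmat_loss(student_3d, teacher_2d, recon_xyz, target_xyz,
              w_rec=1.0, w_aff=1.0, w_div=0.1):
    # student_3d: (N, D) features from the 3D encoder being pre-trained.
    # teacher_2d: (N, D') lifted 2D VFM features (frozen teacher).
    # recon_xyz / target_xyz: (N, 3) decoded vs. ground-truth coordinates.

    # Reconstruction: preserve geometry (plain MSE here; a Chamfer
    # distance would be a common alternative).
    l_rec = F.mse_loss(recon_xyz, target_xyz)

    # Affinity transfer: match the student's pairwise similarity
    # structure to the teacher's, so 3D features inherit the semantic
    # organization of the lifted 2D features.
    l_aff = F.mse_loss(affinity_matrix(student_3d),
                       affinity_matrix(teacher_2d))

    # Diversity: discourage feature collapse by penalizing high average
    # off-diagonal similarity among student features.
    aff = affinity_matrix(student_3d)
    off_diag = aff - torch.diag_embed(torch.diagonal(aff))
    l_div = off_diag.abs().mean()

    return w_rec * l_rec + w_aff * l_aff + w_div * l_div

# Example call with random tensors standing in for real encoder outputs.
N, D = 1024, 256
loss = cmat_loss(torch.randn(N, D), torch.randn(N, D),
                 torch.randn(N, 3), torch.randn(N, 3))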


URL

https://arxiv.org/abs/2510.08316

PDF

https://arxiv.org/pdf/2510.08316.pdf

