Paper Reading AI Learner

Language-Assisted 3D Scene Understanding

2023-12-18 18:54:56
Yanmin Wu, Qiankun Gao, Renrui Zhang, Jian Zhang

Abstract

The scale and quality of point cloud datasets constrain the advancement of point cloud learning. Recently, with the development of multi-modal learning, the incorporation of domain-agnostic prior knowledge from other modalities, such as images and text, to assist point cloud feature learning has been considered a promising avenue. Existing methods have demonstrated the effectiveness of multi-modal contrastive training and feature distillation on point clouds. However, challenges remain, including the requirement for paired triplet data, redundancy and ambiguity in the features used as supervision, and the disruption of the original priors. In this paper, we propose a language-assisted approach to point cloud feature learning (LAST-PCL), enriching semantic concepts through LLM-based text enrichment. Through statistics-based, training-free significant feature selection, we achieve de-redundancy and feature dimensionality reduction without compromising textual priors. Furthermore, we present an in-depth analysis of the impact of text contrastive training on point clouds. Extensive experiments validate that the proposed method learns semantically meaningful point cloud features and achieves state-of-the-art or comparable performance on 3D semantic segmentation, 3D object detection, and 3D scene classification tasks. The source code is available at this https URL.
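As a rough illustration of the two ingredients the abstract names, the sketch below shows one plausible (hedged) reading: "statistics-based, training-free significant feature selection" is interpreted as keeping the highest-variance dimensions of frozen text embeddings (e.g., from a CLIP-style encoder), and text contrastive training as an InfoNCE-style loss between point features and the reduced text features. This is not the authors' released code; the variance criterion, dimension count, temperature, and all function names are illustrative assumptions.

# Hedged sketch, not the authors' implementation: variance-based, training-free
# selection of "significant" text-embedding dimensions, followed by an
# InfoNCE-style point-text contrastive loss. All names and constants are assumptions.
import torch
import torch.nn.functional as F

def select_significant_dims(text_feats: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` highest-variance dimensions of frozen text embeddings.

    text_feats: (num_categories, dim) embeddings from a frozen text encoder.
    Returns indices of the selected dimensions; no training is involved.
    """
    variance = text_feats.var(dim=0)      # per-dimension spread across categories
    return variance.topk(keep).indices    # high variance ~ more discriminative

def point_text_contrastive_loss(point_feats: torch.Tensor,
                                text_feats: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling each point feature toward its category's text feature."""
    point_feats = F.normalize(point_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = point_feats @ text_feats.t() / temperature   # (num_points, num_categories)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    torch.manual_seed(0)
    text = torch.randn(20, 512)                # 20 categories, 512-d frozen text embeddings
    dims = select_significant_dims(text, keep=128)
    text_reduced = text[:, dims]               # de-redundant, lower-dimensional text priors
    points = torch.randn(1024, 128)            # point features projected to the same width
    labels = torch.randint(0, 20, (1024,))     # per-point category labels
    print(point_text_contrastive_loss(points, text_reduced, labels))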

Abstract (translated)

The scale and quality of point cloud datasets constrain the progress of point cloud learning. Recently, with the development of multi-modal learning, incorporating domain-agnostic prior knowledge from other modalities, such as images and text, to assist point cloud feature learning has been regarded as a promising direction. Existing methods have demonstrated the effectiveness of multi-modal contrastive training and feature distillation on point clouds. However, challenges remain, including the need for paired triplet data, redundancy and ambiguity in the features used as supervision, and the disruption of the original priors. In this paper, we propose a language-assisted point cloud feature learning method (LAST-PCL), which enriches semantic concepts through LLM-based text enrichment. Through statistics-based, training-free significant feature selection, we achieve de-redundancy and feature dimensionality reduction without sacrificing textual priors. In addition, we conduct an in-depth study of the impact of text contrastive training on point clouds. Extensive experiments confirm that the proposed method learns semantically meaningful point cloud features and achieves performance comparable to or better than the state of the art on 3D semantic segmentation, 3D object detection, and 3D scene classification tasks. The source code is available at this https URL.

URL

https://arxiv.org/abs/2312.11451

PDF

https://arxiv.org/pdf/2312.11451.pdf

