Abstract
Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning; the CLIP model in particular has achieved impressive results in image classification, object detection, and semantic segmentation. However, its performance on 3D point cloud processing tasks is limited by the domain gap between the depth maps obtained from 3D projection and the training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates Stable Diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. Using Stable Diffusion and style-prompt generation, DiffCLIP achieves 43.2\% zero-shot classification accuracy on OBJ\_BG of ScanObjectNN, which is state-of-the-art, and 80.6\% zero-shot classification accuracy on ModelNet10, which is comparable to the state of the art.
URL
https://arxiv.org/abs/2305.15957