Abstract
In this paper, we propose a novel language-guided 3D arbitrary neural style transfer method (CLIP3Dstyler). We aim to stylize any 3D scene with an arbitrary style from a text description and to synthesize the novel stylized view, which is more flexible than image-conditioned style transfer. Compared with the previous 2D method CLIPStyler, we are able to stylize a 3D scene and generalize to novel scenes without retraining our model. A straightforward solution is to combine previous image-conditioned 3D style transfer and text-conditioned 2D style transfer methods. However, such a solution cannot achieve our goal due to two main challenges. First, there is no multi-modal model that matches point clouds and language at different feature scales (e.g., low-level, high-level). Second, we observe a style mixing issue when we stylize the content with different style conditions from text prompts. To address the first issue, we propose a 3D stylization framework that matches point cloud features with text features in local and global views. For the second issue, we propose an improved directional divergence loss that makes arbitrary text styles more distinguishable, as a complement to our framework. We conduct extensive experiments to show the effectiveness of our model on text-guided 3D scene style transfer.
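For context, the directional CLIP loss introduced by CLIPStyler (which this paper extends to 3D) aligns the CLIP-space direction between a content image and its stylization with the direction between a source prompt and a style prompt. The sketch below is a minimal PyTorch illustration of that baseline formulation, not the authors' code; the abstract does not give the exact form of the improved directional divergence loss, so the `style_contrast_loss` term is a purely hypothetical illustration of pushing the edit directions of two different style prompts apart to reduce style mixing.

```python
# Minimal sketch of CLIPStyler's directional CLIP loss, plus a hypothetical
# style-contrast term; the exact improved loss in CLIP3Dstyler is not
# specified in the abstract.
import torch
import torch.nn.functional as F

def directional_clip_loss(content_feat: torch.Tensor,
                          stylized_feat: torch.Tensor,
                          src_text_feat: torch.Tensor,
                          style_text_feat: torch.Tensor) -> torch.Tensor:
    """Align the image edit direction with the text edit direction in CLIP space."""
    d_img = F.normalize(stylized_feat - content_feat, dim=-1)     # image direction
    d_txt = F.normalize(style_text_feat - src_text_feat, dim=-1)  # text direction
    return (1.0 - (d_img * d_txt).sum(dim=-1)).mean()

def style_contrast_loss(d_img_a: torch.Tensor, d_img_b: torch.Tensor) -> torch.Tensor:
    """Hypothetical divergence term: penalize similar edit directions under two
    different style prompts so their stylizations stay distinguishable."""
    return F.cosine_similarity(d_img_a, d_img_b, dim=-1).clamp(min=0.0).mean()

# Toy usage with random 512-d stand-ins for CLIP features:
feats = [F.normalize(torch.randn(4, 512), dim=-1) for _ in range(4)]
loss = directional_clip_loss(*feats)
```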
URL
https://arxiv.org/abs/2305.15732