Paper Reading AI Learner

Vox-E: Text-guided Voxel Editing of 3D Objects

2023-03-21 17:36:36
Etai Sella, Gal Fiebelman, Peter Hedman, Hadar Averbuch-Elor

Abstract

Large-scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only structure-and-appearance coupled 2D projections. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a myriad of edits that cannot be achieved by prior works.
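The abstract describes the volumetric regularization loss only at a high level: it operates directly on the explicit 3D grids and encourages correlation between the global structure of the original and edited object. As a rough illustration only (the Pearson-style correlation measure below is our assumption, not the paper's exact formulation), such a loss could compare the two density grids directly in 3D:

```python
import numpy as np

def volumetric_correlation_loss(density_orig, density_edit, eps=1e-8):
    # Hypothetical sketch: penalize decorrelation between the original
    # and edited 3D density grids via negative Pearson correlation.
    a = density_orig.ravel() - density_orig.mean()
    b = density_edit.ravel() - density_edit.mean()
    corr = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - corr  # approaches 0 when the grids are perfectly correlated

rng = np.random.default_rng(0)
grid = rng.random((8, 8, 8))          # stand-in for a learned density grid
print(round(volumetric_correlation_loss(grid, grid), 6))  # -> 0.0
```

Because the comparison happens in 3D rather than on rendered 2D projections, structure can be constrained independently of appearance, which is the conflict the abstract highlights with purely image-based regularizers.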

URL

https://arxiv.org/abs/2303.12048

PDF

https://arxiv.org/pdf/2303.12048.pdf
