Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D

2024-03-27 18:13:16
Mukund Varma T, Peihao Wang, Zhiwen Fan, Zhangyang Wang, Hao Su, Ravi Ramamoorthi

Abstract

In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization.

URL

https://arxiv.org/abs/2403.18922

PDF

https://arxiv.org/pdf/2403.18922.pdf

