Paper Reading AI Learner

Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation

2025-06-16 21:26:45
Nick Yiwen Huang, Akin Caliskan, Berkay Kicanaoglu, James Tompkin, Hyeongwoo Kim

Abstract

We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle using canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. But another form of entanglement arises from the significant noise in the LVLM's embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome this with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing resources to label large data or train large models.
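The abstract does not spell out the stochastic approximator, but a standard way to estimate a Jacobian Frobenius-norm penalty without forming the full Jacobian is a Hutchinson-style estimator: for random vectors u, E[||Jᵀu||²] = ||J||²_F, so a few vector-Jacobian products suffice. The sketch below (in PyTorch; the function name and the choice of Gaussian probes are illustrative assumptions, not the paper's exact method) shows this pattern:

```python
import torch

def stochastic_jacobian_penalty(f, z, num_projections=1):
    """Hutchinson-style estimate of ||J_f(z)||_F^2.

    Uses E_u[||J^T u||^2] = ||J||_F^2 for u ~ N(0, I), so only
    `num_projections` vector-Jacobian products are needed instead
    of the full Jacobian. Illustrative sketch, not the paper's code.
    """
    z = z.detach().requires_grad_(True)
    y = f(z)
    penalty = 0.0
    for _ in range(num_projections):
        u = torch.randn_like(y)  # random probe in the output space
        # Vector-Jacobian product J^T u; create_graph=True keeps the
        # graph so the penalty itself can be backpropagated in training.
        (vjp,) = torch.autograd.grad(y, z, grad_outputs=u, create_graph=True)
        penalty = penalty + (vjp ** 2).sum()
    return penalty / num_projections
```

For a linear map f(z) = 3z on an 8-dimensional input, ||J||²_F = 9·8 = 72, and the estimator converges to that value as the number of probes grows; in practice a single probe per training step is common, trading variance for speed.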

URL

https://arxiv.org/abs/2506.14015

PDF

https://arxiv.org/pdf/2506.14015.pdf

