ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary

2025-05-31 23:03:54
Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, Yifan Ding

Abstract

Designing 3D scenes is traditionally a challenging task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. First, we generate 2D images from a scene description, then extract the shape and appearance of objects to create 3D models. These models are assembled into the final scene using geometry, position, and pose information derived from the same intermediary image. Being generalizable to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art benchmarks by a large margin in layout and aesthetic quality by quantitative metrics. It also averages a 74.89% winning rate in extensive user studies and 95.07% in GPT-4o evaluation.
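The stage order described in the abstract (text to image, image to per-object information, objects to 3D assets, assets placed with layout taken from the same image) can be pictured with a small sketch. This is only an illustration of the data flow, not the authors' implementation: every name here (`DetectedObject`, `PlacedAsset`, `build_scene`, and the three stage callables) is hypothetical, and the actual models behind each stage are left as caller-supplied functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DetectedObject:
    """One object found in the intermediary image (hypothetical structure)."""
    label: str
    position: Tuple[float, float, float]  # estimated position in the scene
    yaw_degrees: float                    # estimated rotation about the up axis


@dataclass
class PlacedAsset:
    """A 3D asset plus the placement derived from the intermediary image."""
    label: str
    mesh: str                             # stand-in for a real mesh handle
    position: Tuple[float, float, float]
    yaw_degrees: float


def build_scene(
    prompt: str,
    text_to_image: Callable[[str], str],
    extract_objects: Callable[[str], List[DetectedObject]],
    make_asset: Callable[[DetectedObject], str],
) -> List[PlacedAsset]:
    """Chain the stages; each stage is an assumed, caller-provided function."""
    image = text_to_image(prompt)          # 1. scene description -> 2D image
    objects = extract_objects(image)       # 2. image -> objects with layout info
    return [
        # 3. one 3D asset per object, placed with the pose from the same image
        PlacedAsset(obj.label, make_asset(obj), obj.position, obj.yaw_degrees)
        for obj in objects
    ]
```

Passing the stages in as functions keeps the sketch training-free in spirit: any off-the-shelf image generator, detector, or 3D asset generator could be plugged in without changing the assembly step.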

URL

https://arxiv.org/abs/2506.00742

PDF

https://arxiv.org/pdf/2506.00742.pdf
