StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance

2025-09-16 17:55:20
Zefan Qu, Zhenwei Wang, Haoyuan Wang, Ke Xu, Gerhard Hancke, Rynson W. H. Lau

Abstract

Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.
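The variance-based channel selection and selective feature injection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes flat (patches x channels) NumPy arrays in place of real 3D feature patches, it assumes that channels with higher variance across style patches are the "style-significant" ones (the abstract does not say which direction is used), and the function names and the `ratio` and `strength` parameters are hypothetical, introduced only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_style_channels(style_feats, ratio=0.5):
    """Return a boolean channel mask; True marks channels treated as style-significant.

    ASSUMPTION: high variance across patches means style-significant.
    The abstract only states that patch variance is used for the split.
    """
    var = style_feats.var(axis=0)             # per-channel variance over patches
    k = max(1, int(round(ratio * var.size)))  # number of channels to select
    mask = np.zeros(var.size, dtype=bool)
    mask[np.argsort(var)[-k:]] = True         # k highest-variance channels
    return mask


def style_guided_attention(content_feats, style_feats, ratio=0.5, strength=1.0):
    """Cross-attend content queries to style keys/values, then inject the
    attended features only into the selected channels.

    content_feats: (N, C) array, style_feats: (M, C) array.
    strength in [0, 1] is a hypothetical style-intensity knob.
    """
    d = content_feats.shape[1]
    attn = softmax(content_feats @ style_feats.T / np.sqrt(d), axis=-1)  # (N, M)
    fused = attn @ style_feats                                           # (N, C)
    mask = select_style_channels(style_feats, ratio)
    out = content_feats.copy()
    out[:, mask] = (1.0 - strength) * content_feats[:, mask] + strength * fused[:, mask]
    return out, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.normal(size=(6, 8))  # 6 content patches, 8 channels
    style = rng.normal(size=(5, 8))    # 5 style patches, 8 channels
    out, mask = style_guided_attention(content, style, ratio=0.5, strength=0.7)
    print("selected channels:", np.flatnonzero(mask))
```

Channels outside the mask pass through unchanged, which is the sense in which the injection is selective. Setting `strength` to 0 returns the content features as-is, loosely mirroring the adjustable style intensity the abstract attributes to the SGC mechanism.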


URL

https://arxiv.org/abs/2509.13301

PDF

https://arxiv.org/pdf/2509.13301.pdf

