Paper Reading AI Learner

ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies

2025-06-17 08:50:05
Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma

Abstract

Automatic creation of 3D scenes for immersive VR presence has been a significant research focus for decades. However, existing methods often rely on either high-poly mesh modeling with post-hoc simplification or massive 3D Gaussians, resulting in complex pipelines or limited visual realism. In this paper, we demonstrate that such exhaustive modeling is unnecessary for achieving a compelling immersive experience. We introduce ImmerseGen, a novel agent-guided framework for compact and photorealistic world modeling. ImmerseGen represents scenes as hierarchical compositions of lightweight geometric proxies, i.e., simplified terrain and billboard meshes, and generates photorealistic appearance by synthesizing RGBA textures onto these proxies. Specifically, we propose terrain-conditioned texturing for user-centric base world synthesis, and RGBA asset texturing for midground and foreground elements. This reformulation offers several advantages: (i) it simplifies modeling by enabling agents to guide generative models in producing coherent textures that integrate seamlessly with the scene; (ii) it bypasses complex geometry creation and decimation by directly synthesizing photorealistic textures on proxies, preserving visual quality without degradation; (iii) it enables compact representations suitable for real-time rendering on mobile VR headsets. To automate scene creation from text prompts, we introduce VLM-based modeling agents enhanced with semantic grid-based analysis for improved spatial reasoning and accurate asset placement. ImmerseGen further enriches scenes with dynamic effects and ambient audio to support multisensory immersion. Experiments on scene generation and live VR showcases demonstrate that ImmerseGen achieves superior photorealism, spatial coherence and rendering efficiency compared to prior methods. Project webpage: see the arXiv page linked below.
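The representation the abstract describes, a base terrain proxy plus alpha-textured billboard assets, placed via semantic grid-based reasoning, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual data structures or API: every class name, field, and the `place_asset` helper are assumptions made for clarity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class BillboardAsset:
    """A lightweight proxy: a flat quad carrying a synthesized RGBA texture."""
    name: str
    position: Tuple[float, float]   # (x, z) footprint on the terrain plane
    size: Tuple[float, float]       # (width, height) of the billboard quad
    rgba_texture: str               # id/path of the alpha-textured image

@dataclass
class Scene:
    """Hierarchical composition: one textured terrain proxy plus billboards."""
    terrain_texture: str
    assets: List[BillboardAsset] = field(default_factory=list)

def place_asset(scene: Scene,
                asset: BillboardAsset,
                grid: Dict[Tuple[int, int], str],
                cell_size: float = 1.0) -> bool:
    """Stand-in for the agent's semantic grid check before placement.

    `grid` maps integer (i, j) cells to semantic labels; placement is
    rejected if the asset's footprint cell is unsuitable (e.g. 'water')
    or already 'occupied'. On success the cell is marked occupied and
    the asset joins the scene's billboard layer.
    """
    i = int(asset.position[0] // cell_size)
    j = int(asset.position[1] // cell_size)
    if grid.get((i, j), "free") in ("water", "occupied"):
        return False
    grid[(i, j)] = "occupied"
    scene.assets.append(asset)
    return True
```

Under this framing, the heavy lifting moves from geometry to texture synthesis: the agent only has to decide *where* proxies go and *what* RGBA texture each one receives, which is what keeps the representation compact enough for mobile VR rendering.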

URL

https://arxiv.org/abs/2506.14315

PDF

https://arxiv.org/pdf/2506.14315.pdf
