EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence

2025-06-12 11:43:50
Wang Xinjie, Liu Liu, Cao Yu, Wu Ruiqi, Qin Wenkang, Wang Dehui, Sui Wei, Su Zhizhong

Abstract

Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low-cost accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets that are manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable, low-cost generation of high-quality, controllable, and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robot Description Format (URDF). These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation, and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the generalization and evaluation challenges of embodied intelligence research. Code is available at this https URL.

URL

https://arxiv.org/abs/2506.10600

PDF

https://arxiv.org/pdf/2506.10600.pdf

