SceneGenAgent: Precise Industrial Scene Generation with Coding Agent

2024-10-29 10:01:40
Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong

Abstract

Modeling industrial scenes is essential for simulation in industrial manufacturing. Large language models (LLMs) have made significant progress in generating general 3D scenes from textual descriptions, but generating industrial scenes with LLMs poses a unique challenge: such scenes demand precise measurements and positioning, which requires complex planning of the spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent that generates industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experimental results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching a success rate of up to 81.0% on real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs for integration into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and data are publicly available.
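
The abstract names three ingredients: a structured, calculable layout format, layout verification, and iterative refinement. The Python sketch below is one possible illustration of that loop, not the authors' implementation. Everything in it is an assumption made for illustration: the `Box` footprint format, the `overlaps`, `verify`, and `refine` helpers, and the rule of sliding a colliding object along +x (in SceneGenAgent the refinement is performed by an LLM, and the generated scene code is C#). Object names are assumed to be unique.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Hypothetical layout entry: axis-aligned floor footprint, in metres."""
    name: str
    x: float      # centre along x
    y: float      # centre along y
    width: float  # extent along x
    depth: float  # extent along y


def overlaps(a: Box, b: Box) -> bool:
    """True if the two footprints intersect with positive area."""
    return (abs(a.x - b.x) < (a.width + b.width) / 2
            and abs(a.y - b.y) < (a.depth + b.depth) / 2)


def verify(layout: list[Box]) -> list[tuple[str, str]]:
    """Layout verification: return every pair of colliding object names."""
    return [(a.name, b.name)
            for i, a in enumerate(layout)
            for b in layout[i + 1:]
            if overlaps(a, b)]


def refine(layout: list[Box], max_rounds: int = 10) -> list[Box]:
    """Iterative refinement with a simple stand-in rule (assumption):
    slide the second object of the first conflict along +x until its
    edge touches the first object, then verify again."""
    layout = list(layout)
    for _ in range(max_rounds):
        conflicts = verify(layout)
        if not conflicts:
            break
        first, second = conflicts[0]
        a = next(box for box in layout if box.name == first)
        idx = next(i for i, box in enumerate(layout) if box.name == second)
        b = layout[idx]
        layout[idx] = replace(b, x=a.x + (a.width + b.width) / 2)
    return layout


if __name__ == "__main__":
    scene = [Box("robot", 0.0, 0.0, 2.0, 2.0),
             Box("conveyor", 1.0, 0.0, 4.0, 1.0)]
    print(verify(scene))          # conflicts before refinement
    print(verify(refine(scene)))  # conflicts after refinement
```

Keeping positions and sizes as plain numbers is what makes the layout "calculable": the collision test reduces to arithmetic on the fields, so a failed check can be reported back as a concrete pair of objects to fix.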

URL

https://arxiv.org/abs/2410.21909

PDF

https://arxiv.org/pdf/2410.21909.pdf

