
CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design

2025-04-28 04:35:04
Weitao Feng, Hang Zhou, Jing Liao, Li Cheng, Wenbo Zhou

Abstract

We present a novel approach for indoor scene synthesis that learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach uses cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise present in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms state-of-the-art methods, enhancing the realism of generated scenes and providing a promising direction for 3D scene synthesis.
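The contrast between bounding boxes and cuboid decompositions, and the collision filter used for rejection sampling, can be illustrated with a small sketch. The abstract does not specify the collision test or the data layout, so everything here is an assumption made for illustration: cuboids are treated as axis-aligned, and the names `Cuboid`, `overlap_volume`, `bounding_box`, `scene_collides`, and `rejection_filter` are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Cuboid:
    """Axis-aligned cuboid given by its min and max corners (assumed layout)."""
    lo: Vec3
    hi: Vec3


# A scene maps an object name to the cuboids that approximate its shape.
Scene = Dict[str, List[Cuboid]]


def overlap_volume(a: Cuboid, b: Cuboid) -> float:
    """Volume of the intersection of two axis-aligned cuboids (0.0 if disjoint)."""
    volume = 1.0
    for k in range(3):
        extent = min(a.hi[k], b.hi[k]) - max(a.lo[k], b.lo[k])
        if extent <= 0.0:
            return 0.0
        volume *= extent
    return volume


def bounding_box(cuboids: List[Cuboid]) -> Cuboid:
    """Smallest single axis-aligned box enclosing all cuboids of one object."""
    lo = tuple(min(c.lo[k] for c in cuboids) for k in range(3))
    hi = tuple(max(c.hi[k] for c in cuboids) for k in range(3))
    return Cuboid(lo, hi)


def scene_collides(scene: Scene, tol: float = 1e-6) -> bool:
    """True if any cuboid of one object overlaps any cuboid of another object."""
    for (_, cubs_a), (_, cubs_b) in combinations(scene.items(), 2):
        if any(overlap_volume(a, b) > tol for a in cubs_a for b in cubs_b):
            return True
    return False


def rejection_filter(scenes: List[Scene]) -> List[Scene]:
    """Keep only sampled scenes without inter-object collisions."""
    return [s for s in scenes if not scene_collides(s)]


# An L-shaped sofa built from two cuboids, with a table placed in its notch.
sofa = [Cuboid((0, 0, 0), (3, 1, 1)), Cuboid((0, 1, 0), (1, 3, 1))]
table = [Cuboid((1.5, 1.5, 0), (2.5, 2.5, 0.5))]
scene = {"sofa": sofa, "table": table}

boxes_overlap = overlap_volume(bounding_box(sofa), bounding_box(table)) > 0
cuboids_overlap = scene_collides(scene)
```

In this example `boxes_overlap` is true while `cuboids_overlap` is false: the sofa's single bounding box covers the notch where the table stands, whereas the two-cuboid decomposition leaves it free. A filter like `rejection_filter` would then be applied to scenes sampled from the model, keeping only the collision-free ones for further fine-tuning.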

URL

https://arxiv.org/abs/2504.19478

PDF

https://arxiv.org/pdf/2504.19478.pdf

