Abstract
Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to common sense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Due to the lack of a scene graph dataset offering high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT, where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Code and the dataset will be released upon acceptance.
URL
https://arxiv.org/abs/2305.16283