Abstract
Compositional 3D scene synthesis has diverse applications across industries such as robotics, film, and video games, as it closely mirrors the complexity of real-world multi-object environments. Early works typically employ shape-retrieval-based frameworks, which naturally suffer from limited shape diversity. Recent progress in shape generation with powerful generative models, such as diffusion models, has improved shape fidelity. However, these approaches treat 3D shape generation and layout generation separately, and the synthesized scenes are often hampered by layout collisions, indicating that scene-level fidelity remains under-explored. In this paper, we aim to generate realistic and plausible 3D scenes from scene graphs. To enrich the representational capability of the input scene graphs, a large language model is utilized to explicitly aggregate global graph features with local relationship features. With a unified graph convolutional network (GCN), graph features are extracted from scene graphs updated via a joint layout-shape distribution. During scene generation, an IoU-based regularization loss is introduced to constrain the predicted 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.
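To make the collision-penalty idea concrete, below is a minimal sketch of what an IoU-based layout regularizer could look like, assuming objects are parameterized as axis-aligned 3D boxes given by centers and sizes. This is an illustration only; the paper's exact loss formulation, box parameterization, and any rotation handling may differ.

```python
import numpy as np

def pairwise_iou_3d(centers, sizes):
    """Pairwise IoU of axis-aligned 3D boxes.

    centers, sizes: (N, 3) arrays of box centers and edge lengths.
    Returns an (N, N) matrix of volume IoUs.
    """
    lo = centers - sizes / 2.0  # (N, 3) min corners
    hi = centers + sizes / 2.0  # (N, 3) max corners
    # Per-pair intersection extents along each axis, clipped at 0: (N, N, 3)
    inter = np.clip(
        np.minimum(hi[:, None], hi[None]) - np.maximum(lo[:, None], lo[None]),
        0.0, None,
    )
    inter_vol = inter.prod(axis=-1)           # (N, N) intersection volumes
    vol = sizes.prod(axis=-1)                 # (N,) box volumes
    union = vol[:, None] + vol[None] - inter_vol
    return inter_vol / np.maximum(union, 1e-8)

def iou_collision_loss(centers, sizes):
    """Mean pairwise IoU over distinct objects; 0 when no boxes overlap."""
    n = len(centers)
    if n < 2:
        return 0.0
    iou = pairwise_iou_3d(centers, sizes)
    return float(iou[~np.eye(n, dtype=bool)].mean())
```

Minimizing this term alongside the generation objective pushes predicted boxes apart, directly penalizing the layout collisions the abstract identifies as the main scene-level failure mode.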
URL
https://arxiv.org/abs/2403.12848