Abstract
We propose a computational framework to jointly parse a single RGB image and reconstruct a holistic 3D configuration composed of a set of CAD models using a stochastic grammar model. Specifically, we introduce a Holistic Scene Grammar (HSG) to represent the 3D scene structure, which characterizes a joint distribution over the functional and geometric space of indoor scenes. The proposed HSG captures three essential and often latent dimensions of indoor scenes: i) latent human context, describing the affordance and functionality of a room arrangement, ii) geometric constraints over the scene configurations, and iii) physical constraints that guarantee physically plausible parsing and reconstruction. We solve this joint parsing and reconstruction problem in an analysis-by-synthesis fashion, seeking to minimize the differences between the input image and the images rendered from our 3D representation, measured over depth, surface normal, and object segmentation maps. The optimal configuration, represented by a parse graph, is inferred using Markov chain Monte Carlo (MCMC), which efficiently traverses the non-differentiable solution space, jointly optimizing object localization, 3D layout, and hidden human context. Experimental results demonstrate that the proposed algorithm improves generalization and significantly outperforms prior methods on 3D layout estimation, 3D object detection, and holistic scene understanding.
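The analysis-by-synthesis loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy one-dimensional "scene" holding a single box in front of a back wall, a hypothetical `render` function that produces a 1D depth map only (no surface normals or segmentation), and a plain Metropolis-Hastings sampler with Gaussian random-walk proposals. All names and parameter values here are illustrative assumptions.

```python
import math
import random

WALL_DEPTH = 5.0  # assumed depth of the back wall in the toy scene


def render(config, width=32):
    """Toy 'renderer': 1D depth map of one box in front of a back wall."""
    start, size, depth = config
    image = [WALL_DEPTH] * width
    lo = max(0, int(round(start)))
    hi = min(width, int(round(start + size)))
    for i in range(lo, hi):
        image[i] = depth
    return image


def energy(config, observed):
    """L1 difference between rendered and observed depth maps.

    Invalid configurations (box too thin, or depth outside the room)
    get infinite energy, so they are always rejected.
    """
    _, size, depth = config
    if size < 1.0 or not (0.5 <= depth <= WALL_DEPTH):
        return math.inf
    rendered = render(config, len(observed))
    return sum(abs(r - o) for r, o in zip(rendered, observed))


def propose(config, rng):
    """Symmetric Gaussian random-walk proposal over (start, size, depth)."""
    start, size, depth = config
    return (start + rng.gauss(0, 1.0),
            size + rng.gauss(0, 1.0),
            depth + rng.gauss(0, 0.2))


def mcmc(observed, init, steps=5000, temperature=0.5, seed=0):
    """Metropolis-Hastings search; returns the lowest-energy state visited."""
    rng = random.Random(seed)
    current, current_e = init, energy(init, observed)
    best, best_e = current, current_e
    for _ in range(steps):
        candidate = propose(current, rng)
        delta = energy(candidate, observed) - current_e
        # Always accept improvements; accept worse states with
        # probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = candidate, current_e + delta
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


observed = render((10.0, 6.0, 2.0))   # synthetic "input image"
init = (0.0, 3.0, 4.0)                # deliberately poor initial guess
best, best_e = mcmc(observed, init)
print("initial energy:", energy(init, observed))
print("best config:", best, "energy:", best_e)
```

Because the renderer rounds box boundaries to pixel indices, the energy is piecewise constant in the box position and size, so gradient-based optimization has nothing to follow. That is the kind of non-differentiable search space where a sampling-based method such as MCMC remains applicable.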
URL
https://arxiv.org/abs/1808.02201