Abstract
While there has been significant progress in solving the problems of image pixel labeling, object detection and scene classification, existing approaches normally address them separately. In this paper, we propose to tackle these problems from a bottom-up perspective, where we simply need a semantic segmentation of the scene as input. We employ the DeepLab architecture, based on the ResNet deep network, which leverages multi-scale inputs to later fuse their responses to perform a precise pixel labeling of the scene. This semantic segmentation mask is used to localize the objects and to recognize the scene, following two simple yet effective strategies. We evaluate the benefits of our solutions, performing a thorough experimental evaluation on the NYU Depth V2 dataset. Our approach achieves a performance that beats the leading results by a significant margin, defining the new state of the art in this benchmark for the three tasks comprising the scene understanding: semantic segmentation, object detection and scene categorization.
Abstract (translated)
虽然在解决图像像素标记,对象检测和场景分类的问题方面已经取得了重大进展,但是现有方法通常分别解决它们。在本文中,我们建议从自下而上的角度解决这些问题,我们只需要将场景的语义分段作为输入。我们采用基于ResNet深度网络的DeepLab架构,该架构利用多尺度输入,以便稍后融合其响应,以执行场景的精确像素标记。该语义分割掩码用于定位对象并识别场景,遵循两个简单但有效的策略。我们评估了我们解决方案的优势,对NYU Depth V2数据集进行了全面的实验评估。我们的方法实现了以显着优势击败领先结果的性能,在此基准中定义了包括场景理解的三个任务的新技术水平:语义分割,对象检测和场景分类。
URL
https://arxiv.org/abs/1807.07284