Paper Reading AI Learner

In pixels we trust: From Pixel Labeling to Object Localization and Scene Categorization

2018-07-19 08:28:32
Carlos Herranz-Perdiguero, Carolina Redondo-Cabrera, Roberto J. López-Sastre

Abstract

While there has been significant progress in solving the problems of image pixel labeling, object detection and scene classification, existing approaches normally address them separately. In this paper, we propose to tackle these problems from a bottom-up perspective, where we simply need a semantic segmentation of the scene as input. We employ the DeepLab architecture, based on the ResNet deep network, which leverages multi-scale inputs to later fuse their responses to perform a precise pixel labeling of the scene. This semantic segmentation mask is used to localize the objects and to recognize the scene, following two simple yet effective strategies. We evaluate the benefits of our solutions, performing a thorough experimental evaluation on the NYU Depth V2 dataset. Our approach achieves a performance that beats the leading results by a significant margin, defining the new state of the art in this benchmark for the three tasks comprising the scene understanding: semantic segmentation, object detection and scene categorization.

Abstract (translated)

虽然在解决图像像素标记,对象检测和场景分类的问题方面已经取得了重大进展,但是现有方法通常分别解决它们。在本文中,我们建议从自下而上的角度解决这些问题,我们只需要将场景的语义分段作为输入。我们采用基于ResNet深度网络的DeepLab架构,该架构利用多尺度输入,以便稍后融合其响应,以执行场景的精确像素标记。该语义分割掩码用于定位对象并识别场景,遵循两个简单但有效的策略。我们评估了我们解决方案的优势,对NYU Depth V2数据集进行了全面的实验评估。我们的方法实现了以显着优势击败领先结果的性能,在此基准中定义了包括场景理解的三个任务的新技术水平:语义分割,对象检测和场景分类。

URL

https://arxiv.org/abs/1807.07284

PDF

https://arxiv.org/pdf/1807.07284.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot