Scene_Parsing

Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition

2020-06-20 10:19:29

Ionut Cosmin Duta, Li Liu, Fan Zhu, Ling Shao

arXiv_CV

arXiv_CV Segmentation CNN Recognition Detection Object_Detection Classification Image_Classification Action Scene_Parsing
Abstract

This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying size and depth, which are able to capture different levels of detail in the scene. On top of these improved recognition capabilities, PyConv is also efficient and, with our formulation, it does not increase the computational cost and parameters compared to standard convolution. Moreover, it is very flexible and extensible, providing a large space of potential network architectures for different applications. PyConv has the potential to impact nearly every computer vision task and, in this work, we present different architectures based on PyConv for four main tasks in visual recognition: image classification, video action classification/recognition, object detection and semantic image segmentation/parsing. Our approach shows significant improvements over all these core tasks in comparison with the baselines. For instance, on image recognition, our 50-layer network outperforms its 152-layer ResNet baseline counterpart in recognition performance on the ImageNet dataset, while having 2.39 times fewer parameters, 2.52 times lower computational complexity and more than 3 times fewer layers. On image segmentation, our novel framework sets a new state of the art on the challenging ADE20K benchmark for scene parsing. Code is available at: this https URL
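The claim that a pyramid of larger kernels need not cost more parameters than a standard convolution can be checked with simple arithmetic, provided the larger kernels use grouped convolution. The sketch below is only an illustration: the 64 input channels, the kernel sizes 3/5/7/9, the 16 output channels per level and the group counts 1/4/8/16 are assumed values chosen for the example, not figures stated in the abstract.

```python
def conv_params(kernel, in_ch, out_ch, groups=1):
    """Weight count of a 2D convolution with square kernels (bias ignored)."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    # Each output channel only sees in_ch // groups input channels.
    return kernel * kernel * (in_ch // groups) * out_ch


def pyconv_params(in_ch, levels):
    """Total weights of a kernel pyramid; levels = [(kernel, out_ch, groups), ...]."""
    return sum(conv_params(k, in_ch, out_ch, g) for k, out_ch, g in levels)


# Assumed configuration (hypothetical, for illustration only).
IN_CH = 64
LEVELS = [(3, 16, 1), (5, 16, 4), (7, 16, 8), (9, 16, 16)]

standard = conv_params(3, IN_CH, 64)    # plain 3x3 convolution, 64 -> 64 channels
pyramid = pyconv_params(IN_CH, LEVELS)  # four levels, 16 output channels each

print(standard, pyramid)  # 36864 27072
```

Under these assumptions the pyramid produces the same 64 output channels with fewer weights, because the group count grows together with the kernel size.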

URL

https://arxiv.org/abs/2006.11538

PDF

https://arxiv.org/pdf/2006.11538.pdf
CascadePSP: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement

2020-05-06 01:38:03

Ho Kei Cheng (HKUST), Jihoon Chung (HKUST), Yu-Wing Tai (Tencent), Chi-Keung Tang (HKUST)

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Quantitative Pose Scene_Parsing
Abstract

State-of-the-art semantic segmentation methods were almost exclusively trained on images within a fixed resolution range. These segmentations are inaccurate for very high-resolution images since using bicubic upsampling of low-resolution segmentation does not adequately capture high-resolution details along object boundaries. In this paper, we propose a novel approach to address the high-resolution segmentation problem without using any high-resolution training data. The key insight is our CascadePSP network which refines and corrects local boundaries whenever possible. Although our network is trained with low-resolution segmentation data, our method is applicable to any resolution even for very high-resolution images larger than 4K. We present quantitative and qualitative studies on different datasets to show that CascadePSP can reveal pixel-accurate segmentation boundaries using our novel refinement module without any finetuning. Thus, our method can be regarded as class-agnostic. Finally, we demonstrate the application of our model to scene parsing in multi-class segmentation.

URL

https://arxiv.org/abs/2005.02551

PDF

https://arxiv.org/pdf/2005.02551.pdf
Strip Pooling: Rethinking Spatial Pooling for Scene Parsing

2020-03-30 10:40:11

Qibin Hou, Li Zhang, Ming-Ming Cheng, Jiashi Feng

arXiv_CV

arXiv_CV Prediction Pose Scene_Parsing
Abstract

Spatial pooling has been proven highly effective in capturing long-range contextual information for pixel-wise prediction tasks, such as scene parsing. In this paper, beyond conventional spatial pooling that usually has a regular shape of NxN, we rethink the formulation of spatial pooling by introducing a new pooling strategy, called strip pooling, which considers a long but narrow kernel, i.e., 1xN or Nx1. Based on strip pooling, we further investigate spatial pooling architecture design by 1) introducing a new strip pooling module that enables backbone networks to efficiently model long-range dependencies, 2) presenting a novel building block with diverse spatial pooling as a core, and 3) systematically comparing the performance of the proposed strip pooling and conventional spatial pooling techniques. Both novel pooling-based designs are lightweight and can serve as an efficient plug-and-play module in existing scene parsing networks. Extensive experiments on popular benchmarks (e.g., ADE20K and Cityscapes) demonstrate that our simple approach establishes new state-of-the-art results. Code is made available at this https URL.
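As a rough illustration of the long, narrow kernels described above, the sketch below averages a 2D feature map along each row (a 1xN strip) and along each column (an Nx1 strip), then broadcasts both results back to the full grid. Fusing the two by plain addition is an assumption made for brevity; the learned layers and gating of an actual strip pooling module are omitted, and the function name `strip_pool` is hypothetical.

```python
def strip_pool(feature):
    """Average-pool a 2D grid with 1xN and Nx1 strips and fuse the results.

    feature: list of H rows, each a list of W numbers.
    Returns an HxW grid where cell (i, j) = mean(row i) + mean(column j).
    """
    h, w = len(feature), len(feature[0])
    # 1xN strip: one value per row, summarising the whole horizontal extent.
    row_means = [sum(row) / w for row in feature]
    # Nx1 strip: one value per column, summarising the whole vertical extent.
    col_means = [sum(feature[i][j] for i in range(h)) / h for j in range(w)]
    # Broadcast both strips back to HxW and add them (assumed fusion rule).
    return [[row_means[i] + col_means[j] for j in range(w)] for i in range(h)]


feature = [[1, 2, 3],
           [4, 5, 6]]
pooled = strip_pool(feature)
print(pooled[0][0], pooled[1][2])  # 4.5 9.5
```

Each output cell therefore depends on every position in its row and column, which is how a narrow kernel can reach distant context without covering a full NxN window.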

URL

https://arxiv.org/abs/2003.13328

PDF

https://arxiv.org/pdf/2003.13328.pdf
EPSNet: Efficient Panoptic Segmentation Network with Cross-layer Attention Fusion

2020-03-23 09:11:44

Chia-Yuan Chang, Shuo-En Chang, Pei-Yung Hsiao, Li-Chen Fu

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Attention Inference Pose Scene_Parsing
Abstract

Panoptic segmentation is a scene parsing task that unifies semantic segmentation and instance segmentation into a single task. However, current state-of-the-art studies have paid little attention to inference time. In this work, we propose an Efficient Panoptic Segmentation Network (EPSNet) to tackle panoptic segmentation with fast inference speed. EPSNet generates masks as a simple linear combination of prototype masks and mask coefficients. The lightweight network branches for instance segmentation and semantic segmentation only need to predict mask coefficients, and they produce masks with the shared prototypes predicted by the prototype network branch. Furthermore, to enhance the quality of the shared prototypes, we adopt a "cross-layer attention fusion module", which aggregates multi-scale features with an attention mechanism that helps them capture long-range dependencies between each other. To validate the proposed work, we have conducted various experiments on the challenging COCO panoptic dataset, which achieve highly promising performance with significantly faster inference speed (53ms on GPU).
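The mask assembly step mentioned above, a linear combination of shared prototype masks weighted by per-mask coefficients, can be sketched in a few lines. The function name `combine_prototypes` and the toy 2x2 prototypes are hypothetical; any activation or thresholding applied afterwards is not specified in the abstract and is left out.

```python
def combine_prototypes(prototypes, coefficients):
    """Return sum_k coefficients[k] * prototypes[k] as an HxW grid.

    prototypes: list of K grids (each H rows of W numbers), shared by all masks.
    coefficients: list of K numbers predicted for one instance or class.
    """
    assert len(prototypes) == len(coefficients)
    h, w = len(prototypes[0]), len(prototypes[0][0])
    return [
        [sum(c * p[i][j] for c, p in zip(coefficients, prototypes)) for j in range(w)]
        for i in range(h)
    ]


# Two toy prototypes: one responds on the diagonal, one off the diagonal.
prototypes = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
mask_logits = combine_prototypes(prototypes, [2, -1])
print(mask_logits)  # [[2, -1], [-1, 2]]
```

Because the prototypes are computed once per image, each additional mask costs only K coefficients and one weighted sum, which is consistent with the lightweight branches the abstract describes.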

URL

https://arxiv.org/abs/2003.10142

PDF

https://arxiv.org/pdf/2003.10142.pdf
Semantic Flow for Fast and Accurate Scene Parsing

2020-02-24 08:53:18

Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Yunhai Tong

arXiv_RO

arXiv_RO Pose Scene_Parsing Optical_Flow
Abstract

In this paper, we focus on effective methods for fast and accurate scene parsing. A common practice to improve the performance is to attain high-resolution feature maps with strong semantic representation. Two widely used strategies, atrous convolutions and feature pyramid fusion, are either computation intensive or ineffective. Inspired by Optical Flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels and broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our module into a common feature pyramid structure exhibits superior performance over other real-time methods even on very lightweight backbone networks, such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid. In particular, our network is the first to achieve 80.4% mIoU on Cityscapes with a frame rate of 26 FPS. The code will be available at this https URL.

URL

https://arxiv.org/abs/2002.10120

PDF

https://arxiv.org/pdf/2002.10120.pdf
Real-Time Panoptic Segmentation from Dense Detections

2019-12-03 05:50:02

Rui Hou, Jie Li, Arjun Bhargava, Allan Raventos, Vitor Guizilini, Chao Fang, Jerome Lynch, Adrien Gaidon

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Detection Object_Detection Attention Inference Pose Scene_Parsing
Abstract

Panoptic segmentation is a complex full scene parsing task requiring simultaneous instance and semantic segmentation at high resolution. Current state-of-the-art approaches cannot run in real-time, and simplifying these architectures to improve efficiency severely degrades their accuracy. In this paper, we propose a new single-shot panoptic segmentation network that leverages dense detections and a global self-attention mechanism to operate in real-time with performance approaching the state of the art. We introduce a novel parameter-free mask construction method that substantially reduces computational complexity by efficiently reusing information from the object detection and semantic segmentation sub-tasks. The resulting network has a simple data flow that does not require feature map re-sampling or clustering post-processing, enabling significant hardware acceleration. Our experiments on the Cityscapes and COCO benchmarks show that our network works at 30 FPS on 1024x2048 resolution, trading a 3% relative performance degradation from the current state of the art for up to 440% faster inference.

URL

https://arxiv.org/abs/1912.01202

PDF

https://arxiv.org/pdf/1912.01202.pdf
Learning Generalizable Representations via Diverse Supervision

2019-11-29 00:56:06

Ziqi Pang, Zhiyuan Hu, Pavel Tokmakov, Yu-Xiong Wang, Martial Hebert

arXiv_CV

arXiv_CV Recognition Classification Attention Represenation_Learning Few-Shot Scene_Parsing
Abstract

The problem of rare category recognition has received a lot of attention recently, with state-of-the-art methods achieving significant improvements. However, we identify two major limitations in the existing literature. First, the benchmarks are constructed by randomly splitting the categories of artificially balanced datasets into frequent (head), and rare (tail) subsets, which results in unrealistic category distributions in both of them. Second, the idea of using external sources of supervision to learn generalizable representations is largely overlooked. In this work, we attempt to address both of these shortcomings by introducing the ADE-FewShot benchmark. It stands upon the ADE dataset for scene parsing that features a realistic, long-tail distribution of categories as well as a diverse set of annotations. We turn it into a realistic few-shot classification benchmark by splitting the object categories into head and tail based on their distribution in the world. We then analyze the effect of applying various supervision sources on representation learning for rare category recognition, and observe significant improvements.

URL

https://arxiv.org/abs/1911.12911

PDF

https://arxiv.org/pdf/1911.12911.pdf
Differentiating Features for Scene Segmentation Based on Dedicated Attention Mechanisms

2019-11-19 08:17:59

Zhiqiang Xiong, Zhicheng Wang, Zhaohui Yu, Xi Gu

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Attention Prediction Transformer Pose Scene_Parsing
Abstract

Semantic segmentation is a challenge in scene parsing. It requires both context information and rich spatial information. In this paper, we differentiate features for scene segmentation based on dedicated attention mechanisms (DF-DAM), and two attention modules are proposed to optimize the high-level and low-level features in the encoder, respectively. Specifically, we use the high-level and low-level features of ResNet as the source of context information and spatial information, respectively, and optimize them with an attention fusion module and a 2D position attention module, respectively. For the attention fusion module, we adopt dual channel weights to selectively adjust the channel maps of the two highest-stage features of ResNet, and fuse them to get context information. For the 2D position attention module, we use the context information obtained by the attention fusion module to assist the selection of the lowest-stage features of ResNet as supplementary spatial information. Finally, the two sets of information obtained by the two modules are simply fused to obtain the prediction. We evaluate our approach on the Cityscapes and PASCAL VOC 2012 datasets. In particular, our architecture contains no complicated or redundant processing modules, which greatly reduces its complexity, and we achieve 82.3% mean IoU on the PASCAL VOC 2012 test set without pre-training on the MS-COCO dataset.

URL

https://arxiv.org/abs/1911.08149

PDF

https://arxiv.org/pdf/1911.08149.pdf
Adaptive Context Network for Scene Parsing

2019-11-05 08:16:28

Jun Fu, Jing Liu, Yuhang Wang, Yong Li, Yongjun Bao, Jinhui Tang, Hanqing Lu

arXiv_CV

arXiv_CV CNN Pose Scene_Parsing
Abstract

Recent works attempt to improve scene parsing performance by exploring different levels of context, and typically train a well-designed convolutional network to exploit useful context across all pixels equally. However, in this paper, we find that context demands vary across pixels and regions in each image. Based on this observation, we propose an Adaptive Context Network (ACNet) to capture pixel-aware context by a competitive fusion of global context and local context according to different per-pixel demands. Specifically, for a given pixel, the global context demand is measured by the similarity between the global feature and its local feature, whose reverse value can be used to measure the local context demand. We model the two demand measurements by the proposed global context module and local context module, respectively, to generate adaptive contextual features. Furthermore, we use multiple such modules to build several adaptive context blocks at different levels of the network to obtain a coarse-to-fine result. Finally, comprehensive experimental evaluations demonstrate the effectiveness of the proposed ACNet, and new state-of-the-art performance is achieved on all four public datasets, i.e. Cityscapes, ADE20K, PASCAL Context, and COCO Stuff.

URL

https://arxiv.org/abs/1911.01664

PDF

https://arxiv.org/pdf/1911.01664.pdf
Segment for Restoration, Restore for Segmentation

2019-11-02 08:39:52

Weihao Xia, Yujiu Yang

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Pose Scene_Parsing Restoration Enhancement
Abstract

Most state-of-the-art semantic segmentation or scene parsing approaches only achieve high accuracy rates in optimal weather conditions. The performance decreases enormously if images with unknown disturbances occur, which is less discussed but appears more often in real applications. Most existing research works cast the handling of challenging adverse conditions as a post-processing step of signal restoration or enhancement after sensing, then feed the restored images to visual understanding. However, the performance will largely depend on the quality of restoration or enhancement. Whether restoration-based approaches would actually boost the semantic segmentation performance remains questionable. In this paper, we propose a novel framework to tackle semantic segmentation and image restoration under adverse environmental conditions, named SR-Restore. The proposed approach contains two components: Semantically-Guided Adaptation, which exploits and leverages semantic information from degraded images and then helps to refine the segmentation; and Exemplar-Guided Synthesis, which synthesizes restored or enhanced images from semantic label maps given specific degraded exemplars. SR-Restore explores the possibility of building connections between low-level image processing and high-level computer vision tasks, achieving image restoration via segmentation refinement. Extensive experiments on several datasets demonstrate that our approach not only improves the accuracy of high-level vision tasks with image adaptation, but also boosts the perceptual quality and structural similarity of degraded images with image semantic guidance.

URL

https://arxiv.org/abs/1911.00679

PDF

https://arxiv.org/pdf/1911.00679.pdf
Segmenting Ships in Satellite Imagery With Squeeze and Excitation U-Net

2019-10-27 08:28:51

Venkatesh R, Anand Metha

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Detection Pose Scene_Parsing
Abstract

The ship-detection task in satellite imagery presents significant obstacles to even the most advanced state-of-the-art segmentation models, due to the lack of labelled datasets and to approaches that are not able to generalize to unseen images. The most common methods for semantic segmentation involve complex two-stage networks or networks that make use of a multi-scale scene parsing module. In this paper, we propose a modified version of the popular U-Net architecture called Squeeze and Excitation U-Net and train it with a loss that helps in directly optimizing the intersection over union (IoU) score. Our method gives comparable performance to other methods while having the additional benefit of being computationally efficient.
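The abstract does not say which loss is used to optimize IoU directly. One common differentiable surrogate is the soft Jaccard loss, sketched below over flattened probability maps; it is shown as an assumption for illustration and may differ from the loss in the paper. The name `soft_iou_loss` and the `eps` smoothing term are likewise hypothetical choices.

```python
def soft_iou_loss(pred, target, eps=1e-7):
    """Soft Jaccard loss: 1 - intersection / union on per-pixel probabilities.

    pred: predicted foreground probabilities in [0, 1], flattened.
    target: ground-truth labels (0 or 1), same length as pred.
    eps: small constant so an empty prediction and empty target give loss 0.
    """
    assert len(pred) == len(target)
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return 1.0 - (intersection + eps) / (union + eps)


perfect = soft_iou_loss([1, 0, 1, 0], [1, 0, 1, 0])
uncertain = soft_iou_loss([0.5, 0.5], [1, 0])
print(perfect, round(uncertain, 4))
```

With hard 0/1 predictions the expression reduces to one minus the usual IoU, while fractional probabilities keep it smooth enough to use as a training objective.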

URL

https://arxiv.org/abs/1910.12206

PDF

https://arxiv.org/pdf/1910.12206.pdf
Fully-Automatic Semantic Segmentation for Food Intake Tracking in Long-Term Care Homes

2019-10-24 15:50:20

Kaylen J Pfisterer, Robert Amelard, Audrey G Chung, Braeden Syrnyk, Alexander MacLean, Alexander Wong

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation CNN Classification Tracking Pose Scene_Parsing
Abstract

Malnutrition impacts quality of life and places annually-recurring burden on the health care system. Half of older adults are at risk for malnutrition in long-term care (LTC). Monitoring and measuring nutritional intake is paramount yet involves time-consuming and subjective visual assessment, limiting current methods' reliability. The opportunity for automatic image-based estimation exists. Some progress outside LTC has been made (e.g., calories consumed, food classification), however, these methods have not been implemented in LTC, potentially due to a lack of ability to independently evaluate automatic segmentation methods within the intake estimation pipeline. Here, we propose and evaluate a novel fully-automatic semantic segmentation method for pixel-level classification of food on a plate using a deep convolutional neural network (DCNN). The macroarchitecture of the DCNN is a multi-scale encoder-decoder food network (EDFN) architecture comprising a residual encoder microarchitecture, a pyramid scene parsing decoder microarchitecture, and a specialized per-pixel food/no-food classification layer. The network was trained and validated on the pre-labelled UNIMIB 2016 food dataset (1027 tray images, 73 categories), and tested on our novel LTC plate dataset (390 plate images, 9 categories). Our fully-automatic segmentation method attained similar intersection over union to the semi-automatic graph cuts (91.2% vs. 93.7%). Advantages of our proposed system include: testing on a novel dataset, decoupled error analysis, no user-initiated annotations, with similar segmentation accuracy and enhanced reliability in terms of types of segmentation errors. This may address several short-comings currently limiting utility of automated food intake tracking in time-constrained LTC and hospital settings.

URL

https://arxiv.org/abs/1910.11250

PDF

https://arxiv.org/pdf/1910.11250.pdf
Deep Multiphase Level Set for Scene Parsing

2019-10-08 01:58:24

Pingping Zhang, Wei Liu, Yinjie Lei, Chunhua Shen, Huchuan Lu

arXiv_CV

arXiv_CV Segmentation CNN Pose Scene_Parsing Contour
Abstract

Recently, Fully Convolutional Network (FCN) seems to be the go-to architecture for image segmentation, including semantic scene parsing. However, it is difficult for a generic FCN to discriminate pixels around the object boundaries, thus FCN based methods may output parsing results with inaccurate boundaries. Meanwhile, level set based active contours are superior to the boundary estimation due to the sub-pixel accuracy that they achieve. However, they are quite sensitive to initial settings. To address these limitations, in this paper we propose a novel Deep Multiphase Level Set (DMLS) method for semantic scene parsing, which efficiently incorporates multiphase level sets into deep neural networks. The proposed method consists of three modules, i.e., recurrent FCNs, adaptive multiphase level set, and deeply supervised learning. More specifically, recurrent FCNs learn multi-level representations of input images with different contexts. Adaptive multiphase level set drives the discriminative contour for each semantic class, which makes use of the advantages of both global and local information. In each time-step of the recurrent FCNs, deeply supervised learning is incorporated for model training. Extensive experiments on three public benchmarks have shown that our proposed method achieves new state-of-the-art performances.

URL

https://arxiv.org/abs/1910.03166

PDF

https://arxiv.org/pdf/1910.03166.pdf
Boosting Real-Time Driving Scene Parsing with Shared Semantics

2019-09-16 07:38:26

Zhenzhen Xiang, Anbo Bao, Jie Li, Jianbo Su

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Pose Autonomous Scene_Parsing
Abstract

Real-time scene parsing is a fundamental feature for autonomous driving vehicles with multiple cameras. Compared with traditional methods that process the frames from each camera individually, in this letter we demonstrate that sharing semantics between cameras with overlapping views can boost the parsing performance. Our framework is based on a deep neural network for semantic segmentation but with two kinds of additional modules for sharing and fusing semantics. On one hand, a semantics sharing module is designed to establish the pixel-wise mapping between the input image pair. Features as well as semantics are shared through this mapping to reduce duplicated workload, which leads to more efficient computation. On the other hand, feature fusion modules are designed to combine different modalities of semantic features, learning to leverage the information from both inputs for better results. To evaluate the effectiveness of the proposed framework, we collect a new dataset with a dual-camera vision system for driving scene parsing. Experimental results show that our network outperforms the baseline method in parsing accuracy with comparable computation.

URL

https://arxiv.org/abs/1909.07038

PDF

https://arxiv.org/pdf/1909.07038.pdf
Holistic++ Scene Understanding: Single-view 3D Holistic Scene Parsing and Human Pose Estimation with Human-Object Interaction and Physical Commonsense

2019-09-04 00:42:20

Yixin Chen, Siyuan Huang, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu

arXiv_CV

arXiv_CV Relation Pose_Estimation Pose Action 3D Scene_Parsing Reconstruction Agent
Abstract

We propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction, i.e., 3D estimations of object bounding boxes, camera pose, and room layout, and (ii) 3D human pose estimation. The intuition is to leverage the coupled nature of these two tasks to improve the granularity and performance of scene understanding. We propose to exploit two critical and essential connections between these two tasks: (i) human-object interaction (HOI) to model the fine-grained relations between agents and objects in the scene, and (ii) physical commonsense to model the physical plausibility of the reconstructed scene. The optimal configuration of the 3D scene, represented by a parse graph, is inferred using Markov chain Monte Carlo (MCMC), which efficiently traverses the non-differentiable joint solution space. Experimental results demonstrate that the proposed algorithm significantly improves the performance of the two tasks on three datasets, showing an improved generalization ability.

URL

https://arxiv.org/abs/1909.01507

PDF

https://arxiv.org/pdf/1909.01507.pdf
Consensus Feature Network for Scene Parsing

2019-07-29 13:22:30

Tianyi Wu, Sheng Tang, Rui Zhang, Guodong Guo, Yongdong Zhang

arXiv_CV

arXiv_CV Classification Prediction Pose Scene_Parsing
Abstract

Scene parsing is challenging as it aims to assign one of the semantic categories to each pixel in scene images. Thus, pixel-level features are desired for scene parsing. However, classification networks are dominated by the discriminative portion, so directly applying classification networks to scene parsing will result in inconsistent parsing predictions within one instance and among instances of the same category. To address this problem, we propose two transform units to learn pixel-level consensus features. One is an Instance Consensus Transform (ICT) unit to learn the instance-level consensus features by aggregating features within the same instance. The other is a Category Consensus Transform (CCT) unit to pursue category-level consensus features by keeping the consensus of features among instances of the same category in scene images. The proposed ICT and CCT units are lightweight, data-driven and end-to-end trainable. The features learned by the two units are more coherent at both the instance level and the category level. Furthermore, we present the Consensus Feature Network (CFNet) based on the proposed ICT and CCT units. Experiments on four scene parsing benchmarks, including Cityscapes, Pascal Context, CamVid, and COCO Stuff, show that the proposed CFNet learns pixel-level consensus features and obtains consistent parsing results.

URL

https://arxiv.org/abs/1907.12411

PDF

https://arxiv.org/pdf/1907.12411.pdf
Quadtree Generating Networks: Efficient Hierarchical Scene Parsing with Sparse Convolutions

2019-07-27 00:20:12

Kashyap Chitta, Jose M. Alvarez, Martial Hebert

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation CNN Inference Sparse Prediction Scene_Parsing
Abstract

Semantic segmentation with Convolutional Neural Networks is a memory-intensive task due to the high spatial resolution of feature maps and output predictions. In this paper, we present Quadtree Generating Networks (QGNs), a novel approach able to drastically reduce the memory footprint of modern semantic segmentation networks. The key idea is to use quadtrees to represent the predictions and target segmentation masks instead of dense pixel grids. Our quadtree representation enables hierarchical processing of an input image, with the most computationally demanding layers only being used at regions in the image containing boundaries between classes. In addition, given a trained model, our representation enables flexible inference schemes to trade off accuracy and computational cost, allowing the network to adapt in constrained situations such as embedded devices. We demonstrate the benefits of our approach on the Cityscapes, SUN-RGBD and ADE20k datasets. On Cityscapes, we obtain a relative 3% mIoU improvement compared to a dilated network with similar memory consumption, and incur only a 3% relative mIoU drop compared to a large dilated network, while reducing memory consumption by over 4×.
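To make the quadtree idea concrete, the sketch below converts a dense label mask into a quadtree by recursively splitting any block that contains more than one class. It is a generic quadtree construction under stated assumptions (a square mask whose side is a power of two, integer class labels as leaves), not the network or the target encoding from the paper; the names `build_quadtree` and `count_leaves` are hypothetical.

```python
def build_quadtree(mask, top=0, left=0, size=None):
    """Return a quadtree for the size x size block of mask at (top, left).

    A leaf is an integer class label; an internal node is a list of four
    children in the order top-left, top-right, bottom-left, bottom-right.
    Assumes mask is square with a power-of-two side length.
    """
    if size is None:
        size = len(mask)
    first = mask[top][left]
    uniform = all(
        mask[i][j] == first
        for i in range(top, top + size)
        for j in range(left, left + size)
    )
    if uniform or size == 1:
        return first  # the whole block has one class: store a single label
    half = size // 2
    return [
        build_quadtree(mask, top, left, half),
        build_quadtree(mask, top, left + half, half),
        build_quadtree(mask, top + half, left, half),
        build_quadtree(mask, top + half, left + half, half),
    ]


def count_leaves(node):
    """Number of labels stored in the tree."""
    return sum(count_leaves(c) for c in node) if isinstance(node, list) else 1


mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
tree = build_quadtree(mask)
print(tree, count_leaves(tree))  # [0, 1, 0, [0, 1, 1, 1]] 7
```

Only the block that contains a class boundary is subdivided, so the tree stores 7 labels where the dense grid stores 16; the saving grows with the size of the uniform regions.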

URL

https://arxiv.org/abs/1907.11821

PDF

https://arxiv.org/pdf/1907.11821.pdf
Context-Integrated and Feature-Refined Network for Lightweight Urban Scene Parsing

2019-07-26 10:50:30

Bin Jiang, Wenxuan Tu, Chao Yang, Junsong Yuan

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Attention Pose Scene_Parsing
Abstract

Semantic segmentation for lightweight urban scene parsing is a very challenging task, because accuracy and efficiency (e.g., execution speed, memory footprint, and computation complexity) must both be taken into account. However, most previous works pay too much attention to a one-sided perspective, either accuracy or speed, and ignore the others, which greatly limits how well they meet the actual demands of intelligent devices. To tackle this dilemma, we propose a new lightweight architecture named Context-Integrated and Feature-Refined Network (CIFReNet). The core components of our architecture are the Long-skip Refinement Module (LRM) and the Multi-scale Contexts Integration Module (MCIM). With low additional computation cost, LRM is designed to ease the propagation of spatial information and boost the quality of feature refinement. Meanwhile, MCIM consists of three cascaded Dense Semantic Pyramid (DSP) blocks with a global constraint. It makes full use of sub-regions close to the target and enlarges the field of view in an economical yet powerful way. Comprehensive experiments have demonstrated that our proposed method reaches a reasonable trade-off among overall properties on the Cityscapes and CamVid datasets. Specifically, with only 7.1 GFLOPs, CIFReNet, which contains fewer than 1.9 M parameters, obtains a competitive result of 70.9% mIoU on the Cityscapes test set and 64.5% on the CamVid test set at a real-time speed of 32.3 FPS, which is more cost-efficient than other state-of-the-art methods.

URL

https://arxiv.org/abs/1907.11474

PDF

https://arxiv.org/pdf/1907.11474.pdf
SlimYOLOv3: Narrower, Faster and Better for Real-Time UAV Applications

2019-07-25 14:22:43

Pengyi Zhang, Yunxin Zhong, Xiaoqiong Li

arXiv_CV

arXiv_CV CNN Detection Object_Detection Drone Regularization Pose Scene_Parsing
Abstract

Drones or general Unmanned Aerial Vehicles (UAVs), endowed with computer vision functions by on-board cameras and embedded systems, have become popular in a wide range of applications. However, real-time scene parsing through object detection running on a UAV platform is very challenging, due to the limited memory and computing power of embedded devices. To deal with these challenges, in this paper we propose to learn efficient deep object detectors through channel pruning of convolutional layers. To this end, we enforce channel-level sparsity of convolutional layers by imposing L1 regularization on channel scaling factors and prune less informative feature channels to obtain "slim" object detectors. Based on this approach, we present SlimYOLOv3 with fewer trainable parameters and floating point operations (FLOPs) in comparison with the original YOLOv3 (Joseph Redmon et al., 2018) as a promising solution for real-time object detection on UAVs. We evaluate SlimYOLOv3 on the VisDrone2018-Det benchmark dataset; compelling results are achieved by SlimYOLOv3 in comparison with its unpruned counterpart, including a ~90.8% decrease in FLOPs, a ~92.0% decline in parameter size, ~2 times faster running speed and detection accuracy comparable to YOLOv3. Experimental results with different pruning ratios consistently verify that the proposed SlimYOLOv3 with its narrower structure is more efficient, faster and better than YOLOv3, and thus more suitable for real-time object detection on UAVs. Our code is made publicly available at https://github.com/PengyiZhang/SlimYOLOv3.
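The two ingredients named above, an L1 penalty on per-channel scaling factors and pruning of the channels whose factors end up smallest, can be sketched as follows. This is a simplified illustration, not the released SlimYOLOv3 code: the function names, the penalty weight, and the rule of pruning a fixed fraction of channels by magnitude are assumptions.

```python
def l1_penalty(scales, weight):
    """Sparsity term added to the training loss: weight * sum(|scale|)."""
    return weight * sum(abs(s) for s in scales)


def channels_to_keep(scales, prune_ratio):
    """Indices of channels that survive pruning.

    The prune_ratio fraction of channels with the smallest |scale| is
    removed. Channels tied with the threshold value are also removed, so
    slightly more than prune_ratio may be pruned when values repeat.
    """
    ranked = sorted(abs(s) for s in scales)
    n_prune = int(len(ranked) * prune_ratio)
    if n_prune == 0:
        return list(range(len(scales)))
    threshold = ranked[n_prune - 1]
    return [i for i, s in enumerate(scales) if abs(s) > threshold]


# Hypothetical scaling factors after sparsity training.
scales = [0.9, 0.01, -0.5, 0.02, 0.3, -0.03]
penalty = l1_penalty(scales, weight=0.1)
kept = channels_to_keep(scales, prune_ratio=0.5)
print(kept)  # [0, 2, 4]
```

The L1 term pushes uninformative scaling factors toward zero during training, so that a magnitude threshold afterwards separates channels that can be dropped from those worth keeping.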

URL

https://arxiv.org/abs/1907.11093

PDF

https://arxiv.org/pdf/1907.11093.pdf
ELKPPNet: An Edge-aware Neural Network with Large Kernel Pyramid Pooling for Learning Discriminative Features in Semantic Segmentation

2019-06-27 03:58:45

Xianwei Zheng, Linxi Huan, Hanjiang Xiong, Jianya Gong

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation CNN Face Prediction Pose Action Scene_Parsing
Abstract

Semantic segmentation has been a hot topic across diverse research fields. Along with the success of deep convolutional neural networks, semantic segmentation has made great achievements and improvements, in terms of both urban scene parsing and indoor semantic segmentation. However, most of the state-of-the-art models are still faced with a challenge in discriminative feature learning, which limits the ability of a model to detect multi-scale objects and to guarantee semantic consistency inside one object or distinguish different adjacent objects with similar appearance. In this paper, a practical and efficient edge-aware neural network is presented for semantic segmentation. This end-to-end trainable engine consists of a new encoder-decoder network, a large kernel spatial pyramid pooling (LKPP) block, and an edge-aware loss function. The encoder-decoder network was designed as a balanced structure to narrow the semantic and resolution gaps in multi-level feature aggregation, while the LKPP block was constructed with a densely expanding receptive field for multi-scale feature extraction and fusion. Furthermore, the new powerful edge-aware loss function is proposed to refine the boundaries directly from the semantic segmentation prediction for more robust and discriminative features. The effectiveness of the proposed model was demonstrated using Cityscapes, CamVid, and NYUDv2 benchmark datasets. The performance of the two structures and the edge-aware loss function in ELKPPNet was validated on the Cityscapes dataset, while the complete ELKPPNet was evaluated on the CamVid and NYUDv2 datasets. A comparative analysis with the state-of-the-art methods under the same conditions confirmed the superiority of the proposed algorithm.

URL

https://arxiv.org/abs/1906.11428

PDF

https://arxiv.org/pdf/1906.11428.pdf