Abstract
In this paper, we propose DiCENet, a new CNN model built from two components: (1) dimension-wise convolutions and (2) efficient channel fusion. The introduced blocks maximize the use of information in the input tensor by learning representations across all of its dimensions while simultaneously reducing the complexity of the network and achieving high accuracy. Our model shows significant improvements over state-of-the-art models across various visual recognition tasks, including image classification, object detection, and semantic segmentation. It delivers the same or better performance than existing models, including task-specific ones, with fewer FLOPs. Notably, DiCENet delivers performance competitive with neural architecture search-based methods at a lower FLOP budget (70-100 MFLOPs). On MS-COCO object detection, DiCENet is 4.5% more accurate and has 5.6 times fewer FLOPs than YOLOv2. On the PASCAL VOC 2012 semantic segmentation dataset, DiCENet is 4.3% more accurate and has 3.2 times fewer FLOPs than ESPNet, a recent efficient semantic segmentation network. Our source code is available at \url{this https URL}
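The core idea sketched in the abstract, convolving along every dimension of the input tensor rather than only spatially, can be illustrated with a toy NumPy snippet. This is a hypothetical illustration of the concept only, not the paper's actual DiCE unit: the kernel here is fixed rather than learned, and the concatenation stands in for the learned channel-fusion step.

```python
import numpy as np

def dimwise_conv(x, kernel):
    """Apply the same 1-D kernel independently along each dimension
    (channel, height, width) of a CHW tensor — a conceptual sketch of
    dimension-wise convolution, not the paper's exact block."""
    conv = lambda v: np.convolve(v, kernel, mode="same")
    d_c = np.apply_along_axis(conv, 0, x)  # along the channel axis
    d_h = np.apply_along_axis(conv, 1, x)  # along the height axis
    d_w = np.apply_along_axis(conv, 2, x)  # along the width axis
    # Stack the three responses on the channel axis; in the real model a
    # learned, efficient channel-fusion step would mix them back together.
    return np.concatenate([d_c, d_h, d_w], axis=0)

x = np.random.rand(4, 8, 8)          # toy CHW feature map
k = np.array([0.25, 0.5, 0.25])      # fixed 3-tap kernel, illustration only
y = dimwise_conv(x, k)
print(y.shape)                        # (12, 8, 8): 3*C channels before fusion
```

Each branch costs only a 1-D convolution per output element, which is where the FLOP savings over standard 3x3 convolutions come from.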
URL
https://arxiv.org/abs/1906.03516