
Multi-scale Unified Network for Image Classification

2024-03-27 06:40:26
Wenzhuo Liu, Fei Zhu, Cheng-Lin Liu

Abstract

Convolutional Neural Networks (CNNs) have advanced significantly in visual representation learning and recognition. However, they face notable challenges in performance and computational efficiency when dealing with real-world, multi-scale image inputs. Conventional methods rescale all input images to a fixed size. A larger fixed size favors performance, but rescaling small images to a larger size introduces digitization noise and increases computation cost. In this work, we carry out a comprehensive, layer-wise investigation of how CNN models respond to scale variation, based on Centered Kernel Alignment (CKA) analysis. The observations reveal that lower layers are more sensitive to input image scale variations than higher layers. Inspired by this insight, we propose the Multi-scale Unified Network (MUSN), consisting of multi-scale subnets, a unified network, and a scale-invariant constraint. Our method divides the shallow layers into multi-scale subnets to enable feature extraction from multi-scale inputs, and the low-level features are unified in the deep layers for extracting high-level semantic features. A scale-invariant constraint is imposed to maintain feature consistency across different scales. Extensive experiments on ImageNet and other scale-diverse datasets demonstrate that MUSN achieves significant improvements in both model performance and computational efficiency. In particular, MUSN yields an accuracy increase of up to 44.53% and reduces FLOPs by 7.01-16.13% in multi-scale scenarios.
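
The layer-wise CKA analysis mentioned above can be illustrated with a minimal sketch. The abstract does not say which CKA variant the authors use, so this example assumes linear CKA; the function name `linear_cka` and the random feature matrices are hypothetical stand-ins for activations taken from the same layer at two input scales.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows.

    x has shape (n_samples, d1) and y has shape (n_samples, d2). Each row
    holds the flattened activations of one image. Assumption: the linear
    variant of CKA is used; the paper may use a different kernel.
    """
    # Center every feature dimension over the sample axis.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Hypothetical stand-ins for one layer's activations at two input scales.
rng = np.random.default_rng(0)
feats_scale_a = rng.standard_normal((500, 8))
feats_scale_b = feats_scale_a + 0.1 * rng.standard_normal((500, 8))
similarity = linear_cka(feats_scale_a, feats_scale_b)
print(f"CKA between the two scales: {similarity:.3f}")
```

In an analysis of this kind, a score near 1 for a given layer would suggest its representation changes little when the input scale changes, while a lower score would suggest scale sensitivity. Computing the score layer by layer is one way to arrive at a comparison like the one the abstract reports between lower and higher layers.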


URL

https://arxiv.org/abs/2403.18294

PDF

https://arxiv.org/pdf/2403.18294.pdf

