
Feature boosting with efficient attention for scene parsing

2024-02-29 15:22:21
Vivek Singh, Shailza Sharma, Fabio Cuzzolin

Abstract

The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relations between scene elements while still identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes attention weights for each level of representation to generate the final class labels. A novel 'channel attention module' is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and to reduce computational cost. Spatial attention is subsequently concatenated into a final feature set before feature boosting is applied. The low-resolution spatial attention features are trained using an auxiliary task that helps the model learn a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
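
The abstract does not specify how the channel attention module is built, so the following is only a minimal sketch of the general idea of boosting some channels and attenuating others. It assumes a squeeze-and-excitation-style gate (global average pooling followed by a sigmoid) as a stand-in; the function names, the toy feature values, and the `weights` and `bias` parameters are all hypothetical and not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_descriptor(feature_map):
    """Global average pooling: one scalar per channel.

    feature_map is a list of channels, each a flat list of spatial activations.
    """
    return [sum(channel) / len(channel) for channel in feature_map]


def channel_attention(feature_map, weights, bias):
    """Squeeze-and-excitation-style gate (an assumed stand-in, not the paper's module).

    weights is a C x C matrix and bias a length-C vector; both are hypothetical
    learned parameters. Returns one gate in (0, 1) per channel.
    """
    pooled = channel_descriptor(feature_map)
    gates = []
    for row, b in zip(weights, bias):
        logit = sum(w * p for w, p in zip(row, pooled)) + b
        gates.append(sigmoid(logit))
    return gates


def boost(feature_map, gates):
    """Scale every activation in a channel by that channel's gate."""
    return [[g * v for v in channel] for channel, g in zip(feature_map, gates)]


# Toy input: two channels standing in for features from two extraction stages,
# each with four spatial positions.
features = [
    [1.0, 2.0, 3.0, 2.0],  # channel mean 2.0
    [0.5, 0.5, 0.5, 0.5],  # channel mean 0.5
]
# Hypothetical parameters chosen so channel 0 is favoured and channel 1 suppressed.
weights = [[2.0, 0.0], [0.0, -2.0]]
bias = [0.0, 0.0]

gates = channel_attention(features, weights, bias)
boosted = boost(features, gates)
```

With these toy parameters the first channel receives a gate close to 1 and the second a gate well below 0.5, which illustrates the "boosted" versus "attenuated" behaviour described above. In a trained network the gate parameters would be learned jointly with the rest of the model rather than set by hand.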


URL

https://arxiv.org/abs/2402.19250

PDF

https://arxiv.org/pdf/2402.19250.pdf

