Abstract
The complexity of scene parsing grows with the number of object and scene classes, and is highest in unrestricted open scenes. The central challenge is to model the spatial relations between scene elements while still identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes attention weights for each level of representation to generate the final class labels. A novel `channel attention module' is designed to compute these attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution, which preserves the abstract spatial relationships among scene elements and reduces computation cost. The spatial attention features are subsequently concatenated into a final feature set before feature boosting is applied. These low-resolution spatial attention features are trained with an auxiliary task that helps the network learn a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
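The attention-weighting idea described above can be illustrated with a minimal sketch. This is not the paper's implementation; the pooling, the softmax gating, and the `feature_boost` function below are assumptions chosen to show how per-level attention weights can boost relevant extraction stages and attenuate the rest.

```python
# Hypothetical sketch of a channel/level attention mechanism:
# each level's feature map is globally average-pooled, a softmax over
# levels produces attention weights, and the fused descriptor is the
# attention-weighted sum. Pure-Python for illustration only.
import math


def global_avg_pool(feature_map):
    """Average a 2-D feature map (list of rows) down to one scalar."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature_boost(level_maps):
    """Compute attention weights over extraction levels and fuse them.

    level_maps: one 2-D feature map per extraction stage.
    Returns (weights, fused): weights sum to 1, so strongly activated
    levels are boosted while the others are attenuated.
    """
    pooled = [global_avg_pool(m) for m in level_maps]
    weights = softmax(pooled)
    fused = sum(w * p for w, p in zip(weights, pooled))
    return weights, fused
```

In a real network the pooled descriptors would pass through learned layers before the softmax; here the gating is computed directly from the pooled values to keep the sketch self-contained.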
URL
https://arxiv.org/abs/2402.19250