Abstract
Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, most existing studies rely on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between the RGB and thermal modalities. Recent vision foundation models (VFMs), trained through self-supervision on vast amounts of unlabeled data, have demonstrated their ability to extract informative, general-purpose features, but this potential has yet to be fully leveraged in this domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we examine the unique characteristics of the RGB and thermal modalities and design a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, outperforms all other state-of-the-art RGB-thermal scene parsing networks, achieving top ranks on three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm opens up new opportunities for future developments in data-fusion scene parsing approaches.
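To make the asymmetric-encoder idea concrete, the following is a toy NumPy sketch and not HAPNet's actual architecture. Everything in it is an assumption made for illustration: the function names (`patch_embed`, `cnn_pyramid`, `fuse`), the random untrained projections, the choice to send RGB through the VFM-like branch and thermal through the CNN-like branch (the abstract does not state the pairing), and the use of element-wise addition as the fusion operator. It only shows how a single-scale, patch-based branch and a multi-scale, convolution-like branch can produce complementary features that are merged step by step from coarse to fine.

```python
import numpy as np

rng = np.random.default_rng(0)


def avg_pool(x, k):
    """Average-pool an (H, W, C) array with a k x k window and stride k."""
    h, w, c = x.shape
    gh, gw = h // k, w // k
    return x[: gh * k, : gw * k].reshape(gh, k, gw, k, c).mean(axis=(1, 3))


def upsample2(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def patch_embed(img, patch, dim):
    """Hypothetical stand-in for a frozen VFM: non-overlapping patches
    followed by one linear projection (random, untrained weights)."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    patches = img[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh, gw, patch * patch * c)
    weight = rng.standard_normal((patch * patch * c, dim)) / np.sqrt(patch * patch * c)
    return patches @ weight  # single-scale features, shape (gh, gw, dim)


def cnn_pyramid(img, dim, strides=(4, 8)):
    """Hypothetical stand-in for a CNN branch: a feature pyramid built from
    pooling plus a per-pixel (1x1) projection at each stride."""
    feats = []
    for s in strides:
        weight = rng.standard_normal((img.shape[2], dim))
        feats.append(avg_pool(img, s) @ weight)
    return feats  # ordered fine to coarse


def fuse(vfm_feat, cnn_feats):
    """Assumed coarse-to-fine fusion by addition (illustrative only)."""
    fine_cnn, coarse_cnn = cnn_feats
    coarse = coarse_cnn + avg_pool(vfm_feat, 2)           # fuse at stride 8
    return fine_cnn + vfm_feat + upsample2(coarse)        # refine at stride 4


rgb = rng.random((32, 32, 3))      # synthetic RGB image
thermal = rng.random((32, 32, 1))  # synthetic thermal image

fused = fuse(patch_embed(rgb, patch=4, dim=16), cnn_pyramid(thermal, dim=16))
print(fused.shape)  # (8, 8, 16)
```

The two branches are deliberately different in structure: one yields a single feature map at stride 4, the other a two-level pyramid. That mismatch is what "asymmetric" refers to here. In a real model the branches and the fusion step would be learned modules.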
URL
https://arxiv.org/abs/2404.03527