Abstract
Multispectral pedestrian detection is an important task for many around-the-clock applications, since the visible and thermal modalities provide complementary information, especially under low-light conditions. To reduce the influence of hand-designed components in existing multispectral pedestrian detectors, we propose a MultiSpectral pedestrian DEtection TRansformer (MS-DETR), which extends Deformable DETR to the multi-modal paradigm. To facilitate multi-modal learning, a Reference box Constrained Cross-Attention (RCCA) module is first introduced into the multi-modal Transformer decoder; it uses the fusion branch together with the reference boxes as intermediaries to enable interaction between the visible and thermal modalities. To further balance the contributions of the different modalities, we design a modality-balanced optimization strategy, which aligns the slots of the decoders by adaptively adjusting the instance-level weights of the three branches. Our end-to-end MS-DETR shows superior performance on the challenging KAIST and CVC-14 benchmark datasets.
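As a rough illustration of what instance-level weighting across three branches could look like, here is a toy sketch. The abstract does not give the weighting rule, so everything here is an assumption rather than the authors' formulation: the weights are a softmax over each instance's per-branch losses (visible, thermal, fusion), so the branch that currently fits an instance worst receives the largest weight. The names `balance_weights`, `balanced_loss`, and `temperature` are hypothetical.

```python
import math


def balance_weights(losses, temperature=1.0):
    """Softmax over the per-branch losses of ONE instance.

    Assumed rule for illustration only (not taken from the paper):
    a branch with a larger loss gets a larger weight.
    """
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def balanced_loss(per_instance_losses, temperature=1.0):
    """Mean over instances of the weighted sum of branch losses.

    per_instance_losses: list of (visible, thermal, fusion) loss triples.
    """
    if not per_instance_losses:
        return 0.0
    total = 0.0
    for losses in per_instance_losses:
        weights = balance_weights(losses, temperature)
        total += sum(w * loss for w, loss in zip(weights, losses))
    return total / len(per_instance_losses)
```

In a real training loop the weights would typically be treated as constants (no gradient flowing through them), and the per-branch losses would come from matching each decoder's predictions to the ground-truth boxes; both points are likewise assumptions here.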
URL
https://arxiv.org/abs/2302.00290