Abstract
In the surveillance and defense domain, multi-target detection and classification (MTD) is considered essential yet challenging due to heterogeneous inputs from diverse data sources and the computational complexity of algorithms designed for resource-constrained embedded devices, particularly for AI-based solutions. To address these challenges, we propose a feature-fusion and knowledge-distillation framework for multi-modal MTD that leverages data fusion to enhance accuracy and employs knowledge distillation for improved domain adaptation. Specifically, our approach utilizes both RGB and thermal image inputs within a novel fusion-based multi-modal model, coupled with a distillation training pipeline. We formulate the problem as a posterior probability optimization task, which is solved through a multi-stage training pipeline supported by a composite loss function. This loss function effectively transfers knowledge from a teacher model to a student model. Experimental results demonstrate that our student model achieves approximately 95% of the teacher model's mean Average Precision while reducing inference time by roughly 50%, underscoring its suitability for practical MTD deployment scenarios.
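The abstract does not spell out the composite loss, so as a rough illustration of teacher-to-student knowledge transfer, the sketch below shows a generic Hinton-style distillation objective in PyTorch: a hard-label task loss blended with a KL-divergence term on temperature-softened logits. The function name composite_distillation_loss and the temperature/alpha parameters are hypothetical stand-ins, not the paper's actual formulation.

    import torch
    import torch.nn.functional as F

    def composite_distillation_loss(student_logits, teacher_logits, targets,
                                    temperature=4.0, alpha=0.5):
        """Hypothetical composite loss: hard-label task loss plus soft-label
        distillation. The paper's real objective (posterior probability
        optimization with detection-specific terms) is not given in the
        abstract; this is only a generic sketch."""
        # Hard-label task loss against ground-truth class targets.
        task_loss = F.cross_entropy(student_logits, targets)

        # Soft-label term: KL divergence between the student's and teacher's
        # temperature-softened class distributions; the T**2 factor keeps
        # gradient magnitudes comparable across temperatures.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        kd_loss = F.kl_div(log_soft_student, soft_teacher,
                           reduction="batchmean") * temperature ** 2

        # Weighted combination: alpha trades off mimicking the teacher
        # against fitting the labels directly.
        return alpha * kd_loss + (1.0 - alpha) * task_loss

In a multi-stage pipeline like the one described, such a term would typically be added to the detector's usual localization and objectness losses rather than replace them.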
URL
https://arxiv.org/abs/2506.00365