Paper Reading AI Learner

Feature Fusion and Knowledge-Distilled Multi-Modal Multi-Target Detection

2025-05-31 03:11:44
Ngoc Tuyen Do, Tri Nhu Do

Abstract

In the surveillance and defense domain, multi-target detection and classification (MTD) is considered essential yet challenging due to heterogeneous inputs from diverse data sources and the computational complexity of algorithms designed for resource-constrained embedded devices, particularly for AI-based solutions. To address these challenges, we propose a feature fusion and knowledge-distilled framework for multi-modal MTD that leverages data fusion to enhance accuracy and employs knowledge distillation for improved domain adaptation. Specifically, our approach utilizes both RGB and thermal image inputs within a novel fusion-based multi-modal model, coupled with a distillation training pipeline. We formulate the problem as a posterior probability optimization task, which is solved through a multi-stage training pipeline supported by a composite loss function. This loss function effectively transfers knowledge from a teacher model to a student model. Experimental results demonstrate that our student model achieves approximately 95% of the teacher model's mean Average Precision while reducing inference time by approximately 50%, underscoring its suitability for practical MTD deployment scenarios.
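
The abstract names two core ingredients without specifying their implementation: a two-stream RGB/thermal feature-fusion model and a composite teacher-to-student distillation loss. Below is a minimal PyTorch sketch of both ideas, assuming concatenation-based fusion and a Hinton-style loss (hard-label cross-entropy plus temperature-scaled KL divergence to the teacher's soft predictions). All names (`FusionStudent`, `composite_kd_loss`) and hyperparameters (`alpha`, `T`) are illustrative assumptions, not the authors' actual architecture or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStudent(nn.Module):
    """Toy two-stream model: separate RGB and thermal encoders whose
    feature maps are fused by channel concatenation + 1x1 convolution."""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.thermal_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(64, 32, kernel_size=1)  # feature fusion
        self.head = nn.Linear(32, num_classes)        # class logits

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        # Encode each modality, concatenate along channels, then fuse.
        f = torch.cat([self.rgb_encoder(rgb), self.thermal_encoder(thermal)], dim=1)
        f = F.relu(self.fuse(f))
        f = F.adaptive_avg_pool2d(f, 1).flatten(1)
        return self.head(f)

def composite_kd_loss(student_logits, teacher_logits, targets,
                      alpha: float = 0.5, T: float = 2.0) -> torch.Tensor:
    """Composite loss: (1 - alpha) * cross-entropy on ground-truth labels
    + alpha * temperature-scaled KL divergence to the teacher's soft output."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients for the softened distributions
    return (1 - alpha) * ce + alpha * kd

if __name__ == "__main__":
    # Usage sketch; teacher_logits stands in for a frozen teacher's output.
    student = FusionStudent(num_classes=4)
    rgb = torch.randn(2, 3, 64, 64)
    thermal = torch.randn(2, 1, 64, 64)
    teacher_logits = torch.randn(2, 4)
    targets = torch.randint(0, 4, (2,))
    loss = composite_kd_loss(student(rgb, thermal), teacher_logits, targets)
    loss.backward()
```

Note that the paper targets detection (mAP is its reported metric), so the real loss presumably also includes box-regression terms; the sketch reduces this to the classification component to keep the distillation mechanics visible.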

URL

https://arxiv.org/abs/2506.00365

PDF

https://arxiv.org/pdf/2506.00365.pdf

