Paper Reading AI Learner

XMACNet: An Explainable Lightweight Attention based CNN with Multi Modal Fusion for Chili Disease Classification

2026-03-06 11:23:43
Tapon Kumer Ray, Rajkumar Y, Shalini R, Srigayathri K, Jayashree S, Lokeswari P

Abstract

Plant disease classification via imaging is a critical task in precision agriculture. We propose XMACNet, a novel lightweight Convolutional Neural Network (CNN) that integrates self-attention and multi-modal fusion of visible imagery and vegetation indices for chili disease detection. XMACNet uses an EfficientNetV2S backbone enhanced by a self-attention module and a fusion branch that processes both RGB images and computed vegetation index maps (NDVI, NPCI, MCARI). We curated a new dataset of 12,000 chili leaf images across six classes (five disease types plus healthy), augmented synthetically via StyleGAN to mitigate data scarcity. Trained on this dataset, XMACNet achieves high accuracy, F1-score, and AUC, outperforming baseline models such as ResNet-50, MobileNetV2, and a Swin Transformer variant. Crucially, XMACNet is explainable: we use Grad-CAM++ and SHAP to visualize and quantify the model's focus on disease features. The model's compact size and fast inference make it suitable for edge deployment in real-world farming scenarios.
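The abstract does not specify how the vegetation index maps are derived, but NDVI, NPCI, and MCARI have standard band-ratio definitions. The sketch below computes per-pixel index maps from named spectral bands and stacks them into a three-channel input for a fusion branch; the band names and the `index_stack` helper are illustrative assumptions, not the paper's API.

```python
import numpy as np

EPS = 1e-6  # guard against division by zero on dark pixels

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + EPS)

def npci(r680, r430):
    """Normalized Pigment Chlorophyll Index: (R680 - R430) / (R680 + R430)."""
    return (r680 - r430) / (r680 + r430 + EPS)

def mcari(r700, r670, r550):
    """Modified Chlorophyll Absorption Ratio Index."""
    return ((r700 - r670) - 0.2 * (r700 - r550)) * (r700 / (r670 + EPS))

def index_stack(bands):
    """bands: dict of float reflectance arrays in [0, 1], one per band.
    Returns an (H, W, 3) array: one channel per vegetation index."""
    maps = [
        ndvi(bands["nir"], bands["red"]),
        npci(bands["r680"], bands["r430"]),
        mcari(bands["r700"], bands["r670"], bands["r550"]),
    ]
    return np.stack(maps, axis=-1)
```

The stacked index maps would then be fed to the fusion branch alongside the raw RGB image, so the network sees both appearance and physiology-sensitive channels.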

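The abstract names a self-attention module over the EfficientNetV2S features but gives no internals. A minimal sketch, assuming standard scaled dot-product attention over flattened spatial positions of a CNN feature map (the function name and projection shapes are illustrative, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(feats, wq, wk, wv):
    """Scaled dot-product self-attention over spatial positions.

    feats: (H*W, C) flattened CNN feature map
    wq, wk, wv: (C, d) learned projection matrices
    Returns (attended features of shape (H*W, d), attention map (H*W, H*W)).
    """
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # pairwise spatial affinities
    attn = softmax(scores, axis=-1)          # each row is a distribution
    return attn @ v, attn                    # re-weighted features
```

Each spatial position is re-expressed as an attention-weighted mixture of all positions, which lets the network emphasize lesion regions regardless of where they appear on the leaf.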

URL

https://arxiv.org/abs/2603.06750

PDF

https://arxiv.org/pdf/2603.06750.pdf

