Abstract
Accurate identification of agricultural pests is essential for crop protection but remains challenging due to large intra-class variance and fine-grained differences among pest species. While deep learning has advanced pest detection, most existing approaches rely solely on low-level visual features and lack effective multi-modal integration, leading to limited accuracy and poor interpretability. Moreover, the scarcity of high-quality multi-modal agricultural datasets further restricts progress in this field. To address these issues, we construct two novel multi-modal benchmarks, CTIP102 and STIP102, based on the widely used IP102 dataset, and introduce a Multi-scale Cross-Modal Fusion Network (MSFNet-CPD) for robust pest detection. Our approach enhances visual quality via a super-resolution reconstruction module and feeds both the original and reconstructed images into the network to improve clarity and detection performance. To better exploit semantic cues, we propose an Image-Text Fusion (ITF) module for joint modeling of visual and textual features, and an Image-Text Converter (ITC) that reconstructs fine-grained details across multiple scales to handle challenging backgrounds. Furthermore, we introduce an Arbitrary Combination Image Enhancement (ACIE) strategy to generate a more complex and diverse pest detection dataset, MTIP102, improving the model's generalization to real-world scenarios. Extensive experiments demonstrate that MSFNet-CPD consistently outperforms state-of-the-art methods on multiple pest detection benchmarks. All code and datasets will be made publicly available.
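The abstract describes the ACIE strategy only at a high level. As a rough illustration of what an "arbitrary combination" of enhancements could look like, the sketch below picks a random non-empty subset of image operations and applies them in random order. The operation set, the function names, the grayscale list-of-rows image format, and the restriction to photometric transforms (which leave bounding boxes unchanged) are all assumptions made for illustration; none of them is taken from the paper.

```python
import random
from itertools import combinations

# Hypothetical enhancement operations on a grayscale image stored as a
# list of rows of 0-255 integers. These are illustrative assumptions,
# not the operations used by the authors.
def brighten(img, delta=40):
    return [[min(255, p + delta) for p in row] for row in img]

def darken(img, delta=40):
    return [[max(0, p - delta) for p in row] for row in img]

def stretch_contrast(img, factor=1.5):
    # Scale each pixel's distance from mid-gray, then clamp to 0-255.
    return [[max(0, min(255, round(128 + (p - 128) * factor))) for p in row]
            for row in img]

OPS = {"brighten": brighten, "darken": darken, "contrast": stretch_contrast}

def all_combinations(names):
    """Return every non-empty subset of the operation names."""
    names = sorted(names)
    return [c for k in range(1, len(names) + 1) for c in combinations(names, k)]

def enhance(img, rng):
    """Apply one randomly chosen combination of operations, in random order."""
    combo = list(rng.choice(all_combinations(OPS)))
    rng.shuffle(combo)
    out = img
    for name in combo:
        out = OPS[name](out)  # each op returns a new image; input is untouched
    return out, combo
```

Calling `enhance(image, random.Random(seed))` repeatedly with different seeds would yield differently enhanced copies of the same image. A real detection pipeline that adds geometric operations would also need to transform the bounding boxes, which this sketch deliberately avoids.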
URL
https://arxiv.org/abs/2505.02441