智慧农业(中英文) (Smart Agriculture), 2026, Vol. 8, Issue 1: 72-85, 14.
农作物病虫害自监督适应性多模态特征融合识别方法
Self-Supervised Adaptive Multimodal Feature Fusion Recognition of Crop Diseases and Pests
Abstract
[Objective] Crop diseases and pests are significant factors restricting global agricultural production. Traditional intelligent recognition technologies predominantly rely on single-modal image data processed by convolutional neural networks (CNNs) or Transformers. However, in complex natural environments, these methods often suffer from insufficient information utilization and limited robustness due to the lack of semantic guidance. Although emerging multimodal approaches such as CLIP have introduced textual information, they typically rely on shallow feature alignment in the embedding space without achieving deep semantic interaction or effective feature fusion. Furthermore, the asymmetry between the number of image samples and the number of text labels during training poses a challenge for effective cross-modal learning. In this study, a self-supervised adaptive multimodal feature fusion recognition (SAFusion-CLIP) method is proposed, aiming to significantly enhance classification accuracy and model generalization in fine-grained disease and pest recognition tasks.

[Methods] A comprehensive recognition framework was constructed, integrating four key components to achieve deep fusion of visual and textual features. First, prompt engineering was conducted by utilizing large language models (LLMs) combined with authoritative agricultural guides to transform simple category labels into fine-grained pathological semantic descriptions. These descriptions encapsulated morphological details, color gradients, and texture features, with quality verified by BERTScore and ROUGE-L metrics. Second, a cross-modal balanced alignment module was designed to resolve the problem of sample asymmetry between image batches and fixed text labels. This module employed a dot-product attention mechanism to calculate the correlation between image and text projections, applying Softmax normalization to dynamically align image features with their corresponding textual representations. Third, an adaptive fusion mechanism was employed to achieve deep semantic interaction. A gating unit based on the Sigmoid function was designed to calculate a gate value, which dynamically allocated weights to image and text features, allowing the model to adaptively integrate complementary information from both modalities. Finally, a self-supervised feature reconstruction task was introduced to enhance the robustness of feature representation. A simple decoder was utilized to reconstruct the original image and text embeddings from the fused features, and the model was optimized using a composite objective function combining image-text contrastive loss, mean squared error reconstruction loss, and weighted cross-entropy classification loss.

[Results and Discussions] Extensive experiments were conducted on the standard PlantVillage dataset, which includes 39 categories covering 14 crop species. The proposed SAFusion-CLIP model achieved a classification accuracy of 99.67%, with precision, recall, and F1-Score all exceeding 99.00%. Comparative analysis demonstrated that the proposed method significantly outperformed mainstream single-modal models and the baseline multimodal model: ResNet50 (96.51%), Swin-Transformer (97.48%), and baseline CLIP (98.23%). Visualization analysis using Gradient-weighted Class Activation Mapping (Grad-CAM) indicated that, unlike single-modal models, which were susceptible to background noise or non-specific physical damage, the SAFusion-CLIP model focused more precisely on core lesion areas, effectively suppressing background interference. Furthermore, ablation studies confirmed the effectiveness of the proposed modules, showing that the combination of the self-supervised architecture and the adaptive fusion mechanism resulted in a 2.46 percentage-point accuracy improvement over the baseline, validating the necessity of deep feature interaction and reconstruction tasks.

[Conclusions] By fusing textual semantics with visual features, the SAFusion-CLIP method effectively overcame the limitations of single-modal recognition. The adaptive fusion mechanism ensured deep interaction between modalities, while the self-supervised reconstruction task significantly enhanced the robustness of feature representation. The experimental results verified that this data-driven approach significantly improves accuracy and generalization capabilities in fine-grained crop disease classification tasks, providing a new and effective solution for precision agricultural prevention and control.
Keywords
diseases and pests recognition / multimodal fusion / adaptive feature fusion / self-supervised learning
Classification
Agricultural science and technology
Cite this article
叶鹏林, 闵超, 苟良杰, 王鹏程, 黄小鹏, 李鑫, 蒙玉平. 农作物病虫害自监督适应性多模态特征融合识别方法[J]. 智慧农业(中英文), 2026, 8(1): 72-85, 14.
Funding
National Natural Science Foundation of China (52574048)
Sichuan Science and Technology Program (2025NSFTD0016)
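Illustrative sketch of the Methods paragraph in the abstract. The abstract names three computational steps (dot-product attention with Softmax for cross-modal alignment, a Sigmoid gate for adaptive fusion, and a mean squared error reconstruction loss) but gives no formulas. The following Python is therefore only a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the gate is assumed to be a single scalar computed from the concatenated features, the fusion is assumed to be a convex combination of the two modalities, and plain lists stand in for the learned embeddings a real model would use.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Sigmoid written to avoid overflow for large negative inputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def align_text_to_image(image_vec, text_vecs):
    """Dot-product attention: weight every class-text vector by its
    similarity to the image and return the weighted text vector."""
    weights = softmax([dot(image_vec, t) for t in text_vecs])
    aligned = [
        sum(w * t[i] for w, t in zip(weights, text_vecs))
        for i in range(len(image_vec))
    ]
    return aligned, weights


def gated_fusion(image_vec, text_vec, gate_w, gate_b):
    """Assumed scalar gate g = sigmoid(w . [image; text] + b);
    fused = g * image + (1 - g) * text."""
    gate = sigmoid(dot(gate_w, image_vec + text_vec) + gate_b)
    fused = [gate * a + (1.0 - gate) * b for a, b in zip(image_vec, text_vec)]
    return fused, gate


def mse(pred, target):
    """Mean squared error, as used for a reconstruction loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    image = [1.0, 0.0]
    class_texts = [[1.0, 0.0], [0.0, 1.0]]  # one vector per class description
    aligned, weights = align_text_to_image(image, class_texts)
    fused, gate = gated_fusion(image, aligned, [0.0] * 4, 0.0)
    print(weights, gate, mse(fused, image))
```

Because each image attends over the fixed set of class-text vectors, the alignment step yields one text vector per image regardless of batch size, which is one plausible reading of how the image/text sample asymmetry could be handled. The contrastive and weighted cross-entropy terms of the composite objective are omitted here, since the abstract does not state how the three losses are weighted.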