Journal of Southern Agriculture, 2026, Vol. 57, Issue (2): 451-461, 11. DOI: 10.3969/j.issn.2095-1191.2026.02.013
Application of AI-driven PCA-random forest model in intelligent identification of mulberry varieties in Guangxi
Abstract
[Objective] This study aimed to investigate the application of an artificial intelligence (AI)-driven principal component analysis (PCA)-random forest model in the intelligent identification of mulberry varieties in Guangxi, providing a scientific basis for clarifying the genetic relationships of local mulberry germplasm resources in Guangxi and for their conservation and utilization.
[Method] Plump seeds of six elite mulberry varieties were used as research materials. Whole-genome resequencing was performed on the six varieties using the Illumina NovaSeq 6000 sequencer to construct mulberry genomic DNA libraries. To address single nucleotide polymorphism (SNP) redundancy in the PCA-random forest model, a two-round screening strategy was adopted. Candidate SNP loci were dimensionally reduced via PCA, and the score of each SNP locus under the first several principal components was obtained. Bayesian optimization was used to search for the optimal hyperparameters for model construction, and the model was trained with a machine learning algorithm under 5-fold cross-validation to prevent overfitting, yielding the importance value of each SNP. Core SNP loci were screened to establish a molecular marker library for mulberry variety identification, and the key SNP loci extracted from the mulberry samples to be identified were aligned with those in the library to verify the identification accuracy of the optimal model.
[Result] The mapping rates of the six mulberry varieties ranged from 92.87% to 97.34%, indicating high quality of the sequencing data. After strict quality control and alignment, a total of 1,163,291 high-quality SNP loci were obtained, which were distributed in the upstream regions, exons, introns and intergenic regions with proportions of 6.27%, 10.74%, 25.18%, and 48.17%, respectively. The proportions of SNPs with transition and transversion mutations were 64.66% and 35.34%, respectively, with an average fixation index of 0.63. The SNP density was unevenly distributed across different chromosomes, and the SNP loci density on Chr01 was significantly higher than that on other chromosomes. Based on the scores of each SNP locus under the first three principal components, the top 10,000 SNP loci were selected for subsequent machine learning training and locus screening. According to the importance value of each SNP, 225 core SNP loci were finally screened out. The four key evaluation indicators of the optimal model, namely F1 score, precision, recall and accuracy, all reached 100%. Comparative verification showed that the variety number of the SNP library constructed for each mulberry sample was completely consistent with the predicted variety name.
[Conclusion] The AI-driven PCA-random forest model algorithm successfully screens 225 core SNP loci from six mulberry varieties, and this method can be effectively applied to the identification of mulberry varieties.
Key words
mulberry/PCA-random forest/SNP/intelligent identification
Classification
Agricultural science and technology
Cite this article
刘丹, 林强, 邱长玉, 黄胜, 卿军, 莫荣利, 陆晓媚, 曾燕蓉, 何国玲, 张朝华. Application of AI-driven PCA-random forest model in intelligent identification of mulberry varieties in Guangxi [J]. Journal of Southern Agriculture, 2026, 57(2): 451-461, 11.
Funding
National Key Research and Development Program of China (2022YFD1601301)
China Agriculture Research System (CARS-18)
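
The final step described in the abstract, comparing the key SNP loci of a sample to be identified against the molecular marker library, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says identification relies on the trained PCA-random forest model, whereas the sketch below substitutes a plain exact-match score over shared loci. The variety names, locus IDs, genotypes, and the helper functions `match_score` and `identify` are all hypothetical and exist only for illustration.

```python
from typing import Dict, Tuple

# Genotype profile: locus ID -> genotype call at that locus.
Genotypes = Dict[str, str]


def match_score(sample: Genotypes, reference: Genotypes) -> float:
    """Fraction of shared loci at which sample and reference genotypes agree."""
    shared = [locus for locus in reference if locus in sample]
    if not shared:
        return 0.0
    agree = sum(1 for locus in shared if sample[locus] == reference[locus])
    return agree / len(shared)


def identify(sample: Genotypes, library: Dict[str, Genotypes]) -> Tuple[str, float]:
    """Return the library variety with the highest match score, and that score."""
    scores = {name: match_score(sample, ref) for name, ref in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical marker library (made-up varieties, loci, and genotypes).
# A real library would hold the 225 core SNP loci for each of the six varieties.
library = {
    "variety_A": {"Chr01:1050": "AA", "Chr01:8842": "CT", "Chr03:2210": "GG", "Chr05:771": "TT"},
    "variety_B": {"Chr01:1050": "AG", "Chr01:8842": "CC", "Chr03:2210": "GG", "Chr05:771": "CT"},
    "variety_C": {"Chr01:1050": "GG", "Chr01:8842": "CT", "Chr03:2210": "AG", "Chr05:771": "CC"},
}

# Hypothetical sample to be identified.
unknown = {"Chr01:1050": "AG", "Chr01:8842": "CC", "Chr03:2210": "GG", "Chr05:771": "CT"}

best_name, best_score = identify(unknown, library)
print(best_name, best_score)
```

An exact-match score is easy to audit locus by locus, but it treats every locus as equally informative. A model-based classifier such as the random forest named in the abstract can instead weight loci by their importance values.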