Scientia Agricultura Sinica, 2025, Vol. 58, Issue (15): 2960-2979, 20. DOI: 10.3864/j.issn.0578-1752.2025.15.003
Genomic Selection Method Based on G2PSE Stacking Ensemble
Abstract
[Objective] Genomic selection (GS) is a core technology for predicting individual phenotypes or genetic values from genome-wide marker information, and it has important theoretical value and practical significance in agricultural breeding and genetic research. However, high-dimensional feature redundancy and the modeling of nonlinear relationships remain key challenges in genomic selection. A genotype-to-phenotype stacking ensemble (G2PSE) is proposed to improve prediction accuracy and generalization ability and to provide an efficient solution for high-dimensional genomic data analysis.
[Method] The G2PSE stacking ensemble framework was constructed by incorporating ten-fold cross-validation, ensemble learning, feature selection (LAR algorithm), and feature enhancement strategies. The model employed random forest (RF), support vector regression (SVR), and gradient boosting regression (GBR) as base learners, with ordinary least squares regression (OLSR) as the meta-learner. Additionally, the impact of alternative meta-learners, such as random forest, support vector regression, and neural networks, on model performance was evaluated. The G2PSE model consisted of three core submodels: (1) all-feature stacking ensemble (AFSE), which used all SNP features; (2) LAR-feature stacking ensemble (LFSE), which reduced redundant information through feature selection to improve generalization; and (3) LAR-feature enhanced stacking ensemble (LFESE), which combined feature selection with enhancement strategies to optimize prediction capability on high-dimensional data. The performance of three feature enhancement variants (AFESE, HFESEⅠ, HFESEⅡ) was also explored. Finally, the model was evaluated on multi-trait datasets from three species (wheat, soybean, and tilapia) and further evaluated on an independent test set using the Pepper203 dataset to validate its robustness.
[Result] The G2PSE model significantly outperformed traditional methods and single machine learning models on two metrics, the Pearson correlation coefficient (PCC) and the mean absolute error (MAE). Among the three core submodels, LFESE performed best by combining feature selection and enhancement strategies, LFSE reduced redundant information and improved generalization through feature selection, and AFSE had a significant advantage in comprehensively capturing global genotypic information. In addition, the three feature enhancement variants further confirmed that feature quality matters more than feature quantity for improving prediction performance. The experiments also showed that the linear regression model performed best as the meta-learner, while the LFESE and LFSE submodels showed a more balanced performance in terms of computational efficiency. A reasonable feature selection threshold was also crucial for model performance: the optimal threshold was 10%-20% for low-dimensional datasets and 1% for high-dimensional datasets. Finally, evaluation on an independent test set showed that the LFESE submodel had the best generalization ability.
[Conclusion] The G2PSE model significantly improves genomic selection prediction performance through ensemble learning, feature selection, and enhancement strategies.
Key words
genomic selection/stacking ensemble/feature selection/feature enhancement/agricultural breeding
Cite this article
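Zhuang Runjie, Liu Huiming, Wang Shiyu, Lü Wanping, Wen Yongxian. Genomic Selection Method Based on G2PSE Stacking Ensemble[J]. Scientia Agricultura Sinica, 2025, 58(15): 2960-2979, 20.
Funding
Natural Science Foundation of Fujian Province (2021J01126); National Natural Science Foundation of China (32071892); Science and Technology Innovation Special Fund of Fujian Agriculture and Forestry University (KFB22094XA)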
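The stacking scheme summarized under [Method] in the abstract can be illustrated with a minimal sketch. It loosely resembles the LFSE submodel: least angle regression (LAR) picks a subset of SNPs, then RF, SVR, and GBR base learners are stacked under an ordinary least squares meta-learner using ten-fold cross-validation, and the result is scored with PCC and MAE. This is not the authors' implementation. The use of scikit-learn, the simulated genotype data, the helper names (`simulate_genotypes`, `lar_select`, `stacking_sketch`), the 10% threshold, and all hyperparameters are assumptions made for illustration. The feature enhancement step is not attempted, because the abstract does not specify how it is done.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Lars, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def simulate_genotypes(n_samples=200, n_snps=300, n_causal=10, seed=0):
    """Hypothetical toy data: SNPs coded 0/1/2 and an additive phenotype."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
    effects = np.zeros(n_snps)
    causal = rng.choice(n_snps, size=n_causal, replace=False)
    effects[causal] = rng.normal(size=n_causal)
    y = X @ effects + rng.normal(size=n_samples)
    return X, y


def lar_select(X_train, y_train, fraction=0.10):
    """Return indices of the SNPs that LAR enters first (about `fraction` of all).

    The 10% default is an assumed threshold, not a value taken from the paper's code.
    """
    k = max(1, int(fraction * X_train.shape[1]))
    lars = Lars(n_nonzero_coefs=k).fit(X_train, y_train)
    return np.flatnonzero(lars.coef_)


def stacking_sketch(seed=0):
    """Fit an LFSE-like stack on toy data and return (PCC, MAE, n_selected)."""
    X, y = simulate_genotypes(seed=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)

    # Feature selection is fitted on the training split only, to avoid leakage.
    selected = lar_select(X_tr, y_tr)

    # Base learners produce out-of-fold predictions (cv=10) that become
    # the inputs of the linear meta-learner.
    stack = StackingRegressor(
        estimators=[
            ("rf", RandomForestRegressor(n_estimators=50, random_state=seed)),
            ("svr", SVR()),
            ("gbr", GradientBoostingRegressor(random_state=seed)),
        ],
        final_estimator=LinearRegression(),
        cv=10,
    )
    stack.fit(X_tr[:, selected], y_tr)
    pred = stack.predict(X_te[:, selected])

    pcc = float(np.corrcoef(y_te, pred)[0, 1])    # Pearson correlation coefficient
    mae = float(mean_absolute_error(y_te, pred))  # mean absolute error
    return pcc, mae, int(selected.size)


if __name__ == "__main__":
    pcc, mae, n_selected = stacking_sketch()
    print(f"selected SNPs: {n_selected}, PCC: {pcc:.3f}, MAE: {mae:.3f}")
```

Passing `X_tr` and `X_te` without the `selected` index would give an AFSE-like variant that uses all SNP features, which makes the effect of the selection step easy to compare on the same split.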