Journal of Liaoning University (Natural Science Edition), 2023, Vol. 50, Issue 4: 312-317, 6.
基于刺突蛋白序列和机器学习方法预测冠状病毒宿主多分类
Multi-Classification Prediction of Coronavirus Hosts Based on Spike Protein Sequences and Machine Learning Methods
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused a global COVID-19 pandemic beginning in late 2019, with the coronavirus jumping species to multiple mammals, including humans. Rapid and accurate prediction of coronavirus host classification is of great significance for future epidemic prevention and control. In this study, spike protein sequences were collected from the NCBI (National Center for Biotechnology Information) Virus database. After CD-HIT was used to remove redundant sequences, 3,216 sequences remained, which were divided into 6 classes according to host. Sorted by collection time, they were split into a training set and a test set at an 8:2 ratio. The distribution descriptor (CTDD) and the natural language model Seq2Vec were used to encode features of the spike protein sequences. A variety of machine learning methods were used to train and evaluate the classification models. Seq2Vec-GCNN, the best model, achieved an accuracy of 99.37% in predicting human hosts, while CTDD-RF performed very well on the other host classes, with accuracies of 95.82% for swine, 95.96% for avian hosts, 98.33% for camels, 92.06% for bats and 94.01% for other mammals. The results show that it is practical and effective to use machine learning methods to build coronavirus host classification models based on spike protein sequences.
Keywords
machine learning / coronavirus / spike protein
Classification
Information Technology and Security Science
Cite this article
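Zhao Jian (赵健), Wang Zhibo (王治博), Xie Zhai (谢翟), Zhang Li (张力), Liu Hongsheng (刘宏生). Multi-Classification Prediction of Coronavirus Hosts Based on Spike Protein Sequences and Machine Learning Methods [J]. Journal of Liaoning University (Natural Science Edition), 2023, 50(4): 312-317, 6.
Funding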
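Shenyang Young and Middle-aged Science and Technology Innovation Talent Support Program (RC210216)
National Natural Science Foundation of China, Young Scientists Fund (82003655)
General Project of the Education Department of Liaoning Province (LJKZ0088)
Liaoning Revitalization Talents Program (XLYC2002045)
Key Research and Development Program of Liaoning Province (2019JH2/10300041)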