Identification of Protein-Coding Gene Markers in Breast Invasive Carcinoma Based on Machine Learning

Open access; indexed in Peking University Core Journals (北大核心), CSTPCD, MEDLINE

Objective To screen biomarkers associated with the prognosis of breast invasive carcinoma by analyzing transcriptome expression data with four machine learning algorithms: random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost). Methods Expression data for breast invasive carcinoma were downloaded from The Cancer Genome Atlas. The DESeq2 package, t-test, and univariate Cox analysis were used to identify differentially expressed protein-coding genes associated with survival prognosis in human breast invasive carcinoma samples. RF, XGBoost, LightGBM, and CatBoost models were then built and compared to mine protein-coding gene markers related to prognosis, and breast cancer expression data from the Gene Expression Omnibus served as an external test set for validation. Results A total of 151 differentially expressed protein-coding genes related to survival prognosis were obtained. The machine learning model built with three of them, C3orf80, UGP2, and SPC25, demonstrated the best performance. Conclusions Three protein-coding genes (UGP2, C3orf80, and SPC25) were identified as biomarkers associated with the prognosis of breast invasive carcinoma, providing a new direction for its diagnosis and treatment.
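The screening step described in the Methods can be illustrated with a small sketch. The study itself used the DESeq2 R package, a t-test, and univariate Cox analysis; the Python below is not that pipeline. It only shows the general idea of filtering genes by a fold-change cutoff and a Welch t statistic. The gene names, expression values, cutoffs, and function names are all hypothetical, and no p-values or survival information are computed.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids log(0)."""
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))


def screen_genes(tumor, normal, min_abs_lfc=1.0, min_abs_t=3.0):
    """Return genes passing both (hypothetical) cutoffs, sorted by name."""
    kept = []
    for gene in tumor:
        lfc = log2_fold_change(tumor[gene], normal[gene])
        t = welch_t(tumor[gene], normal[gene])
        if abs(lfc) >= min_abs_lfc and abs(t) >= min_abs_t:
            kept.append(gene)
    return sorted(kept)


# Hypothetical toy expression values (not data from the study).
tumor = {
    "GENE_A": [50.0, 55.0, 60.0, 52.0, 58.0],  # higher in tumor
    "GENE_B": [10.0, 11.0, 9.0, 10.5, 9.5],    # unchanged
    "GENE_C": [3.0, 2.0, 4.0, 3.5, 2.5],       # lower in tumor
}
normal = {
    "GENE_A": [10.0, 12.0, 11.0, 9.0, 13.0],
    "GENE_B": [10.0, 10.5, 9.5, 11.0, 9.0],
    "GENE_C": [20.0, 22.0, 19.0, 21.0, 23.0],
}

selected = screen_genes(tumor, normal)
print(selected)
```

In a real analysis the surviving genes would then go through a survival-based filter and into the classifiers being compared; the absolute-value cutoffs here keep both up-regulated and down-regulated genes.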

武乐;闵开元;柳江枫;梁万丰;杨晔宏;胡刚;杨俊涛

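School of Statistics and Data Science, Nankai University, Tianjin 300071, China; State Key Laboratory of Common Mechanism Research for Major Diseases, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China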

Clinical Medicine

breast invasive carcinoma; biomarker; protein-coding genes; UGP2; C3orf80; SPC25

Acta Academiae Medicinae Sinicae (《中国医学科学院学报》), 2024, Issue 2

pp. 147-153 (7 pages)

National Natural Science Foundation of China (31970649) and the CAMS Innovation Fund for Medical Sciences (CIFMS2021-I2M-1-057, CIFMS2021-I2M-1-001)

10.3881/j.issn.1000-503X.15717
