Chinese Journal of Chromatography (色谱), 2026, Vol. 44, Issue 4: 444-452 (9 pages). DOI: 10.3724/SP.J.1123.2025.02012
Development of a machine learning-based ionization efficiency prediction model for per- and polyfluoroalkyl substances and its application in semi-quantitative analysis
Abstract
Per- and polyfluoroalkyl substances (PFASs) are emerging contaminants of global concern in fields such as environmental science and food safety because of their persistence, bioaccumulative properties and potential toxicity. Although screening methods for PFASs based on high-resolution mass spectrometry (HRMS) have developed rapidly, the diversity of PFASs and the absence of reference standards pose significant challenges for quantitative analysis. In this study, 50 PFASs were analyzed by HPLC-HRMS, and the ionization efficiency (IE) of each compound was calculated as the slope of its calibration curve. A quantitative structure-activity relationship (QSAR) model was developed using machine learning to predict the IEs of PFASs from PaDEL molecular descriptors. By incorporating the predicted IE values, the model enables semi-quantitative estimation of PFAS concentrations in the absence of reference standards. Eighteen critical descriptors were selected from a total of 1,444 PaDEL descriptors by recursive feature elimination (RFE). The selected descriptors comprised topological, geometrical, autocorrelation, and electrostatic and polarity descriptors, among which VE1_Dzv, GATS6i, JGI10, GATS1p and MATS4m were particularly important. Three algorithms, namely elastic net linear regression, random forest (RF) and XGBoost, were evaluated for model performance. For the elastic net linear regression model, the root mean square error (RMSE) on the training dataset was 0.0490 and the coefficient of determination (R²) was 0.9930; on the test dataset, the RMSE was 0.1630 and the R² was 0.7561. For the RF model, the training RMSE was 0.1631 and the R² was 0.9219; on the test dataset, the RMSE was 0.1316 and the R² was 0.8409. For the XGBoost model, the training RMSE was 0.0521 and the R² was 0.9920; on the test dataset, the RMSE was 0.1184 and the R² was 0.8713. The two nonlinear algorithms, random forest and XGBoost, showed better predictive performance than elastic net linear regression, with XGBoost performing best. Random forest is a bagging-based approach: it trains individual decision trees independently and aggregates their predictions by averaging. XGBoost, in contrast, uses gradient boosting: it optimizes the model iteratively, training each new tree to address the residuals left by the previous iterations. Because its trees are trained independently, random forest lacks this iterative optimization framework, and this fundamental difference in optimization strategy enables XGBoost to correct prediction errors more effectively than random forest. Based on a comprehensive evaluation of the three models, the XGBoost algorithm was ultimately selected for its demonstrated performance advantages. The prediction errors of IE for the 50 PFASs were within 1.67-fold, with a median of 1.04-fold and an RMSE of 1.06. The established XGBoost model was then applied to semi-quantitative concentration prediction for the 50 PFASs across a series of concentration levels, where the prediction errors ranged from 0.12-fold to 4.90-fold, with a median of 0.96-fold and an RMSE of 0.94. Prediction accuracy improved as the concentration increased. Furthermore, the model was applied to predict the concentrations of PFASs in fish tissue. After extraction and cleanup by solid-phase extraction, the samples were analyzed by HPLC-HRMS, and the concentrations of PFASs were semi-quantified using the predicted IEs, yielding prediction errors ranging from 0.79-fold to 1.81-fold. These findings highlight the robustness of the IE prediction model for PFASs. Notably, the performance of the developed model was better than or comparable to that reported in previous studies. In conclusion, this study introduces a machine learning-based QSAR model for the prediction of ionization efficiency. The approach demonstrates that the concentrations of PFASs can be estimated in the absence of standards, and it therefore shows considerable potential for the risk assessment of compounds lacking standards in suspect and non-targeted screening.
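The semi-quantification idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear response through the origin (peak area = IE × concentration), treats the fold error as the ratio of predicted to measured concentration, and uses invented calibration data and an invented "predicted" IE in place of the output of the XGBoost model. All function and variable names are hypothetical.

```python
def calibration_slope(conc, area):
    """Ordinary least-squares slope of peak area versus concentration.

    The abstract defines IE as the slope of the calibration curve.
    """
    n = len(conc)
    mean_c = sum(conc) / n
    mean_a = sum(area) / n
    sxy = sum((c - mean_c) * (a - mean_a) for c, a in zip(conc, area))
    sxx = sum((c - mean_c) ** 2 for c in conc)
    return sxy / sxx


def semi_quantify(peak_area, predicted_ie):
    """Estimate a concentration from a peak area and a predicted IE.

    Assumption: response is linear with zero intercept.
    """
    return peak_area / predicted_ie


def fold_error(predicted, measured):
    """Ratio of predicted to measured value (1.0 means perfect agreement)."""
    return predicted / measured


# Hypothetical calibration data for one compound (arbitrary units).
conc = [1.0, 2.0, 5.0, 10.0]
area = [2.0e4, 4.0e4, 1.0e5, 2.0e5]
measured_ie = calibration_slope(conc, area)

# Hypothetical IE from a QSAR model, used when no standard is available.
predicted_ie = 2.5e4

# Semi-quantify a sample whose true concentration is 5.0 (peak area 1.0e5).
est = semi_quantify(1.0e5, predicted_ie)
fe = fold_error(est, 5.0)
print(f"measured IE = {measured_ie:.0f}, estimate = {est:.2f}, fold error = {fe:.2f}")
```

In this toy case the predicted IE overestimates the measured slope, so the concentration is underestimated and the fold error falls below 1, the same direction as the lower end of the error ranges reported in the abstract.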
Keywords
per- and polyfluoroalkyl substances (PFASs) / quantitative structure-activity relationship (QSAR) / machine learning / semi-quantitative predictive model
Classification
Chemistry and chemical engineering
Cite this article
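孙沈正, 李瑶瑶, 高燕, 李康聪, 陈智, 李秀琴, 张庆合. Development of a machine learning-based ionization efficiency prediction model for per- and polyfluoroalkyl substances and its application in semi-quantitative analysis [J]. Chinese Journal of Chromatography (色谱), 2026, 44(4): 444-452.
Funding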
National Key Research and Development Program of China (No. 2023YFF0612601)
National Natural Science Foundation of China (No. 22006145)