
Research on risk URL detection based on Boosting ensemble learning (Open Access)

Original title: 基于Boosting集成学习的风险URL检测研究


As the Internet develops and the number of websites keeps growing, the URL, as the sole entry point to a website, has become a prime target of web attacks. Traditional URL detection mainly targets malicious URLs and relies on feature values and blacklists/whitelists; it is prone to false negatives (missed detections) and has limited ability to detect complex URLs. To address these problems, a hybrid model for detecting risk URLs in business access is proposed, based on the Boosting idea in ensemble learning. The model first treats the URL as a string and applies natural language processing techniques to tokenize and vectorize it. It then follows a stepwise modeling approach: a binary classification model built with the GBDT algorithm determines whether a URL is risky, and the original strings of the risky URLs are then fed into a multi-class model, which uses the XGBoost algorithm to identify the specific risk type of each risky URL, providing a reference for security analysts. Parameters were tuned continuously during model construction, and the binary and multi-class models were evaluated with AUC and F1, respectively. The binary model achieved an AUC of 98.91% and the multi-class model an F1 of 0.993, indicating good performance. When applied in a production environment and compared with existing detection measures, the model achieved a higher detection rate than the existing WAF and APT security devices, and its results covered detections that those measures missed.
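The two-step flow described in the abstract (tokenize and vectorize the URL string, a boosted binary model, then a multi-class model over the risky URLs, scored with AUC and F1) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the tokenizer, the bag-of-tokens features, the toy URLs and their labels are all hypothetical, the trees are depth-1 stumps, and the second stage reuses the same hand-written gradient boosting in a one-vs-rest arrangement as a stand-in for XGBoost. Macro-averaging of F1 is also an assumption, since the abstract does not say which averaging was used.

```python
import math
import re
from collections import Counter

def tokenize(url):
    """Split a URL on non-alphanumeric characters (an assumed, simple tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def vectorize(urls, vocab=None):
    """Bag-of-tokens counts; the vocabulary is built from `urls` unless supplied."""
    docs = [Counter(tokenize(u)) for u in urls]
    if vocab is None:
        vocab = sorted({t for d in docs for t in d})
    return [[d[t] for t in vocab] for d in docs], vocab

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, scores):
    """Total binary log-loss of raw scores against 0/1 labels."""
    return -sum(math.log(sigmoid(s) if yi else 1.0 - sigmoid(s)) for yi, s in zip(y, scores))

def fit_stump(X, residuals):
    """Depth-1 regression tree: the one-feature split with least squared error."""
    best = (float("inf"), 0, float("inf"), 0.0, 0.0)  # fallback: no split, zero update
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best[0]:
                best = (err, j, thr, lm, rm)
    return best[1:]

def fit_gbdt(X, y, rounds=30, lr=0.5):
    """Gradient boosting for log-loss: each stump fits the residuals y - p."""
    scores, stumps = [0.0] * len(X), []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lr * lm, lr * rm))
        scores = [s + (lr * lm if row[j] <= thr else lr * rm) for row, s in zip(X, scores)]
    return stumps

def raw_score(stumps, row):
    """Sum of the stump outputs for one feature vector (pre-sigmoid score)."""
    return sum(l if row[j] <= thr else r for j, thr, l, r in stumps)

def auc(y, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for yi, s in zip(y, scores) if yi]
    neg = [s for yi, s in zip(y, scores) if not yi]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical toy data: (url, risk type), where None means benign.
DATA = [
    ("/search?q=<script>alert(1)</script>", "xss"),
    ("/comment?text=<script>document.cookie</script>", "xss"),
    ("/item?id=1 union select password from users", "sqli"),
    ("/login?user=admin' or 1=1 union select 1", "sqli"),
    ("/view?file=../../etc/passwd", "traversal"),
    ("/download?path=../../../etc/shadow", "traversal"),
    ("/search?q=flight+beijing", None),
    ("/item?id=42", None),
    ("/view?file=report.pdf", None),
    ("/login?user=alice", None),
    ("/download?path=ticket.pdf", None),
    ("/comment?text=hello", None),
]

# Stage 1: binary model, risky (1) versus benign (0).
X, vocab = vectorize([u for u, _ in DATA])
y = [int(label is not None) for _, label in DATA]
stage1 = fit_gbdt(X, y)
stage1_scores = [raw_score(stage1, row) for row in X]
print("stage 1 log-loss:", log_loss(y, [0.0] * len(y)), "->", log_loss(y, stage1_scores))
print("stage 1 training AUC:", auc(y, stage1_scores))

# Stage 2: one-vs-rest boosted models over the risky URLs only.
risky = [(u, label) for u, label in DATA if label is not None]
Xr, vocab_r = vectorize([u for u, _ in risky])
labels_r = [label for _, label in risky]
stage2 = {c: fit_gbdt(Xr, [int(l == c) for l in labels_r]) for c in sorted(set(labels_r))}
pred_r = [max(stage2, key=lambda c: raw_score(stage2[c], row)) for row in Xr]
print("stage 2 training macro-F1:", macro_f1(labels_r, pred_r))
```

The printed metrics are in-sample scores on twelve made-up URLs, so they say nothing about the figures reported in the abstract; the sketch only shows how the two stages and the two metrics fit together.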

冯美琪;李赟;蒋冰;王立松;刘春波;陈伟

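TravelSky Technology Limited (中国民航信息网络股份有限公司), Operation Center, Beijing 101318;TravelSky Technology Limited, Engineering Technology Research Center for Domestic Adaptation of IT Infrastructure, Beijing 101318;Civil Aviation University of China (中国民航大学), Information Security Evaluation Center, Tianjin 300300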

Computer Science and Automation

web attacks; ensemble learning; regularization; stepwise modeling method

《网络安全与数据治理》 (Cyber Security and Data Governance), 2024, Issue 7

Pages 32-40 (9 pages)

DOI: 10.19358/j.issn.2097-1788.2024.07.006
