
Bagging Ensemble-based Feature Selection Method for High-dimensional Imbalanced Data

OA; PKU Core (北大核心); CHSSCD; CSSCI; CSTPCD


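With the development of big data, samples in many application domains appear in high-dimensional form, and the high dimensionality of a dataset weakens the classification performance of imbalanced learning. To address the classification of high-dimensional imbalanced data, the article proposes WAFS, an adaptive feature selection method based on SVM-RFE and Bagging ensemble. The algorithm combines embedded and wrapper feature selection methods and adaptively selects the optimal features that make up the feature space. On 5 public high-dimensional imbalanced datasets of different dimensions (100~5000), it is compared with the filter-based CSFS feature selection algorithm and the embedded ASG feature selection algorithm, and the best sampling method for each dataset and the optimal feature space rate for datasets of different dimensions are also investigated. With AUC, Acc, Recall, F1-score and G-mean as evaluation metrics, the experimental results show that WAFS performs well on datasets of different dimensions and has a particularly large advantage when classifying high-dimensional, small-sample imbalanced datasets; while maintaining accuracy, the model is also highly stable and generalizes well.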

With the development of big data, samples in many application areas are presented in high-dimensional form, and the high dimensionality of datasets degrades the classification performance of imbalanced learning. To address the classification of high-dimensional imbalanced data, this paper proposes WAFS, an adaptive feature selection method based on SVM-RFE and Bagging ensemble, which combines embedded and wrapper feature selection methods to adaptively select the optimal features that form the feature space. On 5 public high-dimensional imbalanced datasets with different dimensions (100~25000), WAFS is compared with the filter-based CSS feature selection algorithm and the embedded ASG feature selection algorithm. The optimal sampling method for different datasets and the optimal feature space rate for datasets of different dimensions are also explored. With AUC, Acc, Recall, F1-score and G-mean as the evaluation indicators, the experimental results show that the WAFS algorithm performs well on datasets of different dimensions, especially on high-dimensional imbalanced datasets with small samples, and that the model has strong stability and generalization while maintaining accuracy.
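To illustrate why the abstract reports Recall, F1-score and G-mean alongside Acc, here is a minimal sketch that computes these indicators from binary labels. It assumes the common binary definition of G-mean as the square root of recall times specificity; the abstract does not state the paper's exact formulas, and the function name `imbalance_metrics` and the toy labels are hypothetical. AUC is omitted because it needs prediction scores rather than hard labels.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return Acc, Recall, F1 and G-mean for binary labels (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    acc = (tp + tn) / len(pairs) if pairs else 0.0

    # Assumed definition: geometric mean of recall and specificity.
    g_mean = math.sqrt(recall * specificity)
    return {"acc": acc, "recall": recall, "f1": f1, "g_mean": g_mean}


# Toy data: 95 majority-class samples and 5 minority-class samples.
y_true = [0] * 95 + [1] * 5

# A classifier that always predicts the majority class.
majority_only = imbalance_metrics(y_true, [0] * 100)

# A classifier that finds 4 of the 5 minority samples at the cost of
# 10 false alarms on the majority class.
y_pred = [0] * 85 + [1] * 10 + [1] * 4 + [0]
minority_aware = imbalance_metrics(y_true, y_pred)
```

On this toy data the always-majority classifier has the higher accuracy but a recall and G-mean of zero, while the second classifier gives up some accuracy and scores well on recall and G-mean. This is the reason accuracy alone is a weak indicator on imbalanced data.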

Wang Jinbo (王劲波); Liu Li (刘礼)

School of Economics, Xiamen University, Xiamen, Fujian 361000, China

Economics

self-adaptation; feature selection; Bagging ensemble; high-dimensional imbalance

Statistics & Decision (《统计与决策》), 2024, No. 22

pp. 53-58 (6 pages)

National Social Science Fund of China, General Program (22BTJ006)

10.13546/j.cnki.tjyjc.2024.22.009
