Journal of Integration Technology, Issue (2): 35-41, 7.
A Clustering-Based Enhanced Classification Algorithm for Imbalanced Data
Abstract
Imbalanced data are widespread in the real world, and their classification is a hot topic in the field of machine learning. A clustering-based enhanced AdaBoost algorithm is proposed to address the poor performance of traditional algorithms on the minority class of imbalanced datasets. The algorithm first constructs a balanced training set by clustering-based undersampling: K-means clustering is applied to the majority class, the cluster centroids are extracted, and they are merged with all minority class instances to form a new balanced training set. To avoid the loss of classification accuracy caused by a training set that becomes too small when there are too few minority class samples, SMOTE (Synthetic Minority Oversampling Technique) is combined with the clustering-based undersampling. Next, the misclassification loss function of the base classifier in the AdaBoost algorithm is modified according to cost-sensitive learning theory, so that asymmetric misclassification losses are assigned to samples of different classes. The experimental results show that the proposed algorithm makes the training samples more representative and greatly increases the classification accuracy of the minority class while maintaining the overall classification performance.
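The clustering-based undersampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmeans_centroids` and `cluster_undersample`, the choice of setting the number of clusters equal to the number of minority instances, the random initialization, and the label coding (0 for majority, 1 for minority) are all assumptions made here for illustration. The SMOTE step and the cost-sensitive AdaBoost loss are not shown.

```python
import random


def kmeans_centroids(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means on a list of points; returns k centroids."""
    rng = random.Random(seed)
    # Assumption: initialize centroids from k distinct randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids


def cluster_undersample(majority, minority, seed=0):
    """Replace the majority class by K-means centroids and merge with all
    minority instances. Assumption: k equals the minority class size, so
    the resulting training set is balanced."""
    k = len(minority)
    centroids = kmeans_centroids(majority, k, seed=seed)
    X = centroids + [list(p) for p in minority]
    y = [0] * len(centroids) + [1] * len(minority)  # 0 = majority, 1 = minority
    return X, y
```

Calling `cluster_undersample(majority, minority)` with 20 majority points and 4 minority points returns 8 training instances, 4 per class. Each majority representative is a cluster centroid rather than an original sample, which is what lets the reduced set summarize the whole majority distribution.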
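Key words
imbalanced data classification/K-means clustering/AdaBoost/ensemble learning
Classification
Information Technology and Security Science
Citation
Hu Xiaosheng, Zhang Runjing, Zhong Yong. A Clustering-Based Enhanced Classification Algorithm for Imbalanced Data [J]. Journal of Integration Technology, 2014, (2): 35-41, 7.
Funding
Guangdong Universities Outstanding Young Innovative Talents Training Program (2013LYM_0097); Foshan Research on an Intelligent Education Evaluation Index System (DX20120220); Foshan University university-level research project.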