Journal of Nanjing University (Natural Sciences), Issue (2): 421-429, 9. DOI: 10.13232/j.cnki.jnju.2015.02.029
A method using clustering and sampling approach for imbalance data
Abstract
Classification is an important research topic in machine learning. Current classification methods are relatively mature and generally perform well on balanced data. In the real world, however, class proportions are often unbalanced. Traditional classifiers are designed on the premise of balanced data and pursue the best overall accuracy. Applying them to massive unbalanced data therefore degrades their performance sharply and yields strongly biased results. Most commonly, the recognition rate of minority samples is far lower than that of majority samples, so samples that belong to the minority class are misclassified into the majority class. To address this problem, under-sampling can convert an unbalanced dataset into a balanced one, reducing the degree of imbalance so that traditional classifiers perform well. Under-sampling, however, can lose important information, and clustering can offset this loss. In addition, ensemble methods combine multiple classifiers through weighted voting, which can significantly improve the accuracy and generalization ability of the model. This paper proposes a learning algorithm based on decomposition.
First, the sample data are clustered with the k-means algorithm. On the basis of the clustering, weighted under-sampling is used to produce balanced datasets. A decision tree is then trained and tested on each balanced dataset, and the weights of misclassified samples are adjusted; a sample with a higher weight is more likely to be selected. Taking the error rate of each base classifier as the basis for its weight, the best classifiers are selected to form a weighted ensemble. Finally, three experiments are carried out on eight datasets from the UCI repository. The first selects a base classifier that works well in combination with the proposed method. The last two compare data-level and algorithm-level methods. Compared with other algorithms, the proposed method achieves higher precision and a higher F-measure on the minority class, while greatly reducing the size of the training set.
Keywords
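machine learning/imbalanced data/ensemble learning/under-sampling
Classification
Information Technology and Security Science
Cite this article
ZHU Yaqi, DENG Weibin. A method using clustering and sampling approach for imbalance data[J]. Journal of Nanjing University (Natural Sciences), 2015, (2): 421-429, 9.
Funding
National Natural Science Foundation of China (61272060, 61309014), Natural Science Foundation of Chongqing (cstc2012jjA40032, cstc2013jcyjA40063), Open Fund of the Chongqing/Ministry of Information Industry Key Laboratory of Computer Network and Communication Technology (CY-CNCL-2010-05)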
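
Illustrative sketch
The abstract describes clustering with k-means followed by weighted under-sampling, but gives no details. The Python sketch below is one possible reading of those two steps only, not the authors' implementation. Several choices are assumptions made for illustration: only the majority class is clustered, each cluster's quota is proportional to its size, and the weights of misclassified samples are multiplied by a fixed factor. The function names (kmeans, weighted_sample, cluster_undersample, raise_weights) are hypothetical, and the decision tree training and the weighted voting are omitted.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def weighted_sample(indices, weights, n, rng):
    """Draw up to n distinct indices; a higher weight means a higher chance."""
    pool = list(indices)
    chosen = []
    for _ in range(min(n, len(pool))):
        pick = rng.choices(range(len(pool)), weights=[weights[i] for i in pool])[0]
        chosen.append(pool.pop(pick))
    return chosen


def cluster_undersample(majority, n_minority, weights, k=3, seed=0):
    """Pick roughly n_minority majority-class indices, spread across clusters."""
    rng = random.Random(seed)
    assign = kmeans(majority, k, seed=seed)
    selected = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        # Assumed quota rule: proportional to cluster size (not from the source).
        quota = round(n_minority * len(members) / len(majority))
        selected += weighted_sample(members, weights, quota, rng)
    return selected


def raise_weights(weights, misclassified, factor=2.0):
    """Return a copy of weights with misclassified samples scaled up.

    The multiplicative factor is an assumption; the abstract only says
    that the weights of misclassified samples are adjusted.
    """
    return [w * factor if i in misclassified else w for i, w in enumerate(weights)]


# Usage on synthetic 2-D majority-class data (three blobs of 30 points each).
data_rng = random.Random(1)
majority = [(data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]
weights = [1.0] * len(majority)
idx = cluster_undersample(majority, n_minority=15, weights=weights, k=3)
# Pretend the first selected sample was misclassified by a base classifier.
weights = raise_weights(weights, misclassified={idx[0]})
```

Because each cluster's quota is rounded independently, the number of selected samples can differ slightly from n_minority. Repeating the sampling with updated weights would yield one balanced training set per round, on each of which a base classifier could then be trained.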