A clustering-based sampling method for imbalanced data (一种基于不平衡数据的聚类抽样方法)

Zhu Yaqi (朱亚奇), Deng Weibin (邓维斌)

Journal of Nanjing University (Natural Science) (南京大学学报（自然科学版）), Issue (2): 421-429, 9.

DOI: 10.13232/j.cnki.jnju.2015.02.029

A method using clustering and sampling approach for imbalance data

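Zhu Yaqi¹, Deng Weibin¹

Author information

1. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China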

Abstract

Classification is an important research topic in machine learning. Current classification methods are relatively mature and generally perform well on balanced data. In the real world, however, class proportions are often unbalanced. Traditional classifiers are designed on the premise of balanced data and pursue the best overall accuracy, so applying them to massive unbalanced data sharply degrades their performance and yields heavily biased results. Most commonly, the recognition rate of minority samples is far lower than that of majority samples, so samples that belong to the minority class are mistakenly assigned to the majority class. To address this problem, under-sampling can convert an unbalanced dataset into a balanced one, reducing the degree of imbalance and allowing traditional classifiers to perform well. However, under-sampling can discard important information, and a clustering algorithm can counteract this loss. Meanwhile, ensemble methods combine many classifiers through weighted voting, which can significantly improve the accuracy and generalization ability of models. This paper also proposes a learning algorithm based on decomposition.
First, we cluster the sample data with the k-means algorithm. Then, on the basis of the clustering, we under-sample according to sample weights to produce a balanced dataset. Next, we train and test a decision tree on each balanced dataset and adjust the weights of the misclassified samples; samples with higher weights are more likely to be selected. Taking the error rate of each base classifier as the basis for its weight, we select the best classifiers to obtain a weighted ensemble. Finally, we carry out three experiments on eight datasets from the UCI repository. The first experiment picks out a classifier that works well in combination with the proposed method. The last two experiments compare against data-level and algorithm-level methods. Compared with other algorithms, the proposed method achieves higher precision and F-measure on the minority class. At the same time, it greatly reduces the size of the training set.
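The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: only the majority class is clustered, a one-feature decision stump stands in for the decision tree, the sampling weight of a misclassified sample is simply doubled, and each classifier's voting weight is an AdaBoost-style log-odds of its error rate. All function names (`kmeans`, `train_stump`, `fit_ensemble`, `predict`) are hypothetical, and label 1 denotes the minority class.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain k-means; returns one cluster index per point."""
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign


def stump_predict(stump, x):
    """Predict 1 (minority) or 0 (majority) from a single threshold rule."""
    feature, threshold, sign = stump
    return 1 if sign * (x[feature] - threshold) >= 0 else 0


def train_stump(X, y):
    """Assumed base learner: exhaustive search for the lowest-error stump."""
    best, best_errors = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for sign in (1, -1):
                stump = (feature, threshold, sign)
                errors = sum(stump_predict(stump, x) != t for x, t in zip(X, y))
                if best_errors is None or errors < best_errors:
                    best, best_errors = stump, errors
    return best


def fit_ensemble(X, y, k=3, rounds=5, seed=0):
    """Cluster the majority class, under-sample it by weight, train, reweight."""
    rng = random.Random(seed)
    minority = [i for i, t in enumerate(y) if t == 1]
    majority = [i for i, t in enumerate(y) if t == 0]
    cluster_of = kmeans([X[i] for i in majority], k, rng)
    weight = {i: 1.0 for i in majority}  # sampling weight per majority sample
    ensemble = []
    for _ in range(rounds):
        chosen = []
        for c in range(k):
            members = [i for i, a in zip(majority, cluster_of) if a == c]
            if not members:
                continue
            # Each cluster contributes in proportion to its size, so the
            # sampled majority roughly matches the minority in count.
            quota = max(1, round(len(minority) * len(members) / len(majority)))
            # random.choices draws with replacement (kept for brevity).
            chosen += rng.choices(members, weights=[weight[i] for i in members],
                                  k=quota)
        subset = chosen + minority
        stump = train_stump([X[i] for i in subset], [y[i] for i in subset])
        wrong = [i for i in subset if stump_predict(stump, X[i]) != y[i]]
        error = min(max(len(wrong) / len(subset), 1e-6), 1 - 1e-6)
        for i in wrong:
            if i in weight:
                weight[i] *= 2.0  # assumed rule: misclassified samples drawn more often
        ensemble.append((math.log((1 - error) / error), stump))
    return ensemble


def predict(ensemble, x):
    """Weighted vote: each stump adds +alpha for minority, -alpha for majority."""
    score = sum(alpha if stump_predict(stump, x) == 1 else -alpha
                for alpha, stump in ensemble)
    return 1 if score > 0 else 0


if __name__ == "__main__":
    demo_rng = random.Random(1)
    X = [(demo_rng.random(), demo_rng.random()) for _ in range(90)]       # majority
    X += [(3 + demo_rng.random(), demo_rng.random()) for _ in range(10)]  # minority
    y = [0] * 90 + [1] * 10
    model = fit_ensemble(X, y)
    print([predict(model, x) for x in X[-10:]])
```

Sampling each cluster in proportion to its size is one way to keep the under-sampled majority representative of its original structure, which is the information loss the abstract says clustering is meant to counteract. The quota rule, the weight update, and the voting weight here are placeholders and may differ from the rules used in the paper.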

Key words

machine learning/imbalanced data/ensemble learning/under-sampling

Classification

Information technology and security science

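Cite this article

Zhu Yaqi, Deng Weibin. A method using clustering and sampling approach for imbalance data [J]. Journal of Nanjing University (Natural Science), 2015, (2): 421-429, 9.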

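Funding

National Natural Science Foundation of China (61272060, 61309014); Natural Science Foundation of Chongqing (cstc2012jjA40032, cstc2013jcyjA40063); Open Fund of the Chongqing / Ministry of Information Industry Key Laboratory of Computer Network and Communication Technology (CY-CNCL-2010-05)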

Journal of Nanjing University (Natural Science)

OA / CSCD / CSTPCD

ISSN: 0469-5097
