Computer Engineering and Applications, Issue (11): 120-125, 138, 7. DOI: 10.3778/j.issn.1002-8331.1312-0148
Spectral clustering based oversampling: oversampling taking within-class imbalance into consideration
Abstract
Imbalanced datasets are one of the most crucial challenges encountered by data mining techniques. Oversampling has proven to be a very effective method for dealing with imbalanced datasets. However, traditional oversampling methods ignore within-class imbalance, which is pervasive in real-world datasets. To resolve this problem, this paper proposes an oversampling method based on modified spectral clustering. The method first automatically determines the best number of clusters, then applies modified spectral clustering to the minority samples. Based on the number of samples in each cluster, it determines how many samples to generate inside each cluster, yielding a dataset that is balanced both between and within classes. The method is tested on four real-world datasets and one simulated dataset and is shown to be effective. Moreover, it is compared with traditional k-means clustering based oversampling, and the results are analyzed and explained.
Keywords
spectral clustering / imbalanced dataset / oversampling
Classification
Information Technology and Security Science
Cite this article
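LUO Zichao, JIN Sun, QIU Xuefeng. Spectral clustering based oversampling: oversampling taking within-class imbalance into consideration [J]. Computer Engineering and Applications, 2014, (11): 120-125, 138, 7.
Funding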
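National Science and Technology Support Program of the 12th Five-Year Plan (No. 2012BAF06B03); National Natural Science Foundation of China (No. 51175340).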
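Illustrative sketch
The abstract says that the number of samples to generate inside each minority cluster is derived from the cluster sizes, but it does not give the rule. The sketch below shows one plausible allocation, assuming every minority cluster is raised to an equal share of the majority class size. The function name `synthetic_counts` and the equal-share rule are assumptions made for illustration, not the authors' published procedure.

```python
def synthetic_counts(cluster_sizes, majority_size):
    """Return how many synthetic samples to generate in each minority cluster.

    Assumption (not from the paper): each of the k minority clusters is
    raised to an equal share of the majority class size.
    """
    k = len(cluster_sizes)
    if k == 0:
        raise ValueError("need at least one minority cluster")
    base, extra = divmod(majority_size, k)
    # The first `extra` clusters get one more sample so that the targets
    # sum exactly to the majority class size.
    targets = [base + (1 if i < extra else 0) for i in range(k)]
    # A cluster already at or above its target receives no new samples.
    return [max(0, target - size) for target, size in zip(targets, cluster_sizes)]


# Three minority clusters of sizes 30, 15 and 5 against 90 majority samples.
counts = synthetic_counts([30, 15, 5], 90)
print(counts)
```

Under this assumed rule, small clusters receive the most synthetic samples, which is the effect the abstract describes: balance between classes and among the clusters within the minority class.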