Journal of Frontiers of Computer Science and Technology, 2025, Vol. 19, Issue 10: 2697-2711, 15. DOI: 10.3778/j.issn.1673-9418.2411043
Synthetic Oversampling Method Based on Fast Clustering Algorithm for Density Peaks
Abstract
Class imbalance, a major challenge in classification, arises when the training dataset contains far more majority-class samples than minority-class samples. This imbalance impairs the generalization ability of the classifier and can significantly reduce recognition accuracy on minority samples. Oversampling techniques, especially the synthetic minority over-sampling technique (SMOTE) and its variants, mitigate the problem by generating additional minority samples to balance the dataset. However, these methods have limitations: they may introduce noise during sample generation, produce samples with insufficient diversity, and pay too little attention to boundary regions. Because boundary samples play a key role in classification decisions and are easily misclassified, this paper proposes an oversampling strategy that accurately identifies boundary samples and generates high-quality new samples around them. First, the CFSFDP (clustering by fast search and find of density peaks) algorithm, which can identify local density peaks, is used to calculate the local density of each minority sample, and the samples located at the classification boundary are then identified. Next, the Euclidean distance between each boundary sample and its nearest majority-class sample is calculated to define a suitable spherical region for that boundary sample; the region covers the potential distribution range of the boundary sample while avoiding excessive overlap with majority-class samples. Finally, once a boundary sample and its spherical region are determined, a new synthetic sample is randomly generated within the region. This step increases the diversity of minority samples and brings the generated samples closer to the real boundary distribution, which helps the classifier learn the complex features of the minority class. To verify its effectiveness, the proposed method is compared with 9 existing oversampling methods on 32 real-world imbalanced datasets. Experimental results show that the proposed method performs well on multiple evaluation metrics.
Keywords
unbalanced data / CFSFDP clustering algorithm / synthetic oversampling / boundary samples
Classification
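Information Technology and Security Science
Cite this article
Leng Qiangkui (冷强奎), Li Zihan (李梓涵). Synthetic Oversampling Method Based on Fast Clustering Algorithm for Density Peaks[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(10): 2697-2711, 15.
Funding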
This work was supported by the Research Project of Liaoning Provincial Department of Education (JYTMS20230819) and the Doctoral Research Startup Fund of Liaoning Technical University (21-1043).
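Illustrative sketch

The pipeline outlined in the abstract (local density of minority samples, boundary selection, a sphere bounded by the nearest majority sample, random generation inside the sphere) might look roughly like the following. This is only a minimal sketch and not the authors' implementation. The abstract does not state how boundary samples are selected from the density values or how the radius is derived from the nearest-majority distance, so the low-density quantile rule (`density_quantile`), the radius factor (`radius_scale`), the cutoff-kernel density, and all function and parameter names here are assumptions made for illustration.

```python
import numpy as np


def local_density(X_min, d_c):
    """CFSFDP-style cutoff-kernel density: number of other minority
    samples closer than the cutoff distance d_c."""
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    return (d < d_c).sum(axis=1) - 1  # subtract 1 to exclude the sample itself


def oversample_boundary(X_min, X_maj, n_new, d_c,
                        density_quantile=0.5, radius_scale=0.5, seed=0):
    """Generate n_new synthetic minority samples around boundary samples.

    ASSUMPTIONS (not taken from the paper):
      * boundary samples = minority samples whose local density is at or
        below the given quantile of all minority densities;
      * sphere radius = radius_scale * distance to nearest majority sample.
    """
    rng = np.random.default_rng(seed)
    dim = X_min.shape[1]

    # Step 1: local density of each minority sample, then boundary selection.
    rho = local_density(X_min, d_c)
    boundary = X_min[rho <= np.quantile(rho, density_quantile)]

    # Step 2: Euclidean distance from each boundary sample to its nearest
    # majority sample defines the radius of its spherical region.
    d_maj = np.linalg.norm(
        boundary[:, None, :] - X_maj[None, :, :], axis=2
    ).min(axis=1)
    radii = radius_scale * d_maj

    # Step 3: pick boundary samples at random and draw one point uniformly
    # inside each chosen sphere (random direction, radius scaled by u**(1/dim)).
    idx = rng.integers(0, len(boundary), size=n_new)
    directions = rng.normal(size=(n_new, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    r = radii[idx] * rng.random(n_new) ** (1.0 / dim)
    return boundary[idx] + directions * r[:, None]
```

With `radius_scale` below 1, each sphere stops short of the nearest majority sample, which is one simple way to reflect the stated goal of avoiding excessive overlap with the majority class. The returned array would be appended to the minority class before training a classifier.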