Synthetic Oversampling Method Based on Fast Clustering Algorithm for Density Peaks

Leng Qiangkui (冷强奎) 1, Li Zihan (李梓涵) 1

计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2025, Vol. 19, Issue 10: 2697-2711 (15 pages). DOI: 10.3778/j.issn.1673-9418.2411043

Author information

  • 1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China

Abstract

Class imbalance, a major challenge in classification, arises when the majority class substantially outnumbers the minority class in the training dataset. The imbalance weakens the generalization ability of the classifier and can sharply reduce recognition accuracy on minority samples. Oversampling techniques, especially the synthetic minority over-sampling technique (SMOTE) and its variants, mitigate the problem by generating additional minority samples to balance the dataset. However, these methods can introduce noise during sample generation, produce samples with limited diversity, and pay insufficient attention to boundary regions. Because boundary samples play a key role in classification decisions and are easily misclassified, this paper proposes an oversampling strategy that identifies boundary samples and generates high-quality new samples around them. First, the CFSFDP (clustering by fast search and find of density peaks) algorithm, which identifies local density peaks, is used to calculate the local density of each minority sample, and the samples located at the classification boundary are then screened out. Next, the Euclidean distance between each boundary sample and its nearest majority-class sample is calculated and used to define a spherical region for that boundary sample; the region covers the potential distribution range of the boundary sample while avoiding excessive overlap with the majority class. New synthetic samples are then generated at random within these regions. This step increases the diversity of minority samples and brings the generated samples closer to the true boundary distribution, which helps the classifier learn the complex features of the minority class. To verify its effectiveness, the proposed method is compared with 9 existing oversampling methods on 32 real-world imbalanced datasets. Experimental results show that the proposed method performs well on multiple evaluation metrics.
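The pipeline described in the abstract (local density, boundary screening, spherical region, random generation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the screening rule or the radius formula, so the sketch assumes that boundary samples are the lowest-density fraction of minority samples and that the sphere radius is a fixed fraction of the distance to the nearest majority sample. The names `local_density`, `oversample_boundary`, `d_c`, `keep_fraction`, and `shrink` are all hypothetical.

```python
import math
import random


def local_density(minority, d_c):
    """Cutoff-kernel local density in the style of density-peaks clustering:
    for each minority point, count the other minority points closer than d_c."""
    return [
        sum(1 for j, q in enumerate(minority) if j != i and math.dist(p, q) < d_c)
        for i, p in enumerate(minority)
    ]


def oversample_boundary(minority, majority, n_new, d_c,
                        keep_fraction=0.5, shrink=0.5, seed=0):
    """Generate n_new synthetic minority points around assumed boundary samples."""
    rng = random.Random(seed)
    rho = local_density(minority, d_c)

    # ASSUMPTION: the lowest-density minority samples are treated as boundary
    # samples; the paper's actual screening criterion is not given in the abstract.
    order = sorted(range(len(minority)), key=lambda i: rho[i])
    n_boundary = max(1, int(keep_fraction * len(minority)))
    boundary = [minority[i] for i in order[:n_boundary]]

    # ASSUMPTION: sphere radius = shrink * distance to the nearest majority
    # sample, so the sphere never reaches that majority sample when shrink < 1.
    radii = [shrink * min(math.dist(p, q) for q in majority) for p in boundary]

    dim = len(minority[0])
    synthetic = []
    for _ in range(n_new):
        k = rng.randrange(len(boundary))
        # Uniform point in a ball: random direction from a normalized Gaussian
        # vector, radius scaled by u ** (1 / dim).
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction)) or 1.0
        r = radii[k] * rng.random() ** (1.0 / dim)
        synthetic.append(
            tuple(c0 + r * c / norm for c0, c in zip(boundary[k], direction))
        )
    return synthetic


minority = [(3.0, 3.0), (3.2, 3.1), (2.9, 3.3), (3.1, 2.8), (2.0, 2.0)]
majority = [(0.0, 0.0), (0.5, -0.2), (-0.3, 0.4), (1.0, 1.0), (0.2, 0.9)]
new_points = oversample_boundary(minority, majority, n_new=10, d_c=0.5)
```

Keeping `shrink` below 1 is one simple way to respect the abstract's requirement that the region avoid excessive overlap with the majority class; the paper may define the region differently.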

Key words

unbalanced data / CFSFDP clustering algorithm / synthetic oversampling / boundary samples

Classification

Information Technology and Security Science

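Cite this article

Leng Qiangkui, Li Zihan. Synthetic Oversampling Method Based on Fast Clustering Algorithm for Density Peaks [J]. 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2025, 19(10): 2697-2711, 15.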

Funding

This work was supported by the Research Project of Liaoning Provincial Department of Education (JYTMS20230819) and the Doctoral Research Startup Fund of Liaoning Technical University (21-1043).

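Journal: 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology)

Open access (OA); Peking University Core Journal (北大核心)

ISSN 1673-9418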
