Journal of Shenyang Aerospace University, 2024, Vol. 41, Issue (3): 71-84, 14. DOI: 10.3969/j.issn.2095-1248.2024.03.010
高维数据聚类数量可视化确定模式
Visualized determination mode for clustering quantity of high-dimensional data
Abstract
The classical K-means clustering algorithm requires users to know the number of clusters in advance, and its results are sensitive to initialization. To address both problems, a comprehensive scheme was proposed that improves the random initial partitioning of K-means and determines the number of clusters visually. Firstly, the data was standardized to make it follow a normal distribution, and the most important features were extracted by principal component analysis to reduce the dimensionality of the high-dimensional data. Then, farthest centroid selection and the min-max distance rule were used in place of the random initialization of K-means to avoid empty clusters and ensure data separability. On this basis, the statistical empirical rule was used to estimate the range of the number of clusters, and the optimal number of clusters was assessed by searching for the elbow of the sum-of-squared-error curve within this range. Finally, the silhouette coefficients of each cluster were calculated and compared to evaluate the clustering quality, thereby determining the inherent number of clusters in the data. The simulation results show that the proposed scheme can visually determine the potential number of clusters in the data and also provides an effective method for high-dimensional data analysis in the era of big data.
Key words
K-means clustering algorithm / principal component analysis / farthest centroid selection / min-max distance rule / statistical empirical rule / elbow method / silhouette analysis
Classification
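Information technology and security science
Cite this article
He Xuansen, He Fan, Fan Yueping, Chen Hongjun. Visualized determination mode for clustering quantity of high-dimensional data[J]. Journal of Shenyang Aerospace University, 2024, 41(3): 71-84, 14.
Funding
Special Project in Key Fields for Regular Higher Education Institutions of Guangdong Province (Project No. 2021ZDZX1035)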
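The initialization and elbow-search steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the exact form of the farthest centroid selection or the min-max distance rule, so the sketch assumes a common maximin variant: the first centroid is the point farthest from the data mean, and each later centroid is the point whose minimum distance to the centroids already chosen is largest. The function names (`dist2`, `maximin_init`, `kmeans`) and the candidate range `1..6` are hypothetical. Standardization, principal component analysis, the statistical empirical rule and silhouette analysis are omitted.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def maximin_init(points, k):
    """Deterministic initial centroids (assumed maximin variant).

    First centroid: the point farthest from the data mean.
    Each later centroid: the point maximizing its minimum distance
    to the centroids chosen so far.
    """
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centroids = [max(points, key=lambda p: dist2(p, mean))]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    return [list(c) for c in centroids]


def kmeans(points, k, iters=100):
    """Lloyd iterations from the maximin start; returns (labels, centroids, sse)."""
    centroids = maximin_init(points, k)
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[j])  # keep the old centroid if a cluster empties
        if updated == centroids:
            break
        centroids = updated
    sse = sum(dist2(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


if __name__ == "__main__":
    # Synthetic data: three well-separated 2-D blobs (illustrative only).
    rng = random.Random(0)
    centers = [(0, 0), (10, 0), (0, 10)]
    points = [
        (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in centers
        for _ in range(30)
    ]
    # SSE curve over a hypothetical candidate range; the elbow is read off this curve.
    for k in range(1, 7):
        print(k, round(kmeans(points, k)[2], 2))
```

Plotting the printed SSE values against K gives the curve on which the elbow is located visually. Because the start is deterministic, repeated runs on the same data produce the same curve, which is the practical benefit of replacing random initialization.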