Computer & Digital Engineering, 2023, Vol. 51, Issue (10): 2231-2235 (5 pages). DOI: 10.3969/j.issn.1672-9722.2023.10.003
基于Flink框架的K-means算法优化及并行计算策略
K-means Algorithm Optimization and Parallel Computing Strategy Based on Flink Framework
LI Zhaoxin (李召鑫) 1, MENG Xiangyin (孟祥印) 1, XIAO Shide (肖世德) 1, HU Kaifeng (胡锴沣) 1, LAI Huanjie (赖焕杰) 1
Author information
- 1. School of Mechanical Engineering, Southwest Jiaotong University, Chengdu 610031
Abstract
The K-means algorithm is widely used in machine learning and data mining because of its simple principle and good clustering performance, but it has two shortcomings. First, the number of clusters K must be specified in advance. Second, the initial cluster centers are selected at random, which may affect both the accuracy of the final clustering result and the speed of computation. Both shortcomings limit the computational efficiency of K-means. To address them, this paper proposes an optimized K-means algorithm parallelized on Flink. Building on traditional K-means, the algorithm first uses the Canopy algorithm to perform an initial clustering and obtain the number of clusters K, then uses the maximum distance algorithm to compute the initial cluster centers, and finally uses the parallel computing capability of the Flink framework to run clustering experiments on multiple data sets. The experimental results show that the proposed algorithm reduces the number of iterations of the clustering process, improves clustering accuracy to some extent, and remains computationally efficient on large-scale data sets.
Keywords
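Flink / K-means algorithm / Canopy algorithm / parallelization
Classification
Information technology and security science
Cite this article
LI Zhaoxin, MENG Xiangyin, XIAO Shide, HU Kaifeng, LAI Huanjie. K-means Algorithm Optimization and Parallel Computing Strategy Based on Flink Framework [J]. Computer & Digital Engineering, 2023, 51(10): 2231-2235.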
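The three stages named in the abstract (Canopy pre-clustering to obtain K, maximum-distance selection of the initial centers, then standard K-means iterations) can be illustrated with a minimal single-machine sketch. This is not the authors' Flink implementation. The function names, the threshold `t2`, the choice of the first point as seed, and the toy data are all assumptions made for illustration, and "maximum distance algorithm" is interpreted here as farthest-first (max-min distance) selection.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_count(points, t2):
    """Estimate K with a simplified Canopy pass.

    Assumption: only the tight threshold t2 is used. The loose threshold T1
    of the full Canopy algorithm controls overlapping membership and does
    not change how many canopies are created.
    """
    remaining = list(points)
    canopies = 0
    while remaining:
        center = remaining.pop(0)  # take the next unclaimed point as a canopy center
        canopies += 1
        # Points within t2 of the center can no longer start a canopy.
        remaining = [p for p in remaining if dist(p, center) > t2]
    return canopies


def max_distance_centers(points, k):
    """Pick k initial centers by farthest-first (max-min distance) selection."""
    centers = [points[0]]  # assumption: seed with the first point
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is farthest away.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, centers, max_iter=100):
    """Standard K-means; returns (final centers, iterations performed)."""
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # centers stopped moving
            return new_centers, iteration
        centers = new_centers
    return centers, max_iter


# Hypothetical toy data: three well-separated groups of four points each.
POINTS = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

if __name__ == "__main__":
    k = canopy_count(POINTS, t2=3.0)
    init = max_distance_centers(POINTS, k)
    final, iterations = kmeans(POINTS, init)
    print("K =", k)
    print("initial centers:", init)
    print("final centers:", sorted(final), "after", iterations, "iterations")
```

In a data-parallel engine such as Flink, the assignment step maps naturally onto a per-point map and the center update onto a keyed aggregation repeated inside an iteration. The operator layout actually used in the paper is not described on this page.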