Journal of University of Electronic Science and Technology of China, 2017, Vol. 46, Issue (1): 61-68, 8. DOI: 10.3969/j.issn.1001-0548.2017.01.010
The Parallel Implementation and Application of an Improved K-means Algorithm
Abstract
With the rapid growth of massive data, clustering, one of the core problems of big data research, faces increasing challenges such as high computational complexity and insufficient computing resources. This paper proposes an improved parallel K-means algorithm based on Hadoop. To address the tendency of the traditional K-means algorithm to converge to a local optimum because of the random choice of initial centers, we introduce the Canopy algorithm to initialize the cluster centers and apply the K-means algorithm on the canopies; clusters are then merged across canopies. The result is stable and fewer iterations are required. In addition, parallel implementation methods and strategies for the improved algorithm are presented, based on the MapReduce distributed computing model. A new text clustering method is also introduced by improving the similarity measure. The experimental results indicate the validity and scalability of the proposed method.
Keywords
canopy algorithm / Hadoop / MapReduce / parallel K-means / text clustering
Classification
Information Technology and Security Science
Cite this article
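LI Xiaoyu, YU Liying, LEI Hang, TANG Xuefei. The Parallel Implementation and Application of an Improved K-means Algorithm [J]. Journal of University of Electronic Science and Technology of China, 2017, 46(1): 61-68, 8.
Funding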
National Key Technology R&D Program of China (2012BAH87F03); Fundamental Research Funds for the Central Universities (ZYGX2014J065)
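
Illustrative sketch
The initialization idea summarized in the abstract (Canopy centers used as the starting centers for K-means, instead of random ones) can be sketched roughly as follows. This is a minimal single-machine illustration, not the authors' implementation: it does not reproduce the per-canopy K-means, the merging of clusters across canopies, the MapReduce parallelization, or the improved text similarity measure. The function names, the thresholds `t1` and `t2`, and the sample points are all hypothetical, chosen only for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2):
    """Build canopies with a loose threshold t1 and a tight threshold t2.

    Assumption for this sketch: points are taken in list order, and each
    canopy is built from the points still remaining in the candidate list.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every remaining point closer than t1 joins this canopy.
        members = [center] + [p for p in remaining if euclidean(p, center) < t1]
        canopies.append((center, members))
        # Points closer than t2 can no longer become centers themselves.
        remaining = [p for p in remaining if euclidean(p, center) >= t2]
    return canopies


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        for p in points
    ]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Plain Lloyd-style K-means started from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # Coordinate-wise mean of the points assigned to center i.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # keep a center that attracted no points
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, assign(points, centers)


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
canopies = canopy_centers(points, t1=3.0, t2=1.5)
seeds = [center for center, _ in canopies]
centers, labels = kmeans(points, seeds)
print(len(canopies), labels)
```

In this sketch the number of canopies also fixes the number of clusters, so the result depends on how `t1` and `t2` are chosen; the values above are arbitrary and would need tuning for real data.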