Computer Technology and Development, 2017, Vol. 27, Issue 9: 26-30, 5. DOI: 10.3969/j.issn.1673-629X.2017.09.006
A Single-pass K-means Clustering Algorithm with MapReduce
Abstract
Fitting K-means into the MapReduce framework can greatly improve its processing of large datasets. However, K-means reaches an acceptable clustering result only through multiple iterations. Each iteration is executed as an independent map job in which the whole dataset must be read from and written to slow disks, resulting in high I/O overhead, which is inconsistent with the design concept of the MapReduce framework. Therefore, a single-pass K-means clustering algorithm based on MapReduce, called MRSK, is proposed. It reads the data in a single pass and uses the K-means++ seeding algorithm to obtain the initial cluster centers. Following a theoretical analysis of the complexity of MRSK, a series of tests and analyses of MRSK is conducted. The experimental results show that, compared with the available MapReduce-based and stream-based K-means variants, MRSK achieves both faster execution times and higher-quality clustering results.
Keywords
MapReduce framework/data clustering/K-means++/Mahout/single-pass
Classification
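Information Technology and Security Science
Cite this article
Tang Hao (唐浩), Yang Yuwang (杨余旺), Xin Zhibin (辛智斌). A Single-pass K-means Clustering Algorithm with MapReduce [J]. Computer Technology and Development, 2017, 27(9): 26-30, 5.
Funding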
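National Natural Science Foundation of China (61640020)
Jiangsu Province Science and Technology Support Program (BE2012386, BE2011342)
Jiangsu Province Agricultural Independent Innovation Project (CX(13)3054, CX(16)1006)
Jiangsu Province Key Research and Development Program (BE2016368-1)
Shenzhen Special Fund for the Development of Strategic Emerging Industries (JCYJ20130331151710105)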