Computer Technology and Development, 2017, Vol. 27, Issue 9: 26-30, 5. DOI: 10.3969/j.issn.1673-629X.2017.09.006

A Single-pass K-means Clustering Algorithm with MapReduce

Tang Hao (唐浩) ¹, Yang Yuwang (杨余旺) ¹, Xin Zhibin (辛智斌) ²

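Author information

1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu 210094
2. Huaihai Group Industrial Co., Ltd., Changzhi, Shanxi 046000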

Abstract

Fitting K-means into the MapReduce framework can greatly improve K-means processing of large datasets. However, K-means reaches an acceptable clustering result only after multiple iterations. Each iteration is executed as an independent MapReduce job in which the whole dataset must be read from and written to slow disks, resulting in high I/O overhead, which is inconsistent with the design concept of the MapReduce framework. Therefore, a single-pass K-means clustering algorithm based on MapReduce, called MRSK, is proposed. It reads the data in a single pass and uses the K-means++ seeding algorithm to obtain the initial cluster centers. Following a theoretical analysis of the complexity of MRSK, a series of tests and analyses of MRSK is conducted. The experimental results show that, compared with the available MapReduce-based and stream-based K-means variants, MRSK achieves both shorter execution times and higher-quality clustering results.
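The K-means++ seeding step named in the abstract can be illustrated with a minimal sketch. This is the standard, single-machine form of K-means++ (D² sampling), not the MRSK implementation, whose details the abstract does not give. The function names `squared_distance` and `kmeans_pp_seed` and the `rng` parameter are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, rng=None):
    """Pick k initial centers with standard K-means++ (D^2) sampling.

    Illustrative sketch only; hypothetical helper, not the MRSK code.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center so far.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Sample the next center with probability proportional to d2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Update each point's distance to its nearest center.
        d2 = [min(old, squared_distance(p, nxt)) for p, old in zip(points, d2)]
    return centers
```

For example, `kmeans_pp_seed([(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)], 2)` returns two of the input points. Because far-away points receive larger sampling weights, the second center tends to come from the cluster that does not contain the first.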

Keywords

MapReduce framework / data clustering / K-means++ / Mahout / single-pass

Classification

Information technology and security science

Cite this article

Tang Hao, Yang Yuwang, Xin Zhibin. A Single-pass K-means Clustering Algorithm with MapReduce[J]. Computer Technology and Development, 2017, 27(9): 26-30, 5.

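Funding

National Natural Science Foundation of China (61640020)

Jiangsu Province Science and Technology Support Program (BE2012386, BE2011342)

Jiangsu Province Agricultural Independent Innovation Project (CX(13)3054, CX(16)1006)

Jiangsu Province Key Research and Development Program (BE2016368-1)

Shenzhen Special Fund for the Development of Strategic Emerging Industries (JCYJ20130331151710105)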

Computer Technology and Development

OA / CSTPCD

ISSN: 1673-629X
