Computer Engineering and Applications (计算机工程与应用), 2019, Vol. 55, Issue (1): 161-167, 270, 8. DOI: 10.3778/j.issn.1002-8331.1709-0161
RDD上扩展索引层优化的分布式K-means算法
Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer
MA Jing (马菁) 1, LI Li (李力) 2
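Author information
- 1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
- 2. Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China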
Abstract
K-means is a classical clustering algorithm. In recent years, distributed computing has been used to improve its scalability on large-scale data. However, traditional disk-based distributed systems still incur a large amount of I/O consumption. This paper extends MLlib, the machine learning library of Spark, a memory-based distributed platform with low I/O consumption and a good fault-tolerance mechanism. It develops an RDD-based index layer that uses a two-level index comprising multiple index strategies, including R-tree, quad tree, KD-tree and Ball tree. By overriding the data partitioning methods, the paper optimizes the K-means algorithm by preprocessing points that are spatially close to one another. The index layer stores summary information about each corresponding point set so that search-space pruning can be applied in the K-means algorithm. The experimental results show that the index layer prunes the search space by more than 40% and improves efficiency by 21% compared with the naïve distributed K-means algorithm.
Key words
K-means/clustering algorithm/Spark/R-tree/quad tree/KD-tree/Ball-tree
Classification
Information Technology and Security Science
Cite this article
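MA Jing, LI Li. Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer [J]. Computer Engineering and Applications, 2019, 55(1): 161-167, 270, 8.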
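
The search-space pruning mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the paper's actual pruning rule, so the code below is an assumption: it uses a simple ball summary (center and radius) per group of nearby points and a standard triangle-inequality test. It is plain Python rather than Spark or MLlib code, and the names `build_index` and `assign_with_pruning` are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(groups):
    """Summarize each group of nearby points as (center, radius, points).

    Assumption: a ball summary stands in for the index-layer entry;
    the paper's real index (R-tree, quad tree, KD-tree, Ball tree) is richer.
    """
    index = []
    for points in groups:
        dim = len(points[0])
        center = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
        radius = max(dist(center, p) for p in points)
        index.append((center, radius, points))
    return index

def nearest(point, centroids):
    """Index of the centroid closest to `point` (brute force)."""
    return min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))

def assign_with_pruning(index, centroids):
    """One K-means assignment step; needs at least two centroids.

    Returns (labels per group, number of points assigned without
    computing any per-point distance).
    """
    labels, pruned = [], 0
    for center, radius, points in index:
        ranked = sorted((dist(center, c), j) for j, c in enumerate(centroids))
        (d1, j1), (d2, _) = ranked[0], ranked[1]
        # Triangle inequality: every point p in the ball satisfies
        # dist(p, c_j1) <= d1 + radius and dist(p, c_j) >= d2 - radius
        # for any other centroid j, so the whole group belongs to j1.
        if d1 + radius < d2 - radius:
            labels.append([j1] * len(points))
            pruned += len(points)
        else:
            labels.append([nearest(p, centroids) for p in points])
    return labels, pruned

# Illustrative data (made up for this sketch, not from the paper).
groups = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(10, 10), (10, 11), (11, 10), (11, 11)],
    [(4, 5), (7, 6)],
]
centroids = [(0.5, 0.5), (10.5, 10.5)]
index = build_index(groups)
labels, pruned = assign_with_pruning(index, centroids)
print(labels, pruned)
```

In this sketch the summaries are built once and reused in every iteration, which is the role the abstract assigns to the index layer; only groups that straddle two centroids fall back to per-point distance computations.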