Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer

Ma Jing¹, Li Li²

Computer Engineering and Applications, 2019, Vol. 55, Issue (1): 161-167, 270, 8. DOI: 10.3778/j.issn.1002-8331.1709-0161

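Author information

1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240
2. Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025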

Abstract

K-means is a classical clustering algorithm. In recent years, distributed computing has been used to improve its scalability on large-scale data. However, traditional disk-based distributed systems still incur a large amount of I/O. This paper extends MLlib, the machine learning library of Spark, a memory-based distributed platform with low I/O consumption and a good fault-tolerance mechanism. It develops an RDD-based index layer that uses a two-level index comprising multiple index strategies, including R-tree, quad tree, KD-tree and Ball tree. By overriding the data partitioning methods, the paper optimizes the K-means algorithm by preprocessing points that are spatially close to one another. The index layer stores summary information about the corresponding point set so that search-space pruning can be applied in the K-means algorithm. The experimental results show that the index layer can prune the search space by more than 40% and improve efficiency by 21% compared with the naive distributed K-means algorithm.
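
The search-space pruning mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation (which extends Spark MLlib and RDD partitioning); it is plain Python under the assumption that each index node summarizes its block of nearby points as a bounding ball (a pivot and a radius), in the spirit of a Ball tree. All function names here are hypothetical. A centroid is skipped for the whole block when even its best-case distance to the ball exceeds the worst-case distance of the closest centroid.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def summarize(block):
    """Summary stored by an index node (assumed): pivot and radius of a bounding ball."""
    dim = len(block[0])
    pivot = tuple(sum(p[i] for p in block) / len(block) for i in range(dim))
    radius = max(dist(pivot, p) for p in block)
    return pivot, radius

def candidate_centroids(pivot, radius, centroids):
    """Indices of centroids that could be nearest to some point in the ball.

    For any point x in the ball and centroid c at distance d from the pivot,
    d - radius <= dist(x, c) <= d + radius (triangle inequality). A centroid
    whose lower bound exceeds the upper bound of the closest centroid cannot
    be the nearest one for any point in the block, so it is pruned.
    """
    d = [dist(pivot, c) for c in centroids]
    upper_best = min(d) + radius
    return [j for j, dj in enumerate(d) if dj - radius <= upper_best]

def assign_block(block, centroids):
    """Assign each point in the block to its nearest surviving centroid."""
    pivot, radius = summarize(block)
    cands = candidate_centroids(pivot, radius, centroids)
    labels = [min(cands, key=lambda j: dist(p, centroids[j])) for p in block]
    return labels, len(cands)

# Usage: a tight block near the first centroid, two centroids far away.
block = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
centroids = [(0.5, 0.5), (10.0, 10.0), (10.0, 0.0)]
labels, examined = assign_block(block, centroids)
print(labels, examined)
```

In this toy case the two distant centroids are pruned for the entire block, so each point is compared against a single centroid instead of three. How much is pruned in practice depends on how tightly the partitioning groups nearby points, which is why the abstract pairs the index layer with spatially aware partitioning.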

Key words

K-means/clustering algorithm/Spark/R-tree/quad tree/KD-tree/Ball-tree

Classification

Information Technology and Security Science

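Cite this article

Ma Jing, Li Li. Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer[J]. Computer Engineering and Applications, 2019, 55(1): 161-167, 270, 8.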

Computer Engineering and Applications

OA / Peking University Core / CSCD / CSTPCD

ISSN: 1002-8331
