A Clustering Algorithm for Categorical Data and Its Efficient Parallel Implementation

Ding Xiangwu, Tan Jia, Wang Mei

Computer Applications and Software, 2017, Vol. 34, Issue 7: 249-256, 8.

DOI: 10.3969/j.issn.1000-386x.2017.07.046

A CLUSTERING ALGORITHM OF CATEGORICAL DATA AND ITS EFFICIENT PARALLEL IMPLEMENTATION

Ding Xiangwu¹, Tan Jia¹, Wang Mei¹

Author information

1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China

Abstract

For clustering large-scale, high-dimensional, sparse categorical data, CLOPE offers substantially better clustering quality and running speed than traditional clustering algorithms. However, CLOPE has several defects: its clustering quality is unstable, it does not distinguish how much each attribute dimension contributes to the clustering, and it requires the repulsion factor r to be specified in advance. This paper therefore proposes RW-CLOPE, a clustering algorithm for categorical data based on random sequence iteration and attribute weights. RW-CLOPE uses a "shuffle" model to reorder the raw data randomly, eliminating the effect of the data input order on clustering quality. In addition, a method for computing attribute weights based on information entropy is proposed to distinguish the clustering contribution of each dimension, which greatly improves clustering quality. Finally, RW-CLOPE is implemented on Spark, an efficient cluster computing platform. Experiments on three real datasets show that RW-CLOPE achieves better clustering quality than p-CLOPE when the number of shuffled datasets is the same. On the mushrooms dataset, when CLOPE achieves its optimal result, RW-CLOPE achieves a profit value 68% higher than that of CLOPE and 25% higher than that of p-CLOPE. The execution time of RW-CLOPE is much shorter than that of p-CLOPE when processing massive data. When computational resources are sufficient, the more datasets are involved, the more pronounced the improvement in execution time.
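The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use Spark. `clope_profit` follows the standard CLOPE profit criterion (total item occurrences S, distinct items W, repulsion r); `attribute_entropy` computes the Shannon entropy of one attribute; `shuffled_copies` produces randomly reordered copies of the input. All function names are hypothetical, and how RW-CLOPE turns entropy into attribute weights is not stated in the abstract, so no weighting formula is attempted here.

```python
import math
import random
from collections import Counter


def clope_profit(clusters, r):
    """Standard CLOPE profit: sum(S_i / W_i**r * n_i) / sum(n_i).

    clusters: list of clusters, each a list of transactions (tuples of items).
    r: repulsion factor.
    """
    numerator = 0.0
    total = 0
    for transactions in clusters:
        if not transactions:
            continue  # empty clusters contribute nothing
        counts = Counter(item for t in transactions for item in t)
        size = sum(counts.values())  # S: total item occurrences in the cluster
        width = len(counts)          # W: number of distinct items in the cluster
        n = len(transactions)        # number of transactions in the cluster
        numerator += size / width ** r * n
        total += n
    return numerator / total if total else 0.0


def attribute_entropy(rows, column):
    """Shannon entropy (in bits) of the values found in one attribute column."""
    counts = Counter(row[column] for row in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def shuffled_copies(rows, k, seed=0):
    """Return k randomly reordered copies of rows (assumed 'shuffle' step)."""
    rng = random.Random(seed)
    copies = []
    for _ in range(k):
        copy = list(rows)
        rng.shuffle(copy)
        copies.append(copy)
    return copies


# Five toy transactions grouped in two different ways.
grouping_a = [[("a", "b"), ("a", "b", "c"), ("a", "c", "d")],
              [("d", "e"), ("d", "e", "f")]]
grouping_b = [[("a", "b"), ("a", "b", "c")],
              [("a", "c", "d"), ("d", "e"), ("d", "e", "f")]]
profit_a = clope_profit(grouping_a, r=2)
profit_b = clope_profit(grouping_b, r=2)
```

With r = 2, the first grouping scores the higher profit because its clusters share more items. Running a CLOPE pass on each shuffled copy and keeping the result with the highest profit is one plausible reading of the random sequence iteration the abstract describes.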

Key words

Categorical data / CLOPE / p-CLOPE / RW-CLOPE / Spark

Classification

Information Technology and Security Science

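Cite this article

Ding Xiangwu, Tan Jia, Wang Mei. A clustering algorithm for categorical data and its efficient parallel implementation [J]. Computer Applications and Software, 2017, 34(7): 249-256, 8.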

Funding

Shanghai Informatization Development Fund Project (XX-XXFZ-05-16-0139).

Computer Applications and Software

Open access; indexed in the Peking University Core Journals list and CSTPCD

ISSN: 1000-386X
