Computer Applications and Software (计算机应用与软件), 2017, Vol. 34, Issue 7: 249-256 (8 pages). DOI: 10.3969/j.issn.1000-386x.2017.07.046
A Clustering Algorithm of Categorical Data and Its Efficient Parallel Implementation
Abstract
For clustering large-scale, high-dimensional, sparse categorical data, CLOPE offers considerably better clustering quality and running speed than traditional clustering algorithms. However, CLOPE has several defects: its clustering quality is unstable, it does not distinguish how much each attribute dimension contributes to the clustering, and it requires the repulsion factor r to be specified in advance. This paper therefore proposes a clustering algorithm for categorical data based on random-sequence iteration and attribute weights (RW-CLOPE). RW-CLOPE uses a "shuffle" model to randomly reorder the raw data, which eliminates the effect of the data input order on clustering quality. In addition, a method for calculating attribute weights based on information entropy is proposed to distinguish the clustering contribution of each attribute dimension, which greatly improves clustering quality. Finally, the RW-CLOPE algorithm is implemented on Spark, an efficient cluster computing platform. Experiments on three different real-world datasets show that RW-CLOPE achieves better clustering quality than p-CLOPE when both use the same number of shuffled datasets. On the mushrooms dataset, at the setting where CLOPE achieves its optimal result, RW-CLOPE achieves a profit value 68% higher than CLOPE and 25% higher than p-CLOPE. When processing massive data, the execution time of RW-CLOPE is much shorter than that of p-CLOPE. When computational resources are sufficient, the larger the number of shuffled datasets, the more pronounced the improvement in execution time.
Keywords
Categorical data / CLOPE / p-CLOPE / RW-CLOPE / Spark
Classification
Information Technology and Security Science
Cite this article
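Ding Xiangwu (丁祥武), Tan Jia (谭佳), Wang Mei (王梅). A clustering algorithm of categorical data and its efficient parallel implementation [J]. Computer Applications and Software, 2017, 34(7): 249-256.
Funding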
Shanghai Informatization Development Fund Project (XX-XXFZ-05-16-0139).
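Illustrative sketch
The quantities named in the abstract can be made concrete with a minimal Python sketch. It is not the authors' implementation. The `profit` function follows the profit criterion as defined in the original CLOPE paper (cluster size S, width W, repulsion r). `attribute_entropy` computes plain Shannon entropy of one attribute. How RW-CLOPE turns entropy into attribute weights is not stated in the abstract, so no weighting formula is shown. All function names and the `seed` parameter are hypothetical.

```python
import math
import random
from collections import Counter


def profit(clusters, r=2.0):
    """CLOPE profit of a clustering.

    `clusters` is a list of clusters; each cluster is a list of
    transactions, and each transaction is a set of items.
    """
    weighted_sum = 0.0
    total_transactions = 0
    for cluster in clusters:
        if not cluster:
            continue
        size = sum(len(t) for t in cluster)      # S(C): total item occurrences
        width = len(set().union(*cluster))       # W(C): number of distinct items
        weighted_sum += size / (width ** r) * len(cluster)
        total_transactions += len(cluster)
    return weighted_sum / total_transactions if total_transactions else 0.0


def attribute_entropy(rows, col):
    """Shannon entropy (in bits) of the values in column `col`."""
    counts = Counter(row[col] for row in rows)
    n = len(rows)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def shuffled_copies(rows, n_copies, seed=0):
    """Return `n_copies` independently shuffled copies of `rows`.

    Assumption: this only mimics the idea of the "shuffle" model, namely
    clustering several random input orders and keeping the best result.
    """
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        copy = list(rows)
        rng.shuffle(copy)
        copies.append(copy)
    return copies
```

With these definitions, `profit([[{"a", "b"}, {"a", "b"}]], r=2.0)` evaluates to 1.0, while placing two disjoint transactions in one cluster, `profit([[{"a", "b"}, {"c", "d"}]], r=2.0)`, gives 0.25. A larger r penalizes wide clusters more strongly, which is why the choice of r matters for clustering quality.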