Abstract
The traditional k-means algorithm requires the k value to be supplied as input, and it is difficult to apply to ultra-large-scale data sets. To address both problems, and building on previous studies, information entropy is introduced into the distance calculation and a data sampling method is adopted: optimal samples are extracted from the ultra-large-scale data set and clustered. Validity indexes are then evaluated on the sample clustering to obtain the k value required by the algorithm. Finally, the information-entropy distance formula is used to cluster the full ultra-large-scale data set. Experiments show that the algorithm overcomes the traditional k-means algorithm's need for a user-supplied k value and automatically obtains the k value for ultra-large-scale data clustering without affecting the quality of the data clustering.
Key words
k-means algorithm/information entropy/optimal sample extraction/validity index
Classification
Information technology and security science