Computer Engineering and Applications, 2018, Vol. 54, Issue 10: 11-18, 8. DOI: 10.3778/j.issn.1002-8331.1802-0108
Research on text clustering for selecting initial cluster center based on Cosine distance
Abstract
Text clustering is an important means of organizing, summarizing, and navigating text information effectively, and the K-means algorithm based on cosine similarity is one of the most widely used methods. The cosine-similarity K-means algorithm is difficult to improve, and many excellent K-means improvements based on Euclidean distance cannot be applied to it directly. To address this problem, the relationship between cosine similarity and Euclidean distance is examined, and a conversion formula between the two is obtained for standardized vectors. A cosine distance is then defined that is close to the Euclidean distance, so that improved K-means methods originally based on Euclidean distance can be transformed into cosine-similarity K-means algorithms through the cosine distance. On this basis, the method for computing cluster centers in the cosine K-means algorithm is derived, and the initial point selection scheme is further improved, forming a new text clustering algorithm, MCSKM++. Experimental results show that the algorithm improves clustering accuracy while reducing the number of iterations and shortening the running time.
Keywords
text clustering/K-means algorithm/cosine similarity/cosine distance/initial point selection
Classification
Information Technology and Security Science
Cite this article
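WANG Binyu, LIU Wenfen, HU Xuexian, WEI Jianghong. Research on text clustering for selecting initial cluster center based on Cosine distance [J]. Computer Engineering and Applications, 2018, 54(10): 11-18, 8.
Funding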
National Natural Science Foundation of China (No.61502527, No.61702549).
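Illustrative sketch

The abstract names three ingredients without giving their formulas: a conversion between cosine similarity and Euclidean distance, a cluster-center computation for cosine K-means, and an improved initial point selection. The sketch below is not the paper's MCSKM++ and rests on the following assumptions. The cosine distance is taken to be 1 - cos(a, b), which for unit-length vectors equals half the squared Euclidean distance, since ||x - y||^2 = 2(1 - cos(x, y)). The center is the normalized mean used in spherical K-means. The seeding is the standard k-means++ rule, which samples in proportion to squared Euclidean distance. All function names are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale v to unit Euclidean length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """Assumed definition: 1 - cos(a, b), clamped at 0 against rounding error.

    For unit vectors this is half the squared Euclidean distance.
    """
    return max(0.0, 1.0 - cosine_similarity(a, b))


def spherical_center(points):
    """Normalized mean of a cluster (the spherical K-means center).

    Assumes the mean is non-zero, which holds for non-negative
    term-weight vectors.
    """
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return normalize(mean)


def seed_centers(points, k, rng):
    """k-means++-style seeding using the assumed cosine distance.

    Each new center is sampled with probability proportional to its
    cosine distance from the nearest center chosen so far. Assumes the
    data contain at least k distinct directions.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(cosine_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Under these assumptions, weighting by cosine distance is the same as weighting by squared Euclidean distance on the normalized vectors, because the constant factor of 2 cancels in the sampling probabilities. That is the sense in which a Euclidean-distance seeding rule carries over to cosine-similarity K-means.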