Computer Technology and Development, 2019, Vol. 29, Issue 2: 39-42,4. DOI: 10.3969/j.issn.1673-629X.2019.02.008
An Improved News Text Clustering Algorithm Based on MinHash
Abstract
The continuous development of information technology has brought rapid growth in the volume of news text on the Internet. Faced with so many news texts, clustering them effectively is very important. To meet this need, we propose an improved DBSCAN clustering algorithm based on MinHash. Traditional text clustering based on the vector space model suffers from high data dimensionality, high computational complexity, and large resource consumption. To address these problems, the algorithm uses MinHash to reduce the dimensionality of the feature-word sets of all texts, which reduces wasted resources. The Jaccard coefficient is then computed for every pair of texts in the resulting feature matrix. Each result is compared with the DBSCAN neighborhood radius Eps, and the number of neighbors that fall within the Eps neighborhood is checked against MinPts. This determines whether a text is a core point and whether a cluster can be formed. Experiments show that the algorithm achieves good results on news text clustering and can effectively cluster the complex, varied news text found on the Internet.
Keywords
MinHash / Jaccard coefficient / DBSCAN / text clustering
Classification
Information Technology and Security Science
Cite this article
王安瑾. An Improved News Text Clustering Algorithm Based on MinHash [J]. Computer Technology and Development, 2019, 29(2): 39-42,4.
Funding
National Natural Science Foundation of China (61472075)
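
The pipeline summarized in the abstract (MinHash signatures, pairwise Jaccard estimates, then the DBSCAN core-point test against Eps and MinPts) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the salted BLAKE2b hash family, the signature length `num_perm`, and the choice to apply Eps to the Jaccard distance (1 minus the estimated similarity) are all assumptions made for illustration, since the abstract does not specify them. Each text is assumed to be already reduced to a non-empty set of feature words.

```python
import hashlib
from collections import deque


def minhash_signature(tokens, num_perm=64):
    """Return a MinHash signature for a non-empty set of feature words.

    Assumption: each of the num_perm hash functions is BLAKE2b with a
    different salt; the paper's actual hash family is not stated.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate the Jaccard coefficient as the fraction of matching positions."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def dbscan(signatures, eps, min_pts):
    """Cluster signatures with DBSCAN; returns one label per text (-1 = noise).

    Assumption: two texts are neighbors when their estimated Jaccard
    distance (1 - similarity) is at most eps. A text counts itself as a neighbor.
    """
    n = len(signatures)
    neighbors = [
        [j for j in range(n)
         if 1.0 - estimated_jaccard(signatures[i], signatures[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        labels[i] = cluster  # i is a core point and starts a new cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also a core point: keep expanding
        cluster += 1
    return labels
```

To use the sketch, compute one signature per text with `minhash_signature` and pass the list to `dbscan` together with chosen values of `eps` and `min_pts`. Because signatures have a fixed length, each pairwise comparison costs the same regardless of vocabulary size, which is the dimensionality reduction the abstract refers to.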