An Improved News Text Clustering Algorithm Based on MinHash

Wang Anjin

Computer Technology and Development, 2019, Vol. 29, Issue 2: 39-42, 4. DOI: 10.3969/j.issn.1673-629X.2019.02.008

Author information

  • 1. School of Computer Science and Technology, Donghua University, Shanghai 200000, China

Abstract

The continuous development of information technology has led to rapid growth in the volume of news text on the Internet, which makes it important to cluster that text effectively. To meet this need, we propose an improved DBSCAN clustering algorithm based on MinHash. Traditional text clustering built on the vector space model suffers from high data dimensionality, high computational complexity, and heavy resource consumption. The proposed algorithm addresses these problems by using MinHash to reduce the dimensionality of the feature-word sets of all texts, which lowers resource consumption. The Jaccard coefficient is then computed for every pair of texts in the resulting feature matrix, each result is compared with the DBSCAN neighborhood radius Eps, and the number of neighboring points that fall within Eps is checked against MinPts. This determines whether a text is a core point and whether a cluster can be formed. Experiments show that the algorithm achieves better results on news text clustering and can effectively cluster the varied news text found on the Internet.
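
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the hash family (CRC32 followed by random linear hashes modulo a prime), the number of hash functions, and the use of 1 minus the estimated Jaccard coefficient as the DBSCAN distance are all hypothetical choices, since the abstract does not specify them.

```python
import random
import zlib

PRIME = 4294967311  # a prime slightly larger than 2**32 (assumed modulus)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(words, params):
    """Compress a non-empty set of feature words into a fixed-length signature."""
    codes = [zlib.crc32(w.encode("utf-8")) for w in set(words)]
    return [min((a * c + b) % PRIME for c in codes) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates the Jaccard coefficient."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def dbscan(signatures, eps, min_pts):
    """Plain DBSCAN over signatures; distance = 1 - estimated Jaccard.

    Returns one label per text: a cluster index, or -1 for noise.
    A point counts itself as a member of its own neighborhood.
    """
    n = len(signatures)
    neighbors = [
        [j for j in range(n)
         if 1.0 - estimated_jaccard(signatures[i], signatures[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = unvisited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # not a core point; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # j is a core point, keep expanding
                queue.extend(neighbors[j])
    return labels


# Hypothetical toy input: each text is already reduced to its feature words.
docs = [
    ["stock", "market", "shares", "trading", "index"],
    ["stock", "market", "shares", "trading", "bank"],
    ["match", "goal", "league", "coach", "striker"],
    ["match", "goal", "league", "coach", "keeper"],
]
params = make_hash_params(num_hashes=128)
signatures = [minhash_signature(d, params) for d in docs]
print(dbscan(signatures, eps=0.6, min_pts=2))
```

Each signature has a fixed length regardless of vocabulary size, which is where the dimensionality reduction comes from. The values of `eps`, `min_pts`, and `num_hashes` above are arbitrary and would need tuning on real news data.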

Key words

MinHash/Jaccard coefficient/DBSCAN/text-clustering

Classification

Information Technology and Security Science

Cite this article

Wang Anjin. An Improved News Text Clustering Algorithm Based on MinHash[J]. Computer Technology and Development, 2019, 29(2): 39-42, 4.

Funding

National Natural Science Foundation of China (61472075)

Journal: Computer Technology and Development (OA, CSTPCD), ISSN 1673-629X