Computer Technology and Development (计算机技术与发展), 2016, Vol. 26, Issue 9: 1-7. DOI: 10.3969/j.issn.1673-629X.2016.09.001
基于隐含语义分析的在线新闻话题发现方法
Online News Topics Extraction Based on Latent Semantic Analysis
Abstract
With the rapid development of the Internet and the continuous growth of massive data, identifying current news topics quickly and effectively has become an urgent demand, and online hot news topic detection has become a hot research area. For online news streams, the dimensionality of the traditional Vector Space Model (VSM) grows as data accumulates, resulting in pronounced data sparsity and synonymy problems that make it difficult to compute text similarity quickly and accurately. Latent semantic analysis based on weighted features is used to map the sparse, high-dimensional word-document matrix into a latent k-dimensional semantic space, making full use of the semantic information between words and documents to increase the semantic similarity between documents on the same subject and to overcome the sparsity and synonymy of Internet text. In addition, traditional clustering algorithms suffer from high time complexity and input dependency on ever-growing massive news data, which makes it difficult to obtain the expected results quickly and efficiently. A Single-pass online clustering algorithm is used to detect topic clusters based on the temporal succession and correlation of news, and the concept of topic heat is introduced to screen news topics by public attention. Experiments show that the proposed method can effectively improve the accuracy of topic detection.
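The Single-pass clustering step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each document has already been projected into the k-dimensional latent space (for example, by a truncated SVD of a weighted term-document matrix), so the inputs are plain dense vectors. The function names, the running-mean centroid update, and the `threshold` value are illustrative assumptions, not details taken from the paper; topic heat is not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(vectors, threshold=0.8):
    """Cluster vectors in arrival order with one pass over the stream.

    Each vector joins the most similar existing cluster if the cosine
    similarity to that cluster's centroid reaches `threshold`; otherwise
    it opens a new cluster. `threshold` is a hypothetical setting.
    Returns (labels, centroids).
    """
    centroids, sizes, labels = [], [], []
    for vec in vectors:
        best, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best >= 0 and best_sim >= threshold:
            # Update the centroid as the running mean of its members.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], vec)]
            sizes[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids


# Toy 2-dimensional "latent" vectors standing in for projected news documents.
docs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 0.9], [1.0, 0.0]]
labels, centroids = single_pass(docs, threshold=0.8)
print(labels)  # [0, 0, 1, 1, 0]
```

Because every document is compared only against the current centroids, the cost per arriving document depends on the number of clusters rather than the number of documents seen so far, which is what makes this style of clustering attractive for a news stream. The result does depend on arrival order and on the chosen threshold.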
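Keywords
topic detection (话题发现) / vector space model (向量空间模型) / latent semantic analysis (隐含语义分析) / text clustering (文本聚类) / singular value decomposition (奇异值分解)
Classification
Information Technology and Security Science
Cite this article
Wu Gaomin, Zhang Yuchen, Han Jingyu. Online News Topics Extraction Based on Latent Semantic Analysis [J]. Computer Technology and Development, 2016, 26(9): 1-7.
Funding
Key Program of the National Natural Science Foundation of China (61003040, 61100135, 61302157)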