Computer Applications and Software (计算机应用与软件), 2017, Vol. 34, Issue 12: 47-52, 6. DOI: 10.3969/j.issn.1000-386x.2017.12.009
Microblogging Topic Detection Based on the Word Distributed Representation (基于词向量的微博话题发现方法)
Abstract
To address the short length, colloquial style, and large volume of microblog text, a new topic detection method based on distributed word representations is proposed. Word vectors are trained on crawled experimental data combined with a Chinese corpus. A text word-vector model is then defined to obtain the word-vector representation of each text. Compared with the traditional vector space model, the word-vector representation addresses the sparsity and high dimensionality of short microblog texts as well as the loss of text semantic information. An improved Canopy algorithm first performs fuzzy clustering of the texts, and the data within each Canopy are then clustered with the K-means algorithm. Experiments show that the comprehensive index of the proposed method is 4% higher than that of the Single-Pass algorithm, which supports the validity and accuracy of the method.
Keywords
Topic detection / Word distributed representation / Short text / Canopy clustering
Classification
Information technology and security science
Cite this article
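Li Shuaibin (李帅彬), Li Yaxing (李亚星), Feng Xupeng (冯旭鹏), Liu Lijun (刘利军), Huang Qingsong (黄青松). Microblogging topic detection based on the word distributed representation [J]. Computer Applications and Software, 2017, 34(12): 47-52, 6.
Funding
National Natural Science Foundation of China (81360230, 81560296).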
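
The pipeline summarized in the abstract (word vectors, one vector per text, Canopy, then K-means) can be illustrated with a minimal sketch. Several things here are assumptions and not taken from the paper: the text vector is taken to be the plain average of its word vectors, the Canopy pass is the standard two-threshold version and not the authors' improved variant, the Canopy centers are used to seed K-means (one common way of combining the two), and the toy vocabulary, thresholds `t1`/`t2`, and function names are hypothetical.

```python
import numpy as np

def text_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens (assumed text model); zeros if none known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def canopy_centers(points, t1, t2):
    """Standard Canopy pass with loose threshold t1 and tight threshold t2 (t1 > t2 > 0).

    Returns (center, member_indices) pairs; canopies may overlap.
    """
    assert t1 > t2 > 0
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = points[remaining[0]]
        dists = {i: np.linalg.norm(points[i] - center) for i in remaining}
        members = [i for i in remaining if dists[i] < t1]
        # Points inside the tight threshold can no longer seed a new canopy.
        remaining = [i for i in remaining if dists[i] >= t2]
        canopies.append((center, members))
    return canopies

def kmeans(points, init_centers, n_iter=20):
    """Plain Lloyd iterations starting from the given centers."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every center: shape (n_points, k).
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    # Final assignment against the updated centers.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), centers

# Hypothetical 2-dimensional word vectors and already-tokenized posts.
word_vectors = {
    "rain": np.array([1.0, 0.0]),
    "storm": np.array([0.9, 0.1]),
    "goal": np.array([0.0, 1.0]),
    "match": np.array([0.1, 0.9]),
}
posts = [["rain", "storm"], ["storm"], ["goal", "match"], ["match", "unknown"]]

X = np.array([text_vector(p, word_vectors, dim=2) for p in posts])
canopies = canopy_centers(X, t1=0.6, t2=0.3)
labels, centers = kmeans(X, [c for c, _ in canopies])
print(len(canopies), labels.tolist())
```

In this sketch the Canopy pass fixes the number of clusters and their starting points, so K-means does not need a hand-picked k. In practice the word vectors would come from a trained model and the thresholds would need tuning on real data.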