A Microblog Topic Detection Method Based on Word Vectors

Li Shuaibin, Li Yaxing, Feng Xupeng, Liu Lijun, Huang Qingsong

Computer Applications and Software, 2017, Vol. 34, Issue 12: 47-52, 6. DOI: 10.3969/j.issn.1000-386x.2017.12.009

MICROBLOGGING TOPIC DETECTION BASED ON THE WORD DISTRIBUTED REPRESENTATION

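Author information

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China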

Abstract

To address the characteristics of microblog text (short length, colloquial language, and large data volume), a new topic detection method based on distributed word representations is proposed. We trained word vectors on the crawled experimental data combined with a Chinese corpus. We then obtained a vector representation of each text by defining a text word vector model. Compared with the traditional vector space model, the word vector representation alleviates the sparsity and high dimensionality of microblog short texts and reduces the loss of semantic information. We used an improved Canopy algorithm to perform a fuzzy (coarse) clustering of the texts, and then clustered the data within each Canopy with the K-means algorithm. Experiments showed that the comprehensive index of the proposed method was 4% higher than that of the Single-Pass algorithm. The experimental results demonstrate the validity and accuracy of the proposed method.
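The pipeline summarized in the abstract (word vectors, then a text vector, then Canopy, then K-means inside each Canopy) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not define the text word vector model, so the sketch simply averages word vectors; it uses the textbook Canopy procedure rather than the paper's improved variant; and the toy 2-dimensional vectors, the thresholds `t1` and `t2`, and all function names are hypothetical.

```python
import math


def text_vector(tokens, word_vectors):
    """Average the vectors of known tokens (an assumed text model, not the paper's)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a word vector")
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def canopy(points, t1, t2):
    """Textbook Canopy: loose threshold t1, tight threshold t2, with t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        dist = {i: cosine_distance(points[center], points[i]) for i in remaining}
        # Every remaining point within t1 of the center joins this canopy.
        canopies.append([i for i in remaining if i == center or dist[i] <= t1])
        # Points within t2 of the center can no longer start or join a canopy.
        remaining = [i for i in remaining if i != center and dist[i] > t2]
    return canopies


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means (squared Euclidean), seeded with the first k points."""
    k = min(k, len(points))
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            group = [p for p, label in zip(points, labels) if label == c]
            if group:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(column) / len(group) for column in zip(*group)]
    return labels


def canopy_then_kmeans(points, t1, t2, k):
    """Run K-means separately inside each canopy; return clusters as index lists."""
    topics = []
    for members in canopy(points, t1, t2):
        labels = kmeans([points[i] for i in members], k)
        for c in range(min(k, len(members))):
            cluster = [i for i, label in zip(members, labels) if label == c]
            if cluster:
                topics.append(cluster)
    return topics


# Made-up 2-dimensional "word vectors"; real ones would come from a trained model.
word_vectors = {
    "football": [1.0, 0.1], "match": [0.9, 0.2], "goal": [0.95, 0.0],
    "rain": [0.1, 1.0], "storm": [0.0, 0.9], "cloud": [0.2, 0.95],
}
docs = [["football", "match"], ["goal", "match", "unknown"],
        ["rain", "storm"], ["cloud", "rain"]]
vectors = [text_vector(d, word_vectors) for d in docs]
topics = canopy_then_kmeans(vectors, t1=0.5, t2=0.3, k=1)
print(topics)
```

Because a point that lies between `t2` and `t1` of a center stays in the candidate list, canopies may overlap, so in this sketch one text can end up in more than one topic cluster. The cheap Canopy pass limits the expensive K-means step to texts that are already roughly similar.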

Key words

Topic detection/Word distributed representation/Short text/Canopy cluster

Classification

Information Technology and Security Science

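Cite this article

Li Shuaibin, Li Yaxing, Feng Xupeng, Liu Lijun, Huang Qingsong. A microblog topic detection method based on word vectors [J]. Computer Applications and Software, 2017, 34(12): 47-52, 6.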

Funding

National Natural Science Foundation of China (81360230, 81560296).

Computer Applications and Software

Open access; indexed in the Peking University Core Journals list and CSTPCD

ISSN: 1000-386X
