
Diet Health Text Classification Based on word2vec and LSTM

Zhao Ming, Du Huifang, Dong Cuicui, Chen Changsong

农业机械学报 (Transactions of the Chinese Society for Agricultural Machinery), 2017, Vol. 48, Issue 10: 202-208, 7. DOI: 10.6041/j.issn.1000-1298.2017.10.025

Zhao Ming (1), Du Huifang (1), Dong Cuicui (1), Chen Changsong (2)

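Author information

  • 1. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083
  • 2. The Third Research Institute of the Ministry of Public Security, Shanghai 200031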

Abstract

The growth of the Internet has produced a rapidly increasing volume of online information. Text is the main form of this information, and diet-related text is correspondingly abundant. Because dietary information is closely related to people's health, classifying such text automatically helps people make effective use of information on healthy eating. To classify diet text efficiently, a classification model based on word2vec and LSTM was proposed. Based on the characteristics of food entries in encyclopedias and diet texts on health websites, word2vec was used to produce word embeddings that carry semantic information, which addressed the sparse representation and curse of dimensionality faced by traditional methods. Word2vec combined with K-means++ was used to cluster keywords for both the "suitable" and the "avoid" categories in order to expand the related words in the classification dictionaries. These words were used to formulate rules that improved the quality of the training data. Document vectors were then constructed from word2vec as the initial input to a long short-term memory (LSTM) network. LSTM placed the input layer and hidden layers of the neural network inside a protected memory cell. Through its "gate" structure, the sigmoid and tanh functions removed information from or added information to the cell state, giving the LSTM model a "memory" that made good use of textual context, which was significant for text classification. Experiments were performed with 48,000 documents. The classification accuracy was 98.08%, higher than that of methods based on tf-idf and bag-of-words text vector representations. Two other classification algorithms, support vector machine (SVM) and convolutional neural network (CNN), both based on word2vec, were also evaluated. The results showed that the proposed model outperformed these competing methods by several percentage points, indicating that the method can automatically classify dietary texts with high quality and help people make good use of healthy diet information.
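The gate mechanism described in the abstract can be illustrated with a minimal sketch of one standard LSTM time step in plain Python. This is the textbook LSTM formulation, not the authors' implementation: the abstract does not give the exact architecture, so the parameter layout, the helper `init_params`, and the function `encode` are hypothetical names introduced here for illustration, and each input `x` stands for one word2vec vector.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, z, b):
    """Return W @ z + b, where W is a list of rows."""
    return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]

def init_params(input_size, hidden_size, value=0.0):
    """Hypothetical helper: constant weights and zero biases, for illustration only."""
    def rows():
        return [[value] * (hidden_size + input_size) for _ in range(hidden_size)]
    # One (W, b) pair per gate: f = forget, i = input, o = output, g = candidate.
    return {gate: (rows(), [0.0] * hidden_size) for gate in "fiog"}

def lstm_step(x, h_prev, c_prev, params):
    """One standard LSTM step; every weight matrix acts on [h_prev; x]."""
    z = h_prev + x  # list concatenation
    f = [sigmoid(v) for v in affine(params["f"][0], z, params["f"][1])]    # what to remove
    i = [sigmoid(v) for v in affine(params["i"][0], z, params["i"][1])]    # what to add
    o = [sigmoid(v) for v in affine(params["o"][0], z, params["o"][1])]    # what to expose
    g = [math.tanh(v) for v in affine(params["g"][0], z, params["g"][1])]  # candidate values
    # Cell state: forget part of the old memory, then add gated new information.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def encode(word_vectors, params, hidden_size):
    """Run the LSTM over a sequence of word vectors and return the final hidden state."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in word_vectors:
        h, c = lstm_step(x, h, c, params)
    return h
```

In a classifier of this kind, the final hidden state returned by `encode` would typically be passed to a softmax layer to predict the text category. The sigmoid gates take values between 0 and 1, which is what lets the cell state retain or discard context across a document.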

Key words

text classification / word2vec / word embedding / long short-term memory network / K-means++

Classification

Information Technology and Security Science

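Cite this article

Zhao Ming, Du Huifang, Dong Cuicui, Chen Changsong. Diet Health Text Classification Based on word2vec and LSTM [J]. 农业机械学报 (Transactions of the Chinese Society for Agricultural Machinery), 2017, 48(10): 202-208, 7.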

Funding

Open Project of the Key Laboratory of Information Network Security, Ministry of Public Security (61503386)

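农业机械学报 (Transactions of the Chinese Society for Agricultural Machinery)

Indexing: OA / Peking University Core Journals / CSCD / CSTPCD

ISSN 1000-1298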
