Transactions of the Chinese Society for Agricultural Machinery, 2017, Vol. 48, Issue 10: 202-208, 7. DOI: 10.6041/j.issn.1000-1298.2017.10.025
Diet Health Text Classification Based on word2vec and LSTM
Abstract
The rapid development of the Internet has led to rapid growth in online information. Text is the main form of information on the network and exists in massive quantities, and this includes text about diet. Dietary information is closely related to people's health, so classifying such texts automatically is important for helping people make effective use of healthy-eating information. To classify food text information efficiently, a classification model based on word2vec and LSTM was proposed. In view of the characteristics of food text in encyclopedias and diet text on health websites, word2vec was used to produce word embeddings that carry semantic information, which avoided the sparse representation and the curse of dimensionality faced by traditional methods. Word2vec combined with K-means++ was used to cluster keywords for both what is suitable to eat and what should be avoided, in order to expand the relevant words in the classification dictionaries. These words were then used to derive rules that improved the quality of the training data. Document vectors were constructed from word2vec and used as the initial input to a long short-term memory (LSTM) network. LSTM places the input layer and hidden layers of the neural network inside a memory cell, where they are protected. Through its "gate" structure, the sigmoid and tanh functions remove information from or add information to the cell state, which gives the LSTM model a "memory" that makes good use of the context of a text, an important property for text classification. Experiments were performed on 48 000 documents. The classification accuracy was 98.08%, which was higher than that obtained with tf-idf and bag-of-words text vector representations. Two other classification algorithms, support vector machine (SVM) and convolutional neural network (CNN), both based on word2vec, were also tested. The results showed that the proposed model outperformed these competing methods by several percentage points. This demonstrates that the method can automatically classify dietary texts with high quality and help people make good use of healthy-diet information.
Key words
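text classification/word2vec/word embedding/long short-term memory network/K-means++
Classification
Information technology and security science
Cite this article
ZHAO Ming, DU Huifang, DONG Cuicui, CHEN Changsong. Diet health text classification based on word2vec and LSTM[J]. Transactions of the Chinese Society for Agricultural Machinery, 2017, 48(10): 202-208, 7.
Funding
Open Project of the Key Laboratory of Information Network Security, Ministry of Public Security (61503386)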
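The abstract describes how the LSTM "gate" structure uses sigmoid and tanh functions to remove information from, or add information to, the cell state. The following is a minimal sketch of that mechanism using the standard LSTM update equations in NumPy; it is not the authors' implementation. The function names `lstm_step` and `encode_document`, the stacked weight layout (forget, input, candidate, output), the dimensions, and the random stand-ins for word2vec vectors are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function; squashes gate pre-activations into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One standard LSTM time step (hypothetical helper).

    x_t:    input vector at this step, shape (D,), e.g. a word vector
    h_prev: previous hidden state, shape (H,)
    c_prev: previous cell state, shape (H,)
    W:      stacked weights, shape (4*H, D+H), rows ordered as
            forget gate, input gate, candidate, output gate (assumed layout)
    b:      stacked biases, shape (4*H,)
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0:H])            # forget gate: how much old cell state to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new information to admit
    g = np.tanh(z[2 * H:3 * H])    # candidate values to add to the cell state
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g       # remove old information, then add new
    h_t = o * np.tanh(c_t)
    return h_t, c_t


def encode_document(word_vectors, W, b, hidden_size):
    """Run the LSTM over a sequence of word vectors; return the final hidden state."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x_t in word_vectors:
        h, c = lstm_step(x_t, h, c, W, b)
    return h


# Illustrative run: random vectors stand in for word2vec embeddings.
rng = np.random.default_rng(0)
D, H, T = 8, 4, 6                  # embedding size, hidden size, sequence length
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)
doc = rng.normal(size=(T, D))
print(encode_document(doc, W, b, H).shape)
```

In a classifier of the kind the abstract outlines, the final hidden state would typically feed a softmax layer over the text categories, and the weights would be learned by backpropagation rather than drawn at random as they are here.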