Computer Technology and Development, 2019, Vol. 29, Issue (3): 23-29, 7. DOI: 10.3969/j.issn.1673-629X.2019.03.005
A Patent Document Classification Method Based on a Word-Weighted LDA Model
Abstract
When traditional topic models are used for text classification, their feature words are usually the high-frequency words favored by corpus statistics. In patent documents, however, most domain-specific terms are overwhelmed by high-frequency words, which lowers the accuracy of topic models on patent classification. We therefore present a supervised, word-weighted LDA topic model for classifying patent documents. Using the co-occurrence relationship between domain-specific terms and high-frequency words, the KeyGraph algorithm selects keywords that characterize the documents well, and a mutual information function computes the weight of each keyword to build a dictionary of domain-specific terms. On this basis, a supervised LDA model is built, the word weights are incorporated into the LDA model, and Gibbs sampling is used to estimate the parameters. Compared with the LDA model and two of its variants, the proposed model improves classification accuracy on patent documents by 4.62%, 3.74% and 3.26%, respectively. The results indicate that the highly specialized words selected by the model are more relevant to the topics, and that classification efficiency and accuracy improve significantly.
Keywords
weighted model / latent Dirichlet allocation (LDA) / KeyGraph algorithm / patent literature classification
Classification
Information Technology and Security Science
Cite this article
孙伟, 刘文静, 葛丽阁, 余璇. A Patent Document Classification Method Based on a Word-Weighted LDA Model [J]. Computer Technology and Development, 2019, 29(3): 23-29, 7.
Funding
National Natural Science Foundation of China, Young Scientists Fund (61203240)
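
Illustrative note on the keyword weighting step described in the abstract: the abstract names a mutual information function as the source of each keyword's weight but does not give the formula. The following is a minimal sketch only, assuming a pointwise mutual information between a word and a class label, estimated from document frequencies. The function name `mi_weights`, the toy documents and the class labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mi_weights(docs, labels):
    """Return {(word, label): PMI} from document-level co-occurrence counts.

    Assumption: PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ), with every
    probability estimated as a fraction of documents. The paper's actual
    weighting function may differ.
    """
    n_docs = len(docs)
    word_df = Counter()          # documents containing each word
    label_df = Counter(labels)   # documents in each class
    joint_df = Counter()         # documents of a class that contain a word
    for tokens, label in zip(docs, labels):
        for word in set(tokens):  # count each word once per document
            word_df[word] += 1
            joint_df[(word, label)] += 1

    weights = {}
    for (word, label), n_wc in joint_df.items():
        ratio = (n_wc * n_docs) / (word_df[word] * label_df[label])
        weights[(word, label)] = math.log(ratio)
    return weights


# Toy corpus: "method" is frequent everywhere, "rotor" and "codec" are
# class-specific terms (hypothetical class labels "F" and "H").
docs = [
    ["rotor", "blade", "method"],
    ["rotor", "method"],
    ["codec", "method"],
    ["codec", "frame", "method"],
]
labels = ["F", "F", "H", "H"]
weights = mi_weights(docs, labels)
```

In this toy corpus the ubiquitous word "method" receives a weight of zero for both classes, while the class-specific term "rotor" receives a positive weight for class "F". That mirrors the abstract's motivation: specialized terms should not be drowned out by high-frequency words.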