Computer Engineering and Applications, 2019, Vol. 55, Issue 12: 155-161, 7. DOI: 10.3778/j.issn.1002-8331.1803-0259
Optimal Selection Method for LDA Topics Based on Degree of Overlap and Completeness
Abstract
Many LDA-based topic modeling methods can infer the number of topics and the topic descriptions from large text data sets. However, several problems remain, such as determining the number of topics and selecting the topic words. This paper proposes a new method for selecting the optimal topic description based on an Overlap-Completeness score. It combines LDA with TF-IDF and takes both the completeness of the words and word independence into consideration. Starting from the LDA result, TF-IDF is used to select distinctive words for each topic; the degree of overlap between the vocabularies of different topics and the degree of completeness of the topic description are then defined; finally, the optimal selection method is presented. The method yields not only the best number of topics but also the best descriptive words for each topic. Experiments on news about information security show that, compared with the traditional LDA model, the method obtains distinctive topics and representative words.
Keywords
LDA model / TF-IDF / topic detection / degree of overlap / degree of completeness
Classification
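Information Technology and Security Science
Cite this article
柏志安, 曾剑平. Optimal Selection Method for LDA Topics Based on Degree of Overlap and Completeness [J]. Computer Engineering and Applications, 2019, 55(12): 155-161, 7.
Funding
Natural Science Foundation of Shanghai (No. 15ZR1403700).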
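Illustrative sketch
The pipeline outlined in the abstract (LDA output, TF-IDF word selection, overlap and completeness, choice of the best topic number) can be sketched roughly as follows. The abstract does not state the paper's formulas, so everything here is an assumption made for illustration: overlap is taken as the mean pairwise Jaccard similarity of the selected vocabularies, completeness as the share of each topic's probability mass covered by its selected words, and the combined score as completeness minus overlap. The function names and the toy topic-word probabilities are hypothetical, and the LDA models are assumed to have been fitted beforehand.

```python
import math
from itertools import combinations


def distinctive_words(topics, n_words):
    """Pick n_words per topic using a TF-IDF-style weight.

    topics: list of dicts mapping word -> p(word | topic), e.g. read off
    a fitted LDA model. tf is the word's probability within the topic;
    the smoothed idf down-weights words that occur in many topics.
    """
    k = len(topics)
    df = {}
    for topic in topics:
        for word in topic:
            df[word] = df.get(word, 0) + 1
    selected = []
    for topic in topics:
        weight = {
            w: p * (math.log((1 + k) / (1 + df[w])) + 1)
            for w, p in topic.items()
        }
        # Sort by descending weight; break ties alphabetically.
        ranked = sorted(weight, key=lambda w: (-weight[w], w))
        selected.append(set(ranked[:n_words]))
    return selected


def overlap(selected):
    """ASSUMED definition: mean pairwise Jaccard similarity of topic vocabularies."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


def completeness(topics, selected):
    """ASSUMED definition: mean share of topic probability mass covered by the selected words."""
    shares = [
        sum(topic[w] for w in words) / sum(topic.values())
        for topic, words in zip(topics, selected)
    ]
    return sum(shares) / len(shares)


def oc_score(topics, n_words):
    """ASSUMED combination: high completeness and low overlap are both rewarded."""
    selected = distinctive_words(topics, n_words)
    return completeness(topics, selected) - overlap(selected)


def best_model(candidates, n_words):
    """candidates: dict mapping topic number K -> list of topic dicts."""
    return max(candidates, key=lambda k: oc_score(candidates[k], n_words))


# Hypothetical toy output of two LDA runs (K = 2 and K = 3).
candidates = {
    2: [
        {"virus": 0.5, "attack": 0.3, "data": 0.2},
        {"password": 0.5, "leak": 0.3, "data": 0.2},
    ],
    3: [
        {"virus": 0.4, "data": 0.4, "attack": 0.2},
        {"data": 0.4, "leak": 0.4, "password": 0.2},
        {"data": 0.5, "attack": 0.3, "leak": 0.2},
    ],
}
print(best_model(candidates, n_words=2))
```

On this toy input the K = 2 run is preferred: both runs cover the same share of probability mass, but the K = 3 topics all share the word "data", which raises their overlap. The scoring rule actually used in the paper may differ from this sketch.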