Journal of Central South University (Science and Technology), Issue (6): 2142-2148, 7. DOI: 10.11817/j.issn.1672-7207.2015.06.023
An LDA-based approach to keyphrase extraction
Abstract
Existing approaches lack a comprehensive analysis of document topic coverage and of the readability and distinctiveness of keyphrases. To address this, a new keyphrase extraction algorithm, TFITF, based on a latent topic model is proposed. The algorithm trains a latent topic model on a large-scale corpus, calculates the TFITF weight of each word on each topic, and from these derives the weight of each word on the document. Adjacent words are then ranked and selected as candidate keyphrases based on co-occurrence information, and redundant phrases are eliminated according to the topic similarity of their words. In addition, comparative experiments on candidate keyphrases are carried out using document statistical information, lexical chains and topic information. Experiments on an evaluation dataset of 1 040 Chinese documents with 5 408 standard keyphrases demonstrate that the method effectively improves the precision and recall of keyphrase extraction.
Key words
information extraction/keyphrase extraction/LDA model/topic similarity
Classification
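Information technology and security science
Cite this article
ZHU Zede, LI Miao, ZHANG Jian, ZENG Weihui, ZENG Xinhua. An LDA-based approach to keyphrase extraction[J]. Journal of Central South University (Science and Technology), 2015, (6): 2142-2148, 7.
Funding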
Project (201306320) supported by the Open Projects Program of the National Laboratory of Pattern Recognition
Project (XXH12504-1-10) supported by the Informationization Special Projects of the Chinese Academy of Sciences
Project (61070099) supported by the National Natural Science Foundation of China
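Illustrative sketch
The pipeline summarized in the abstract (topic-level word weights, document-level word weights, then removal of topically redundant candidates) can be illustrated roughly as follows. The abstract does not state the TFITF formula, so the weighting used here (topic probability scaled by an inverse topic frequency, by analogy with TF-IDF) is an assumption, as are all function names, the `min_prob` and `max_sim` thresholds, and the toy distributions. The co-occurrence ranking of adjacent words is omitted. This is a sketch of the general idea, not the authors' implementation.

```python
import math

def tfitf(phi, min_prob=0.01):
    """Assumed TFITF: word probability in a topic times an inverse topic frequency.

    phi[k][w] is the probability of word w under topic k, taken as given
    (for example, from an already trained LDA model).
    """
    num_topics = len(phi)
    vocab = set().union(*phi)
    # Topic frequency: number of topics in which the word is prominent.
    topic_freq = {w: sum(1 for t in phi if t.get(w, 0.0) >= min_prob) for w in vocab}
    return [
        {w: p * math.log(1 + num_topics / max(topic_freq[w], 1)) for w, p in topic.items()}
        for topic in phi
    ]

def document_weights(theta, topic_weights):
    """Weight of each word on a document: topic weights mixed by theta[k] = P(topic k | doc)."""
    scores = {}
    for p_topic, weights in zip(theta, topic_weights):
        for w, wt in weights.items():
            scores[w] = scores.get(w, 0.0) + p_topic * wt
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_keyphrases(scores, phi, top_n=3, max_sim=0.95):
    """Greedy selection by score, skipping candidates whose topic vector is
    nearly identical to that of an already selected candidate."""
    selected = []
    for w in sorted(scores, key=scores.get, reverse=True):
        vec = [t.get(w, 0.0) for t in phi]
        if all(cosine(vec, [t.get(s, 0.0) for t in phi]) < max_sim for s in selected):
            selected.append(w)
        if len(selected) == top_n:
            break
    return selected

# Toy two-topic model (hypothetical numbers, for illustration only).
phi = [
    {"keyphrase": 0.5, "keyword": 0.4, "data": 0.1},
    {"rice": 0.6, "crop": 0.3, "data": 0.1},
]
theta = [0.9, 0.1]  # the document is mostly about topic 0
scores = document_weights(theta, tfitf(phi))
keyphrases = select_keyphrases(scores, phi)
print(keyphrases)
```

In this toy setup, "keyword" has the same topic profile as the higher-scoring "keyphrase", so the similarity filter drops it as redundant.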