Computer Engineering and Applications (计算机工程与应用), 2017, Vol. 53, Issue 12: 152-157, 165 (7 pages). DOI: 10.3778/j.issn.1002-8331.1606-0088
Combining lexical features and LDA for semantic relatedness measure
Abstract
Computing semantic relatedness between text documents is a key problem in many domains, such as Natural Language Processing (NLP) and Semantic Information Retrieval (SIR). Wikipedia-based Explicit Semantic Analysis (ESA) has received wide attention and been widely applied, mainly because of its simplicity and effectiveness. However, ESA is inefficient for semantic relatedness computation because of its redundant concepts and high dimensionality. This paper presents a new technique based on Latent Dirichlet Allocation (LDA) and Jensen-Shannon Divergence (JSD) to compute semantic relatedness between text documents. LDA is employed to reduce dimensionality and improve efficiency, and is used to build topic-model probability vectors from the high-dimensional document matrix. Instead of cosine distance, JSD is used to compute semantic relatedness between documents. Results on several benchmark tests are presented to compare the proposed technique with other methods. The experimental results show that the technique based on LDA and JSD is more effective than ESA: it improves the Pearson correlation coefficient by more than 3% over ESA and by more than 9% over LDA.
Keywords
topic model / lexical features / Explicit Semantic Analysis (ESA) / Latent Dirichlet Allocation (LDA) / semantic relatedness measure
Classification
Information Technology and Security Science
Cite this article
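Xiao Bao (肖宝), Li Pu (李璞), Jiang Yuncheng (蒋运承). Combining lexical features and LDA for semantic relatedness measure [J]. Computer Engineering and Applications, 2017, 53(12): 152-157, 165.
Funding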
National Natural Science Foundation of China (No. 61272066)
Guangzhou Science and Technology Program (No. 2014J4100031)
Guangxi Universities Basic Ability Improvement Project for Young and Middle-aged Teachers (No. KY2016LX431)