Computer Engineering & Science (计算机工程与科学), 2017, Vol. 39, Issue 5: 978-983, 6. DOI: 10.3969/j.issn.1007-130X.2017.05.024
A cross-lingual document similarity calculation method based on bilingual LDA
Abstract
Building on bilingual topic models, we analyze the similarity of bilingual documents and propose a cross-lingual document similarity calculation method based on bilingual LDA. First, we train a bilingual LDA model on parallel bilingual documents and use the trained model to predict the topic distributions of a new corpus, so that the bilingual documents of the new corpus are mapped into a shared topic vector space. We then compute the similarity of these documents by applying cosine similarity to their topic distributions. We also improve the topic frequency-inverse document frequency method by taking into account the dispersion of topic distributions within and between categories, and use the improved method to calculate the weights of feature topics. Experimental results show that the improved weighting method raises the recall rate, keeps the LDA-based similarity calculation from being limited to particular categories, and is reliable.
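The similarity step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each document has already been mapped to a topic distribution over the same K topics by a trained bilingual LDA model, and the function name `weighted_cosine`, the optional `weights` argument, and the sample vectors are all hypothetical. The abstract does not give the formula of the improved topic frequency-inverse document frequency weights, so `weights` is only a placeholder for them.

```python
import math
from typing import Optional, Sequence


def weighted_cosine(p: Sequence[float], q: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Cosine similarity between two topic distributions over the same topics.

    `weights` is a hypothetical per-topic weight vector (a stand-in for
    feature topic weights); when omitted, every topic has weight 1.
    """
    if len(p) != len(q):
        raise ValueError("topic distributions must have the same length")
    if weights is None:
        weights = [1.0] * len(p)
    elif len(weights) != len(p):
        raise ValueError("weights must have one entry per topic")

    # Scale each topic dimension by its weight before comparing.
    a = [w * x for w, x in zip(weights, p)]
    b = [w * y for w, y in zip(weights, q)]

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no usable topic mass on at least one side
    return dot / (norm_a * norm_b)


# Made-up topic distributions for illustration only (K = 3 topics).
zh_doc = [0.7, 0.2, 0.1]    # a Chinese document
en_doc = [0.6, 0.3, 0.1]    # an English document on a similar topic mix
en_other = [0.1, 0.1, 0.8]  # an English document on a different topic mix

sim_close = weighted_cosine(zh_doc, en_doc)
sim_far = weighted_cosine(zh_doc, en_other)
```

Because both documents live in the same topic space, the score depends only on how their topic mixes overlap, not on the language of the words, so `sim_close` comes out larger than `sim_far`.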
Keywords
bilingual LDA/cross-lingual document similarity calculation/cosine similarity/topic frequency-inverse document frequency
Classification
Information Technology and Security Science
Cite this article
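CHENG Wei, XIAN Yantuan, ZHOU Lanjiang, YU Zhengtao, WANG Hongbin. A cross-lingual document similarity calculation method based on bilingual LDA [J]. Computer Engineering & Science, 2017, 39(5): 978-983, 6.
Funding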
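National Natural Science Foundation of China (61363044, 61462054)
General Program of the Yunnan Provincial Department of Science and Technology (2015FB135)
Scientific Research Fund of the Yunnan Provincial Department of Education (2014Z021)
Provincial Talent Training Program of Kunming University of Science and Technology (KKSY201403028)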