Computer Applications and Software, 2018, Vol. 35, Issue 4: 49-54, 6. DOI: 10.3969/j.issn.1000-386x.2018.04.009
Research on a Focused Crawler Based on an LDA-Extended Topic Thesaurus
FOCUSED CRAWLER BASED ON LDA EXTENDED TOPIC TERMS
Abstract
The purpose of a focused crawler is to collect as much content as possible related to a particular topic. To address the limited coverage of focused crawlers and the low accuracy of topic similarity calculation, this paper proposes a focused crawler framework with a dynamic topic, which expands the topic keywords in two ways: topic-based expansion and semantic expansion of the keywords. Using the topic-related resources that the focused crawler itself collects, the corpus is continuously expanded, and topic documents are obtained through LDA training to expand and update the thesaurus. On this basis, an improved similarity calculation model based on word2vec word vectors is proposed for page similarity calculation and URL prioritization. Experiments on real news datasets show that the proposed focused crawler performs well in both the accuracy of topic relevance and the yield of topic content.
Keywords
LDA topic model / Focused crawler / word2vec / Similarity calculation
Classification
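Information Technology and Security Science
Cite this article
Fei Chenjie (费晨杰), Liu Baisong (刘柏嵩). Research on a focused crawler based on an LDA-extended topic thesaurus [J]. Computer Applications and Software, 2018, 35(4): 49-54, 6.
Funding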
National Social Science Fund of China, Post-Funding Project (15FTQ002)
Provincial/Ministerial Laboratory Open Fund Project (B2014)
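
The abstract mentions scoring pages with word-vector similarity and using the score to prioritize URLs. The paper's improved similarity model is not given in this record, so the following is only a minimal sketch of the general idea, assuming plain cosine similarity between averaged word vectors. The `VECTORS` table holds made-up toy values standing in for a trained word2vec model, and the function names and example URLs are hypothetical.

```python
import heapq
import math

# Toy 3-dimensional vectors (hypothetical values, NOT from a trained word2vec model).
VECTORS = {
    "crawler":  [0.9, 0.1, 0.0],
    "spider":   [0.8, 0.2, 0.1],
    "topic":    [0.1, 0.9, 0.1],
    "football": [0.0, 0.1, 0.9],
}

def mean_vector(words):
    """Average the vectors of the words that have one; return None if none do."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def relevance(text_words, topic_words):
    """Score a piece of text against the topic word list (assumed plain cosine)."""
    a, b = mean_vector(text_words), mean_vector(topic_words)
    if a is None or b is None:
        return 0.0
    return cosine(a, b)

def prioritize(candidates, topic_words):
    """candidates: list of (url, words). Return the URLs, most relevant first."""
    # heapq is a min-heap, so the score is negated to pop the best URL first.
    heap = [(-relevance(words, topic_words), url) for url, words in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

In a crawler along these lines, the topic word list would be the thesaurus that the LDA step keeps updating, and `prioritize` would order the frontier of unvisited links by the words in their anchor text or surrounding context.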