
Research on a Focused Crawler Based on an LDA-Extended Topic Thesaurus

费晨杰 刘柏嵩

计算机应用与软件 (Computer Applications and Software), 2018, Vol. 35, Issue 4: 49-54, 6. DOI: 10.3969/j.issn.1000-386x.2018.04.009

FOCUSED CRAWLER BASED ON LDA EXTENDED TOPIC TERMS

费晨杰 ¹, 刘柏嵩 ¹

Author information

  • 1. College of Information Science and Engineering, Ningbo University, Ningbo, Zhejiang 315211

Abstract

The purpose of a focused crawler is to retrieve as much content as possible related to a particular topic. To address the limited coverage of focused crawlers and the low accuracy of topic similarity calculation, this paper proposes a focused crawler framework with a dynamic topic, which expands the topic keywords in two ways: expansion with topic-related words and semantic extension of words. Using the topic-related resources that the focused crawler itself collects, the corpus is continuously expanded, and topic documents are obtained through LDA training to expand and update the thesaurus. On this basis, an improved similarity calculation model based on word2vec word vectors is proposed for page similarity calculation and URL prioritization. Experiments on real news datasets show that the proposed focused crawler performs well in both the accuracy of topic relevance and the yield of topic content.
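To illustrate the idea of ranking candidate URLs by word-vector similarity to a set of topic terms, the following is a minimal sketch only, not the improved model proposed in the paper. The toy vectors stand in for trained word2vec embeddings, and the names `TOY_VECTORS`, `topic_similarity`, and `prioritize`, the example URLs, and the scoring rule (the average, over topic terms, of the best cosine match among page terms) are all assumptions made for illustration.

```python
import heapq
import math

# Hypothetical 3-dimensional vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "crawler": (0.9, 0.1, 0.0),
    "spider": (0.8, 0.2, 0.1),
    "search": (0.7, 0.3, 0.0),
    "football": (0.0, 0.1, 0.9),
    "match": (0.1, 0.2, 0.8),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_similarity(topic_terms, page_terms, vectors=TOY_VECTORS):
    """Assumed scoring rule: for each topic term, take its best cosine
    match among the page terms, then average over the topic terms."""
    topic_vecs = [vectors[w] for w in topic_terms if w in vectors]
    page_vecs = [vectors[w] for w in page_terms if w in vectors]
    if not topic_vecs or not page_vecs:
        return 0.0
    best = [max(cosine(t, p) for p in page_vecs) for t in topic_vecs]
    return sum(best) / len(best)


def prioritize(topic_terms, candidates):
    """Return URLs ordered from most to least topic-relevant.

    `candidates` maps each URL to the terms describing it
    (for example, its anchor text)."""
    heap = [(-topic_similarity(topic_terms, terms), url)
            for url, terms in candidates.items()]
    heapq.heapify(heap)  # min-heap on negated score = highest score first
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


topic = ["crawler", "search"]
frontier = {
    "http://example.com/sports": ["football", "match"],
    "http://example.com/spiders": ["spider", "search"],
}
ordered = prioritize(topic, frontier)
print(ordered)
```

In a real crawler the vectors would come from a trained word2vec model and the topic terms from the LDA-updated thesaurus; both are replaced by fixed toy data here so that the sketch stays self-contained.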

Key words

LDA topic model/Focused crawler/word2vec/Similarity calculation

Classification

Information Technology and Security Science

Cite this article

费晨杰, 刘柏嵩. 基于LDA扩展主题词库的主题爬虫研究[J]. 计算机应用与软件, 2018, 35(4): 49-54, 6.

Funding

National Social Science Fund of China, Post-Funded Project (15FTQ002)

Provincial/ministerial-level laboratory open fund project (B2014)

Journal: 计算机应用与软件 (Computer Applications and Software)

Indexing: OA/北大核心 (Peking University Core Journals)/CSTPCD

ISSN: 1000-386X
