Focused crawler based on URL patterns

Hu Pingrui (胡萍瑞), Li Shijun (李石君)

Application Research of Computers (计算机应用研究), 2018, Vol. 35, Issue 3: 694-699, 726, 7. DOI: 10.3969/j.issn.1001-3695.2018.03.012

Author information

  • 1. School of Computer Science, Wuhan University, Wuhan 430072, China

Abstract

To improve the performance of focused crawlers, this paper proposes UPFC (a focused crawler based on URL patterns), which exploits the way sites organize their information and structure their URLs and works in a two-phase framework. In the experimental crawling phase, it collects sample pages from a site and builds URL patterns with a pattern construction algorithm based on a URL prefix tree. It then applies the HITS algorithm to the pattern graph to calculate the importance of each pattern. In the focused crawling phase, the URL patterns determine the topic relevance and guiding significance of pages without pre-downloading them, and the priority of each link to be crawled is predicted from the importance of its URL pattern. Experimental results show that the crawler is guided to relevant pages quickly, maintains precision and recall, and improves crawling efficiency.
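
The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the prefix-tree construction algorithm, so the generalization rule here (replacing digit runs in each path segment with `*`) is a simplifying assumption, and the names `url_to_pattern`, `build_pattern_graph`, `hits`, and `link_priority` are hypothetical. Using the hub score as the crawl priority is likewise an assumption about how "importance" might be applied.

```python
import math
import re
from collections import defaultdict
from urllib.parse import urlsplit


def url_to_pattern(url):
    """Generalize a URL into a pattern (assumed rule: digit runs become '*')."""
    parts = urlsplit(url)
    segments = [re.sub(r"\d+", "*", seg) for seg in parts.path.split("/") if seg]
    return parts.netloc + "/" + "/".join(segments)


def build_pattern_graph(links):
    """Lift page-level links (src_url, dst_url) to edges between URL patterns."""
    graph = defaultdict(set)
    for src, dst in links:
        p_src, p_dst = url_to_pattern(src), url_to_pattern(dst)
        graph[p_src]  # touching the key registers the node
        graph[p_dst]
        if p_src != p_dst:  # ignore links within the same pattern
            graph[p_src].add(p_dst)
    return dict(graph)


def _normalize(scores):
    """Scale scores in place to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm > 0:
        for key in scores:
            scores[key] /= norm


def hits(graph, iterations=50):
    """Plain HITS iteration; returns (hub, authority) scores per pattern."""
    hub = {node: 1.0 for node in graph}
    auth = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Authority: sum of hub scores of the patterns that link here.
        auth = {node: 0.0 for node in graph}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        _normalize(auth)
        # Hub: sum of authority scores of the patterns linked to.
        hub = {node: sum(auth[dst] for dst in graph[node]) for node in graph}
        _normalize(hub)
    return hub, auth


def link_priority(url, hub):
    """Score an undownloaded link by the hub score of its pattern (assumption)."""
    return hub.get(url_to_pattern(url), 0.0)
```

With sample links in which list pages point to item pages and to an about page, the list pattern would receive the highest hub score, so unseen URLs matching `example.com/news/list/*` would be queued ahead of others without downloading them first.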

Key words

focused crawler / URL pattern / URL prefix tree / pattern graph / importance of URL pattern

Classification

Information technology and security science

Cite this article

Hu Pingrui, Li Shijun. Focused crawler based on URL patterns [J]. Application Research of Computers, 2018, 35(3): 694-699, 726, 7.

Funding

National Natural Science Foundation of China (61272109, 61502350)

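Application Research of Computers (计算机应用研究)

Open access; indexed in Peking University Core Journals (北大核心), CSCD, CSTPCD

ISSN 1001-3695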
