Application Research of Computers, 2018, Vol. 35, Issue 3: 694-699, 726, 7. DOI: 10.3969/j.issn.1001-3695.2018.03.012
Focused crawler based on URL patterns
Abstract
To improve the performance of focused crawlers, this paper proposes UPFC (a focused crawler based on URL patterns), which exploits the way sites organize their information and structure their URLs and works in a two-phase framework. In the experimental crawling phase, it collects sample pages from a site and builds URL patterns with a pattern construction algorithm based on a URL prefix tree. It then applies the HITS algorithm to a pattern graph to calculate the importance of each pattern. In the focused crawling phase, the topic relevance and guiding significance of pages are determined from those URL patterns without pre-downloading the pages, and the priority of links to be crawled is predicted from the importance of their URL patterns. Experimental results show that the crawler can be guided to relevant pages quickly, ensuring precision and recall while improving crawling efficiency.
Keywords
focused crawler / URL pattern / URL prefix tree / pattern graph / importance of URL pattern
Classification
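Information Technology and Security Science
Cite this article
HU Pingrui (胡萍瑞), LI Shijun (李石君). Focused crawler based on URL patterns [J]. Application Research of Computers, 2018, 35(3): 694-699, 726, 7.
Funding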
Supported by the National Natural Science Foundation of China (61272109, 61502350)
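
As a rough illustration of the pattern-importance step summarized in the abstract, the sketch below generalizes URLs into patterns, links patterns into a graph, and runs plain HITS on that graph. It is a minimal sketch under stated assumptions, not the paper's algorithm: the digit-wildcard generalization in `url_to_pattern` stands in for the prefix-tree construction (which the abstract does not detail), and the function names, the `example.com` URLs, and the use of the authority score as the link priority are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_to_pattern(url):
    """Generalize a URL into a coarse pattern (assumed scheme, not the paper's)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Replace digit runs with a wildcard so that /news/101.html and
    # /news/102.html fall under the same pattern.
    generalized = [re.sub(r"\d+", "<num>", s) for s in segments]
    return parsed.netloc + "/" + "/".join(generalized)


def pattern_edges(page_links):
    """Lift page-level (source, target) links to pattern-level edges."""
    edges = set()
    for src, dst in page_links:
        p, q = url_to_pattern(src), url_to_pattern(dst)
        if p != q:  # ignore links that stay inside one pattern
            edges.add((p, q))
    return edges


def hits(edges, iterations=50):
    """Plain HITS on a directed graph given as (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    out_links, in_links = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority: sum of hub scores of the patterns linking in.
        auth = {n: sum(hub[m] for m in in_links[n]) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the patterns linked to.
        hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


def link_priority(url, auth, default=0.0):
    """Score an undownloaded link by the authority of its URL pattern."""
    return auth.get(url_to_pattern(url), default)


# Hypothetical sample links gathered in an experimental crawl.
sample_links = [
    ("http://example.com/list/1", "http://example.com/news/101.html"),
    ("http://example.com/list/2", "http://example.com/news/102.html"),
    ("http://example.com/index.html", "http://example.com/news/103.html"),
    ("http://example.com/list/1", "http://example.com/about"),
]
hub, auth = hits(pattern_edges(sample_links))
```

With these sample links, a new URL such as `http://example.com/news/999.html` can be ranked with `link_priority` before it is downloaded, because it matches a pattern already scored. One plausible reading, offered here only as an assumption, is that hub scores correspond to the abstract's "guiding significance" and authority scores to topic relevance.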