Computer & Digital Engineering (计算机与数字工程), 2019, Vol. 47, Issue 5: 1151-1159, 9. DOI: 10.3969/j.issn.1672-9722.2019.05.027
Research and Implementation of Dynamic Adaptive Topical Crawler (可动态自适应主题爬虫的研究)
Abstract
Faced with a dynamically changing Internet, traditional topical crawlers suffer from problems such as incomplete topical knowledge, the need to update domain knowledge, and the shifting of topical resource centers. This paper proposes a topical crawler that adapts dynamically to Internet information. Its TopicHub algorithm selects seed URLs dynamically; compared with a traditional topical crawler that uses static seed URLs, crawling efficiency increases by more than 7% and the recall rate by more than 5%. In addition, to address the incomplete coverage of topic information and the updating of domain knowledge in a static ontology library, an algorithm named SDTP is proposed that dynamically expands domain semantic information. Its precision is 13% higher than that of the traditional algorithm based on a static ontology library, and 4% higher than that of the VSM-based algorithm.
Key words
topic crawler / dynamic self-adaption / URL graph structure
Classification
Information Technology and Security Science
Cite this article
肖新凤, 余伟, 李石君, 陈亚辉, 刘倍雄, 刘永明. Research and Implementation of Dynamic Adaptive Topical Crawler [J]. Computer & Digital Engineering, 2019, 47(5): 1151-1159, 9.
Funding
National Natural Science Foundation of China (Grant No. 61502350)
2017 Guangdong Provincial Key Platforms and Major Scientific Research Projects of Universities (Grant No. 2017GKTSCX042)