Topic crawling strategies based on Wikipedia and analysis of web-page similarity

Luan Xia¹, Zhao Xiaonan²

Modern Electronics Technique (现代电子技术), 2014, Issue (20): 35-37, 3.

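Author information

1. Network Center, No. 323 Hospital of the Chinese People's Liberation Army, Xi'an, Shaanxi 710054
2. Unit 68303 of the Chinese People's Liberation Army, Wuwei, Gansu 733000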

Abstract

To overcome the weaknesses of existing topic crawling strategies, this paper proposes a topic crawling strategy based on Wikipedia and web-page similarity analysis. The Wikipedia category tree is used to describe the topics, and the downloaded web pages are then processed accordingly. Finally, the priorities of the candidate links are calculated by combining text relevance with link analysis. The experimental results indicate that the new method outperforms a traditional crawler in both search results and topic relevance, and that its crawl rate is higher. The topic description method and the crawling strategy are worth wider adoption, and the crawler offers some novelty, particularly in the field of genetically modified organisms.
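The priority step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the bag-of-words cosine similarity, the weight `alpha`, the use of the parent page's relevance as a stand-in for the link-analysis component, and the names `link_priority` and `Frontier` are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def link_priority(topic_text, anchor_text, parent_page_text, alpha=0.5):
    """Hypothetical score: weighted mix of anchor-text relevance and the
    relevance of the page the link was found on (an assumed stand-in for
    the paper's link analysis)."""
    text_score = cosine_similarity(topic_text, anchor_text)
    page_score = cosine_similarity(topic_text, parent_page_text)
    return alpha * text_score + (1 - alpha) * page_score


class Frontier:
    """Max-priority queue of candidate URLs.

    heapq is a min-heap, so priorities are stored negated.
    """

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        # Skip URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority
```

In this sketch a crawler would score each extracted link with `link_priority`, push it onto the `Frontier`, and always fetch the highest-scoring URL next. In the paper the topic description comes from the Wikipedia category tree; here it is simply assumed to be a string of topic terms.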

Key words

topic crawling/Wikipedia/text relevance/link analysis/similarity calculation

Classification

Information technology and security science

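Cite this article

Luan Xia, Zhao Xiaonan. Topic crawling strategies based on Wikipedia and analysis of web-page similarity [J]. Modern Electronics Technique (现代电子技术), 2014, (20): 35-37, 3.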

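Modern Electronics Technique (现代电子技术)

Open access; indexed in the Peking University Core Journals list (北大核心) and CSTPCD

ISSN: 1004-373X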
