Computer Engineering and Applications, 2011, Vol. 47, Issue 29: 23-26, 4. DOI: 10.3778/j.issn.1002-8331.2011.29.007
Improved crawler technique for P2P-specific information
Abstract
Existing topic crawler techniques fetch many irrelevant websites while obtaining "meta-information", so the topic crawler is improved by adding a URL classification algorithm. Based on supplied URL samples, the algorithm generates multiple keyword sets for irrelevant URLs and keyword sets for "meta-information" URLs. It assigns a weight to each keyword in a set and a threshold to each set, describes each URL by a feature vector, and classifies the URL by computing its distance to the keyword sets. The algorithm's performance is analyzed in detail. Tests indicate that, compared with the traditional topic crawler technique, the improved technique obtains "meta-information" considerably more efficiently: the quantity obtained in the same time increases by 96.21%, which meets the performance requirement that the active monitoring model places on the crawler.
Keywords
"meta-information" acquisition / topic crawler technique / URL classification algorithm / feature vector representation / active monitoring model
Classification
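Information Technology and Security Science
Cite this article
DING Junping, CAI Wandong. Improved crawler technique for P2P-specific information [J]. Computer Engineering and Applications, 2011, 47(29): 23-26, 4.
Funding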
National High-Tech Research and Development Plan of China (863 Program), Grant No. 2009AA01Z424.
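
Illustrative sketch

The URL classification step summarized in the abstract (weighted keyword sets, a threshold per set, URLs described as feature vectors) might look roughly like the Python sketch below. This is not the authors' implementation. The tokenization rule, the normalized weighted-match score used in place of the paper's distance measure, and all keywords, weights, and thresholds are assumptions chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class KeywordSet:
    """A weighted keyword set with its own acceptance threshold (hypothetical structure)."""
    label: str                 # e.g. "meta" or "irrelevant"
    weights: Dict[str, float]  # keyword -> weight
    threshold: float           # minimum score for a URL to belong to this set


def tokenize(url: str) -> Set[str]:
    """Split a URL into lowercase alphanumeric tokens (assumed feature extraction)."""
    return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}


def score(url: str, ks: KeywordSet) -> float:
    """Fraction of the set's total weight carried by keywords present in the URL.

    The URL's feature vector is implicit: one binary component per keyword.
    This score is an assumed stand-in for the paper's distance measure.
    """
    tokens = tokenize(url)
    total = sum(ks.weights.values())
    matched = sum(w for k, w in ks.weights.items() if k in tokens)
    return matched / total if total else 0.0


def classify(url: str, keyword_sets: List[KeywordSet]) -> str:
    """Return the label of the best-scoring set whose threshold is met, else "unknown"."""
    best_label, best_score = "unknown", 0.0
    for ks in keyword_sets:
        s = score(url, ks)
        if s >= ks.threshold and s > best_score:
            best_label, best_score = ks.label, s
    return best_label


# Hypothetical keyword sets; real ones would be generated from URL samples.
SETS = [
    KeywordSet("meta", {"torrent": 3.0, "download": 2.0, "seed": 1.0}, 0.4),
    KeywordSet("irrelevant", {"login": 2.0, "ads": 2.0, "help": 1.0}, 0.4),
]

if __name__ == "__main__":
    for u in ("http://example.com/download/file.torrent",
              "http://example.com/login?help=1",
              "http://example.com/news"):
        print(u, "->", classify(u, SETS))
```

Under these assumptions, a crawler would skip URLs labeled "irrelevant" and prioritize those labeled "meta", which is the kind of filtering the abstract credits for the efficiency gain.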