Computer & Digital Engineering, Issue (10): 1751-1755, 5. DOI: 10.3969/j.issn.1672-9722.2015.10.007
A Three Layer Distributed Architecture for Large-Scale Duplicated Web Page Detection
贲兴龙 1, 贾大文 1, 袁林 1
Author information
- 1. The 28th Research Institute of China Electronics Technology Group Corporation, Nanjing 210000
Abstract
Duplicated web page detection is a necessary step. Currently, researchers focus on the accuracy and time complexity of duplicated web page detection algorithms based on the similarity of web page content. This paper proposes a three-layer distributed architecture for large-scale duplicated web page detection, which detects duplicated web pages efficiently by combining local memory caches, distributed caches, and a distributed index. The architecture is especially suitable for applications that need to crawl web pages repeatedly. The experimental results indicate that the proposed architecture satisfies the requirements of large-scale duplicated web page detection in distributed web crawler applications. Moreover, the architecture is scalable.
Keywords
duplicated web page detection / web crawler / distributed architecture
Cite this article
贲兴龙, 贾大文, 袁林. A Three Layer Distributed Architecture for Large-Scale Duplicated Web Page Detection [J]. Computer & Digital Engineering, 2015, (10): 1751-1755, 5.
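
The abstract names three layers (local memory caches, distributed caches, and a distributed index) but does not give their lookup order or update policy. The sketch below shows one plausible way such a lookup could be arranged, checking the cheapest layer first. Everything in it is an assumption made for illustration: the names `fingerprint`, `ThreeLayerDeduplicator`, and `check_and_add`, the exact-match SHA-1 fingerprint (the paper concerns content similarity, which is not modeled here), the LRU eviction in the local layer, and the in-process `dict` and `set` that stand in for real distributed services.

```python
import hashlib
from collections import OrderedDict


def fingerprint(content: str) -> str:
    """Hypothetical exact-match fingerprint of whitespace-normalized content."""
    normalized = " ".join(content.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class ThreeLayerDeduplicator:
    """Illustrative three-layer lookup; not the architecture from the paper."""

    def __init__(self, local_capacity: int = 1000):
        self.local_capacity = local_capacity
        self.local_cache = OrderedDict()  # layer 1: per-node LRU cache
        self.distributed_cache = {}       # layer 2: stand-in for a shared cache
        self.distributed_index = set()    # layer 3: stand-in for the full index

    def _remember_locally(self, key: str) -> None:
        # Insert or refresh the key, then evict the least recently used entry.
        self.local_cache[key] = True
        self.local_cache.move_to_end(key)
        if len(self.local_cache) > self.local_capacity:
            self.local_cache.popitem(last=False)

    def check_and_add(self, content: str):
        """Return the name of the layer that already knew the page, or None if new."""
        key = fingerprint(content)
        if key in self.local_cache:
            self.local_cache.move_to_end(key)
            return "local_cache"
        if key in self.distributed_cache:
            self._remember_locally(key)
            return "distributed_cache"
        if key in self.distributed_index:
            # Promote the key so later lookups stop at a cheaper layer.
            self.distributed_cache[key] = True
            self._remember_locally(key)
            return "distributed_index"
        # Unseen page: record it in every layer (an assumed write policy).
        self.distributed_index.add(key)
        self.distributed_cache[key] = True
        self._remember_locally(key)
        return None
```

In this arrangement a crawler node would call `check_and_add` on each fetched page and skip further processing whenever the result is not `None`. Pages that are re-crawled often would tend to be answered by the local layer, which matches the repeated-crawl use case the abstract mentions, though the actual benefit depends on the real cache and index implementations.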