A Three Layer Distributed Architecture for Large-Scale Duplicated Web Page Detection

Ben Xinglong (贲兴龙), Jia Dawen (贾大文), Yuan Lin (袁林)

Computer & Digital Engineering (计算机与数字工程), Issue (10): 1751-1755, 5. DOI: 10.3969/j.issn.1672-9722.2015.10.007

A Three Layer Distributed Architecture for Large-Scale Duplicated Web Page Detection

Ben Xinglong¹, Jia Dawen¹, Yuan Lin¹

Author information

  • 1. The 28th Research Institute of China Electronics Technology Group Corporation, Nanjing 210000

Abstract

Duplicated web page detection is a necessary step. Currently, researchers focus on the accuracy and time complexity of duplicated web page detection algorithms based on the similarity of web page content. A three-layer distributed architecture for large-scale duplicated web page detection is proposed, which detects duplicated web pages efficiently by combining local memory caches, distributed caches, and a distributed index. This architecture is especially applicable to applications that need to crawl web pages repeatedly. The experimental results indicate that the proposed architecture can satisfy the requirements of large-scale duplicated web page detection in distributed web crawler applications. Moreover, the architecture is scalable.
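
The abstract names three layers (local memory caches, distributed caches, and a distributed index) but does not give their interfaces, so the following Python sketch is only an illustration of how such a layered lookup could be organized, not the authors' implementation. Everything in it is an assumption: the `fingerprint` function (an exact SHA-1 hash of normalized text rather than a similarity measure), the LRU eviction policy, the lookup order, and the plain `set` objects that stand in for a real distributed cache and a real distributed index.

```python
import hashlib
from collections import OrderedDict


def fingerprint(content: str) -> str:
    """Hypothetical fingerprint: SHA-1 of whitespace-normalized page text."""
    normalized = " ".join(content.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class LocalLRUCache:
    """Layer 1 (assumed): bounded in-process cache of recent fingerprints."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self._seen = OrderedDict()

    def contains(self, key: str) -> bool:
        if key in self._seen:
            self._seen.move_to_end(key)  # mark as most recently used
            return True
        return False

    def add(self, key: str) -> None:
        self._seen[key] = None
        self._seen.move_to_end(key)
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict least recently used


class ThreeLayerDeduplicator:
    """Checks the cheapest layer first and falls through to slower ones."""

    def __init__(self, local: LocalLRUCache,
                 distributed_cache: set, distributed_index: set) -> None:
        self.local = local
        # Stand-ins: in a real deployment these would be network services.
        self.distributed_cache = distributed_cache
        self.distributed_index = distributed_index
        self.hits = {"local": 0, "cache": 0, "index": 0}

    def is_duplicate(self, content: str) -> bool:
        key = fingerprint(content)
        if self.local.contains(key):
            self.hits["local"] += 1
            return True
        if key in self.distributed_cache:
            self.hits["cache"] += 1
            self.local.add(key)  # promote to the faster layer
            return True
        if key in self.distributed_index:
            self.hits["index"] += 1
            self.distributed_cache.add(key)
            self.local.add(key)
            return True
        # Unseen page: record it in every layer so later checks are cheap.
        self.distributed_index.add(key)
        self.distributed_cache.add(key)
        self.local.add(key)
        return False
```

Under these assumptions, a crawler node that fetches the same page repeatedly would answer most checks from its own memory, and a second node sharing the same cache and index objects would resolve the same page at the distributed cache layer without consulting the index.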

Key words

duplicated web page detection/web crawler/distributed architecture

Cite this article

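Ben Xinglong, Jia Dawen, Yuan Lin. A Three Layer Distributed Architecture for Large-Scale Duplicated Web Page Detection [J]. Computer & Digital Engineering, 2015, (10): 1751-1755, 5.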

Computer & Digital Engineering (计算机与数字工程), OA, CSTPCD, ISSN 1672-9722