
Web page deduplication with an improved Bloom filter algorithm under Hadoop

Huang Weijian (黄伟建), Yang Hailong (杨海龙)

Computer Engineering & Science (计算机工程与科学), 2017, Vol. 39, Issue 2: 285-290, 6. DOI: 10.3969/j.issn.1007-130X.2017.02.010

An improved Bloom Filter algorithm under the Hadoop for duplicated web page removal

Huang Weijian 1, Yang Hailong 1

Author information

  • 1. School of Information and Electrical Engineering, Hebei University of Engineering, Handan 056038, Hebei, China

Abstract

To reduce the storage wasted on servers that hold large amounts of duplicated and similar data, we propose an improved Bloom filter algorithm. It adds a bit array and dynamically optimizes the number of copies of duplicated data according to a weight calculated from the repeated hits on that bit array. The improved algorithm is then parallelized on a Hadoop distributed cluster to further improve processing efficiency. Experimental results show that, compared with traditional web page duplicate removal algorithms, the improved Bloom filter algorithm not only improves the processing efficiency of jobs but also saves server storage space to a certain extent, because the number of copies of duplicated data is adjusted dynamically from the repeated hits on the bit array.
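
The abstract gives no implementation details, so the following is only a minimal single-machine sketch of the general idea: a Bloom filter with an extra per-slot array that records repeated hits, plus a weight-to-copy-count policy. The class name `HitCountingBloomFilter`, the use of salted SHA-256 hashes, the weight definition (minimum hit count over an item's slots), and the `replica_count` policy are all assumptions made for illustration, not the authors' method. The Hadoop/MapReduce parallelization is not shown.

```python
import hashlib


class HitCountingBloomFilter:
    """Bloom filter with a parallel hit-count array (illustrative assumption)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # classic Bloom filter bit array
        self.hits = [0] * num_bits      # extra array: repeated hits per slot

    def _positions(self, item):
        # Derive num_hashes slot indices by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Insert item; return True if it was probably seen before."""
        positions = set(self._positions(item))  # de-duplicate colliding slots
        seen = all(self.bits[p] for p in positions)
        for p in positions:
            if seen:
                self.hits[p] += 1  # record a repeated hit on this slot
            self.bits[p] = True
        return seen

    def weight(self, item):
        """Assumed weight: smallest repeated-hit count over the item's slots."""
        return min(self.hits[p] for p in self._positions(item))


def replica_count(weight, base=3, minimum=1):
    """Hypothetical policy: one fewer copy per repeated hit, never below minimum."""
    return max(minimum, base - weight)


if __name__ == "__main__":
    bloom = HitCountingBloomFilter()
    for url in ["http://example.com/a", "http://example.com/a", "http://example.com/b"]:
        duplicate = bloom.add(url)
        print(url, duplicate, replica_count(bloom.weight(url)))
```

Like any Bloom filter, this sketch can report a new page as already seen (a false positive) but never misses a page it has stored. Slots shared between different pages can also inflate the hit counts, so the weight is an upper-bound estimate rather than an exact duplicate count.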

Key words

Hadoop / Bloom Filter / number of copies / MapReduce

Classification

Information Technology and Security Science

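Cite this article

Huang Weijian, Yang Hailong. An improved Bloom Filter algorithm under the Hadoop for duplicated web page removal [J]. Computer Engineering & Science, 2017, 39(2): 285-290, 6.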

Funding

Natural Science Foundation of Hebei Province (F2015402077)

Key Basic Research Project of Hebei Province (14964206D)

Computer Engineering & Science

OA / Peking University Core / CSCD / CSTPCD

ISSN 1007-130X
