Computer Engineering & Science, 2017, Vol. 39, Issue (2): 285-290, 6. DOI: 10.3969/j.issn.1007-130X.2017.02.010
An improved Bloom Filter algorithm under the Hadoop for duplicated web page removal
Abstract
To reduce the storage wasted on servers that hold large amounts of duplicated and similar data, we propose an improved Bloom filter algorithm. It adds a bit array and dynamically optimizes the number of copies of duplicated data according to a weight calculated from the repeated hits on the bit array. The improved algorithm is then parallelized on a Hadoop distributed cluster to further improve processing efficiency. Experimental results show that, compared with traditional web page deduplication algorithms, the improved Bloom filter algorithm both improves job processing efficiency and, by optimizing the number of copies of duplicated data, saves server storage space to a certain extent.
Key words
Hadoop/Bloom filter/number of copies/MapReduce
Classification
Information technology and security science
Cite this article
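Huang Weijian (黄伟建), Yang Hailong (杨海龙). An improved Bloom Filter algorithm under the Hadoop for duplicated web page removal [J]. Computer Engineering & Science, 2017, 39(2): 285-290, 6.
Funding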
Natural Science Foundation of Hebei Province (F2015402077)
Key Basic Research Project of Hebei Province (14964206D)
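
The idea summarized in the abstract, a Bloom filter extended with an extra array whose repeated hits drive the number of copies kept, can be illustrated with a minimal single-process sketch. The abstract does not give the weight formula, the layout of the added array, or the mapping from weight to copy count, so everything below is an assumption made for illustration: the class name `HitCountingBloomFilter`, the per-position hit counters, the salted SHA-256 hashing, and the `replica_count` policy (more repeated hits means more copies, capped at 3, which mirrors the HDFS default replication factor). The MapReduce parallelization is omitted.

```python
import hashlib


class HitCountingBloomFilter:
    """Bloom filter with a parallel hit-count array (hypothetical sketch)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # standard Bloom filter bit array
        self.hits = [0] * num_bits      # assumed extra array: repeated hits

    def _positions(self, item):
        # Derive the item's positions by salting SHA-256 with the hash index.
        positions = set()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, item):
        """Insert item; return True if it was probably seen before."""
        positions = self._positions(item)
        seen = all(self.bits[p] for p in positions)
        for p in positions:
            if seen:
                self.hits[p] += 1  # count only repeated (duplicate) hits
            self.bits[p] = True
        return seen

    def weight(self, item):
        """Assumed weight: the smallest hit count over the item's positions."""
        return min(self.hits[p] for p in self._positions(item))


def replica_count(weight, min_copies=1, max_copies=3):
    """Assumed policy: one extra copy per repeated hit, clamped to a range."""
    return max(min_copies, min(max_copies, min_copies + weight))


if __name__ == "__main__":
    bf = HitCountingBloomFilter()
    for page in ["page-a", "page-a", "page-a"]:
        duplicate = bf.add(page)
        print(page, "duplicate" if duplicate else "new",
              "copies:", replica_count(bf.weight(page)))
```

As with any Bloom filter, `add` can report a false positive for a page that was never inserted, so a production deduplication job would normally confirm a reported duplicate before discarding the page.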