Computer Technology and Development (计算机技术与发展), 2018, Vol. 28, Issue 2: 125-129 (5 pages). DOI: 10.3969/j.issn.1673-629X.2018.02.027
Research on Deduplication Algorithm Based on K-medoids Clustering
(Original Chinese title: 基于聚类的重复数据去冗算法的研究)
Abstract
Data damage and loss can cause irreparable harm, which a data backup system can minimize. As the volume of collected data grows, backup systems must handle ever more backup and recovery data, yet the similarity between backup files exceeds 60%, so storing all of the data on disk wastes storage space. To address this, we propose a DELTA compression method based on K-medoids clustering that removes duplicate data from backup data. The method first segments files into blocks, then applies DELTA compression to each pair of blocks and takes the size of the resulting compressed file as the similarity between them. K-medoids clustering on these similarities serves as a preprocessing step before DELTA compression: according to the clustering result, small similar file blocks are merged before DELTA compression. Tests show that the proposed method improves the compression rate, reduces the number of fingerprints in DELTA compression, and shortens compression time.
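Key words
DELTA compression/data compression/clustering/K-medoids
Classification
Information Technology and Security Science
Cite this article
刘赛, 聂庆节, 刘军, 王超, 李静. 基于聚类的重复数据去冗算法的研究 [Research on Deduplication Algorithm Based on K-medoids Clustering] [J]. 计算机技术与发展 (Computer Technology and Development), 2018, 28(2): 125-129.
Funding
Science and Technology Project of State Grid Corporation of China Headquarters (0711-150TL173)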
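
Illustrative sketch
The pipeline summarized in the abstract (split files into blocks, use the pairwise DELTA-compressed size as a similarity measure, cluster with K-medoids, merge similar blocks) could look roughly like the Python below. This is a minimal sketch under several assumptions, not the authors' implementation: fixed-size blocking, zlib compression with a preset dictionary as a stand-in for a real DELTA encoder, a farthest-first initialization, and all function names (`split_blocks`, `delta_size`, `distance_matrix`, `k_medoids`, `merge_clusters`) are hypothetical. The abstract does not specify any of these details.

```python
import zlib


def split_blocks(data: bytes, block_size: int) -> list:
    """Fixed-size blocking (an assumption; the abstract does not say how files are blocked)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def delta_size(reference: bytes, target: bytes) -> int:
    """Stand-in for DELTA compression: compress `target` with `reference`
    as a zlib preset dictionary and return the compressed size in bytes."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return len(comp.compress(target) + comp.flush())


def distance_matrix(blocks: list) -> list:
    """Symmetric pairwise distance: a smaller delta size means more similar blocks."""
    n = len(blocks)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = (delta_size(blocks[i], blocks[j]) + delta_size(blocks[j], blocks[i])) / 2
            dist[i][j] = dist[j][i] = d
    return dist


def k_medoids(dist: list, k: int, max_iter: int = 100) -> dict:
    """Plain alternating K-medoids on a precomputed distance matrix (assumes k <= n).
    Returns a dict mapping each medoid index to the list of member indices."""
    n = len(dist)
    # Farthest-first initialization (an assumption, chosen to keep the sketch deterministic).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every block to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: in each cluster, pick the member with the smallest total distance.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters


def merge_clusters(blocks: list, clusters: dict) -> list:
    """Concatenate the blocks of each cluster into one merged block."""
    return [b"".join(blocks[i] for i in members) for members in clusters.values()]
```

A typical call sequence would be `blocks = split_blocks(data, 512)`, `clusters = k_medoids(distance_matrix(blocks), k)`, then `merge_clusters(blocks, clusters)` before handing the merged blocks to the actual DELTA compressor. Note that a zlib preset dictionary only helps within the 32 KB deflate window, so this stand-in is only reasonable for small blocks.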