Research on a Clustering-Based Algorithm for Removing Duplicate Data

Computer Technology and Development (计算机技术与发展), 2018, Vol. 28, Issue 2: 125-129, 5. DOI: 10.3969/j.issn.1673-629X.2018.02.027

Research on Deduplication Algorithm Based on K-medoids Clustering

Liu Sai (刘赛)¹, Nie Qingjie (聂庆节)¹, Liu Jun (刘军)¹, Wang Chao (王超)², Li Jing (李静)²

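Author information

1. NARI Group Corporation, Nanjing, Jiangsu 210003
2. College of Computer Science, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 211106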

Abstract

Data damage and loss can cause irreparable harm, which a data backup system can minimize. As the volume of collected data grows, backup systems must back up and restore ever more data, yet the similarity between backup files exceeds 60%, so storing every file in full on disk wastes storage space. To address this, we propose a DELTA compression method based on K-medoids clustering that removes duplicate data from the backup data. The method first segments files into blocks, then applies DELTA compression to each pair of blocks and takes the size of the compressed output as the similarity between them. K-medoids clustering is run on these similarities as a preprocessing step before DELTA compression, and small similar file blocks are merged according to the clustering result before they are compressed. Tests show that the proposed method improves the compression rate, reduces the number of fingerprints in DELTA compression, and shortens compression time.
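The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `zlib` compression with a preset dictionary stands in for a real DELTA encoder, the pairwise sizes are symmetrized by averaging, the medoids are initialized naively from the first `k` blocks, and the names `delta_size`, `distance_matrix`, `k_medoids` and `merge_clusters` are hypothetical. Because a smaller compressed size means more similar blocks, the sketch treats that size as a distance.

```python
import zlib


def delta_size(reference: bytes, target: bytes) -> int:
    """Size of `target` compressed against `reference`.

    Assumption: zlib with a preset dictionary stands in for a DELTA encoder.
    A smaller result means the two blocks are more similar.
    """
    comp = zlib.compressobj(level=9, zdict=reference)
    return len(comp.compress(target) + comp.flush())


def distance_matrix(blocks):
    """Pairwise distances between non-empty blocks (averaged over both directions)."""
    n = len(blocks)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = (delta_size(blocks[i], blocks[j])
                 + delta_size(blocks[j], blocks[i])) / 2
            dist[i][j] = dist[j][i] = d
    return dist


def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a precomputed distance matrix.

    Assumption: the first k blocks serve as initial medoids.
    Returns clusters as sorted lists of block indices.
    """
    n = len(dist)
    medoids = list(range(k))
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every block to its nearest medoid.
        groups = {m: [] for m in medoids}
        for i in range(n):
            groups[min(medoids, key=lambda m: dist[i][m])].append(i)
        clusters = list(groups.values())
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [min(g, key=lambda c: sum(dist[c][j] for j in g))
                       for g in clusters]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return sorted(clusters)


def merge_clusters(blocks, clusters):
    """Concatenate the blocks of each cluster into one larger block."""
    return [b"".join(blocks[i] for i in cluster) for cluster in clusters]
```

Under these assumptions, calling `k_medoids(distance_matrix(blocks), k)` groups similar blocks, and `merge_clusters` produces the merged blocks that would then be handed to the DELTA compressor. Computing all pairwise sizes costs a quadratic number of compressions, so a practical system would likely need to bound the number of blocks compared.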

Key words

DELTA compression/data compression/clustering/K-medoids

Classification

Information Technology and Security Science

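Cite this article

Liu Sai, Nie Qingjie, Liu Jun, Wang Chao, Li Jing. Research on Deduplication Algorithm Based on K-medoids Clustering[J]. Computer Technology and Development, 2018, 28(2): 125-129, 5.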

Funding

Science and Technology Project of State Grid Corporation of China Headquarters (0711-150TL173)

Computer Technology and Development

OA, CSTPCD

ISSN: 1673-629X
