Variable Granularity-based Chunk-context Aware Similar Data Deduplication Technique for Cloud Storage

Open access (OA); indexed in CSTPCD

Existing similar-data deduplication techniques for cloud storage remove redundancy poorly and incur high metadata overhead. To address both problems, this paper proposes a variable-granularity similar-data deduplication technique with contextual semantic embedding. The technique first applies a feature extraction algorithm based on sub-block reorganization to extract initial features from the internal structure of each data block's content. A BP (Back Propagation) neural network context-aware model then embeds the contextual feature information of each data block into those initial features, yielding variable-granularity data blocks with contextual semantic embedding. By controlling the data block size, the technique dynamically merges adjacent similar data blocks, or adjacent non-redundant data blocks, to reduce metadata overhead, and it splits the transition regions that lie between similar and non-redundant data blocks to obtain a better representation of similar data blocks. To evaluate its performance, a prototype variable-granularity similar-data detection algorithm, rCARD, was implemented and tested on real-world datasets. The experimental results show that, compared with the recent similarity-detection deduplication technique Finesse, rCARD achieves a higher deduplication ratio while significantly reducing metadata size, and speeds up similarity detection by up to 11.07 times.
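The merging step in the abstract can be illustrated with a minimal Python sketch. It is not the rCARD implementation, whose details the abstract does not give. It assumes each chunk already carries a label ("similar" or "unique") that in the paper would come from the context-aware model, and it merges neighboring chunks with the same label as long as the result stays within a size limit. The names `Chunk`, `merge_adjacent`, and `max_size` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    offset: int  # byte offset of the chunk in the data stream
    length: int  # chunk size in bytes
    label: str   # hypothetical class label: "similar" or "unique"


def merge_adjacent(chunks: List[Chunk], max_size: int) -> List[Chunk]:
    """Merge contiguous neighbors that share a label, capped at max_size bytes.

    Illustrative only: the label is assumed to be given, and the size cap
    stands in for the abstract's "controlling the data block size".
    """
    merged: List[Chunk] = []
    for c in chunks:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.label == c.label
            and last.offset + last.length == c.offset  # must be contiguous
            and last.length + c.length <= max_size     # respect the size cap
        ):
            # Extend the previous chunk instead of storing a new metadata entry.
            merged[-1] = Chunk(last.offset, last.length + c.length, last.label)
        else:
            merged.append(Chunk(c.offset, c.length, c.label))
    return merged


chunks = [
    Chunk(0, 4, "similar"),
    Chunk(4, 4, "similar"),
    Chunk(8, 4, "unique"),
    Chunk(12, 4, "unique"),
    Chunk(16, 4, "similar"),
]
merged = merge_adjacent(chunks, max_size=8)
print(len(chunks), "->", len(merged))  # 5 -> 3
```

Each merged chunk needs only one metadata entry, so fewer, larger chunks mean less metadata to store and search. The size cap keeps merged chunks from growing without bound.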

YANG Zhihuan (阳智欢); TIAN Wenlong (田纹龙); HE Tingting (何婷婷); YE Xuming (叶旭明); TANG Jia (唐佳)

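School of Computer Science, University of South China, Hengyang, Hunan 421001; School of Computer Science, University of South China, Hengyang, Hunan 421001 and School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; School of Education Science, Hengyang Normal University, Hengyang, Hunan 421010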

Computer and Automation

similar data deduplication; data block semantics; variable granularity; cloud storage; metadata

Computer Technology and Development (计算机技术与发展), 2024 (004)

Pages 16-23 (8 pages)

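Natural Science Foundation of Hunan Province (2021JJ40468); Outstanding Youth Project of the Hunan Provincial Department of Education (22B0437); Hunan Provincial Teacher Education Research Base (XJK23AJD014)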

10.20165/j.cnki.ISSN1673-629X.2024.0003
