Computer Engineering and Applications (计算机工程与应用), 2017, Vol. 53, Issue 23: 1-5. DOI: 10.3778/j.issn.1002-8331.1706-0449
Near-deduplication method for next-generation sequencing (NGS) data on cloud platforms
Abstract
Next-generation sequencing (NGS) data need to be processed with cloud computing because of their large volume, complex analysis pipelines, and high demand for computing resources. This approach requires that the sequencing data first be uploaded to the cloud platform. Because the sequencing process is random, the results differ greatly at the binary level even for the same sample or for two similar samples, so existing deduplication methods cannot effectively identify duplicate content in them. Uploading and storing the duplicate data consumes network bandwidth and wastes storage space. Existing methods rely on the binary features of files and do not exploit the similarity between sequencing results. To address this, the paper proposes NPD (Near Probability Deduplication), a method for massive high-throughput sequencing data on cloud platforms. NPD uses SimHash to compute block fingerprints of the sequence and quality information in a FastQ file; the fingerprints are then checked quickly by two cuckoo filters, one on the client and one on the cloud platform. In the final stage, the cloud platform applies an approximate algorithm to near-deduplicate the fingerprints. Experimental results show that NPD improves the deduplication ratio, reduces network traffic, shortens data upload time, and supports massive data processing, which gives it good practical value.
Keywords
high-throughput sequencing data / data deduplication / near-deduplication / cuckoo filter
Classification
Information technology and security science
Cite this article
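Zhao Xiaoyong (赵晓永), Chen Chen (陈晨). Near-deduplication method for next-generation sequencing (NGS) data on cloud platforms [J]. Computer Engineering and Applications, 2017, 53(23): 1-5.
Funding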
National Natural Science Foundation of China (No. 61572079)
General Project of the Science and Technology Plan of the Beijing Municipal Education Commission (No. KM201711232018)
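Illustrative sketch
The abstract states that NPD computes SimHash block fingerprints of the sequence and quality information in a FastQ file. The record gives no implementation details, so the following is only a minimal sketch of generic SimHash fingerprinting applied to a single sequence block, not the authors' code. The k-mer tokenization, the 64-bit fingerprint width, the BLAKE2b token hash, and the Hamming-distance threshold are all assumptions chosen for illustration. The cuckoo filter stage is not shown.

```python
import hashlib


def kmers(block, k=8):
    """Yield overlapping k-mers of a block (assumed tokenization, not from the paper)."""
    if len(block) < k:
        if block:
            yield block  # block shorter than k: use it as a single token
        return
    for i in range(len(block) - k + 1):
        yield block[i:i + k]


def simhash(block, k=8, bits=64):
    """Return a SimHash fingerprint of a sequence or quality block as an int."""
    counts = [0] * bits
    for token in kmers(block, k):
        digest = hashlib.blake2b(token.encode("ascii"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if counts[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a, fp_b, max_distance=3):
    """Treat fingerprints within max_distance bits as near-duplicates (assumed threshold)."""
    return hamming(fp_a, fp_b) <= max_distance


if __name__ == "__main__":
    read_a = "ACGTACGTTAGCCGATAGGCTTAACGGATCCGATAGCTAGGCTA"
    read_b = "ACGTACGTTAGCCGATAGGCTTAACGGATCCGATAGCTAGGCTT"  # differs in the last base
    fp_a, fp_b = simhash(read_a), simhash(read_b)
    print(f"{fp_a:016x} {fp_b:016x} distance={hamming(fp_a, fp_b)}")
```

Similar blocks share most of their k-mers, so their fingerprints tend to differ in only a few bits, whereas a byte-level hash of the same blocks would differ completely. That property is what allows similar but not identical sequencing results to be matched.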