Computer Engineering (计算机工程), 2025, Vol. 51, Issue 5: 177-187, 11. DOI: 10.19678/j.issn.1000-3428.0068749
Parallel Processing Scheme for Sequencing Data in Copy Number Variation Based on MapReduce
Abstract
Copy Number Variation (CNV) is a type of genetic variation that is widely distributed across the human genome. Improving the efficiency of CNV detection can provide patients with faster and more accurate results, significantly reduce medical costs, and facilitate drug development and clinical applications. Currently, Read Depth (RD)-based methods are the most commonly used for CNV detection, but processing RD-related information is slow and accounts for a relatively large share of the total CNV detection time. Existing methods have problems such as ineffective application to whole-genome analysis, low computational efficiency, and decreased detection accuracy. This paper proposes EPPCNV, an efficient parallel processing scheme for sequencing data for CNV detection. In EPPCNV, two MapReduce jobs are designed to process whole-genome sequencing data efficiently in parallel and to extract RD-related information accurately. Moreover, EPPCNV fully considers the impact of GC content bias on CNV detection results and corrects the RD of the sequencing data to ensure high sensitivity and accuracy of the final detection outputs. Further, EPPCNV adopts a highly adaptable data processing method that operates independently of any specific CNV detection method. The final RD-related information generated by EPPCNV can be directly combined with various mainstream CNV detection methods, significantly improving their overall performance without changing how the original method identifies CNV regions. Experimental results show that EPPCNV achieves high comprehensive accuracy and can be directly combined with the CNV-LOF, HBOS-CNV, and CNVnator methods, significantly improving their computational efficiency while maintaining high sensitivity and accuracy. For sequencing data with higher coverage depth and larger data volume, combining a CNV detection method with EPPCNV yields even greater improvements in computational efficiency.
Keywords
Copy Number Variation (CNV) detection / MapReduce job / sequencing data processing / Read Depth (RD) / whole-genome
Classification
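Information Technology and Security Science
Cite this article
HE Heng (何亨), CHENG Kaili (程凯莉), ZHANG Kui (张葵), CHENG Shujun (成淑君). Parallel Processing Scheme for Sequencing Data in Copy Number Variation Based on MapReduce [J]. Computer Engineering, 2025, 51(5): 177-187, 11.
Funding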
National Natural Science Foundation of China (62372343, 61602351).
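Illustrative sketch
The abstract mentions extracting RD-related information with MapReduce jobs and correcting RD for GC content bias, but it does not give the job definitions. The following is only a minimal single-process sketch of that general idea, not the EPPCNV implementation. Everything in it is an assumption made for illustration: the bin width `BIN_SIZE`, the function names `map_read`, `reduce_counts`, and `gc_correct`, the toy input data, and the correction rule itself (rescaling each bin by the global mean RD divided by the mean RD of bins with the same GC percentage, a rule commonly used in RD-based CNV tools).

```python
from collections import defaultdict
from statistics import mean

BIN_SIZE = 1000  # hypothetical bin width in base pairs (assumption, not from the paper)


def map_read(chrom, start):
    """Map step: emit ((chromosome, bin index), 1) for one aligned read start."""
    return (chrom, start // BIN_SIZE), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each bin key to get raw read depth."""
    depth = defaultdict(int)
    for key, count in pairs:
        depth[key] += count
    return dict(depth)


def gc_correct(depth, gc_percent):
    """Rescale each bin's RD by (global mean RD) / (mean RD of bins sharing its GC%).

    Every bin in `depth` holds at least one read, so no group mean is zero.
    """
    global_mean = mean(depth.values())
    by_gc = defaultdict(list)
    for key, rd in depth.items():
        by_gc[gc_percent[key]].append(rd)
    gc_mean = {gc: mean(values) for gc, values in by_gc.items()}
    return {key: rd * global_mean / gc_mean[gc_percent[key]] for key, rd in depth.items()}


# Toy input: (chromosome, 0-based alignment start) for seven reads (made-up data).
reads = [
    ("chr1", 100), ("chr1", 900), ("chr1", 1500),
    ("chr1", 2100), ("chr1", 2200), ("chr1", 2999), ("chr1", 3500),
]
# Hypothetical GC percentage of each bin (made-up data).
gc_percent = {("chr1", 0): 40, ("chr1", 1): 40, ("chr1", 2): 60, ("chr1", 3): 60}

depth = reduce_counts(map_read(chrom, start) for chrom, start in reads)
corrected = gc_correct(depth, gc_percent)
```

In a real MapReduce deployment the map and reduce functions would run on distributed input splits, and the GC statistics would likely need a second pass over the per-bin output, which may be one reason a scheme like this uses more than one job; the abstract does not confirm how EPPCNV divides the work.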