Computer Engineering, 2017, Vol. 43, Issue 3: 1-6 (6 pages). DOI: 10.3969/j.issn.1000-3428.2017.03.001
Spark I/O Performance Optimization Based on Memory and File Sharing Mechanism
Abstract
An analysis of key Spark technologies, such as the resilient distributed dataset (RDD) and Spark task scheduling, shows that I/O time during data processing has a large effect on Spark's computing performance. To address this problem, this paper studies Spark's consolidate-files run mode, which reduces the number of cache files and improves Spark's I/O efficiency to some extent but still incurs a high memory cost. Furthermore, the paper proposes an improved Spark Shuffle process in which every Mapper generates only one cache file and all of a Mapper's buckets share the same memory buffer, which improves I/O efficiency and reduces memory overhead. Simulation results show that, compared with the default mode of Spark, the I/O time of a wide-dependency process is shortened by 42.9%, which improves the memory utilization and the efficiency of the Spark platform.
Keywords
distributed computing/Spark platform/Shuffle process/disk I/O/task scheduling
Classification
Information Technology and Security Science
Cite this article
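HUANG Tinghui, WANG Yuliang, WANG Zhen, CUI Gengshen. Spark I/O Performance Optimization Based on Memory and File Sharing Mechanism [J]. Computer Engineering, 2017, 43(3): 1-6.
Funding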
National Natural Science Foundation of China (61363029)
CERNET Next Generation Internet Technology Innovation Project (NGII20160306)
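
The three Shuffle modes contrasted in the abstract can be compared with a rough cost model. The sketch below is illustrative only and is not taken from the paper: the formulas are a simplification based on how hash-based Shuffle is commonly described (one file per Mapper/Reducer pair by default, one file group per core when files are consolidated), and the function names, the `cores` parameter, and the 32 KiB buffer size used in the example are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuffleCost:
    files: int         # shuffle files written on the map side
    buffer_bytes: int  # bytes held in write buffers by concurrently running Mappers


def default_mode(mappers: int, reducers: int, cores: int, buf: int) -> ShuffleCost:
    # Assumed model: one file per (Mapper, Reducer) pair, and each running
    # Mapper keeps one write buffer per bucket.
    running = min(mappers, cores)
    return ShuffleCost(mappers * reducers, running * reducers * buf)


def consolidated_mode(mappers: int, reducers: int, cores: int, buf: int) -> ShuffleCost:
    # Assumed model: Mappers scheduled on the same core reuse one group of
    # files, so the file count drops, but each bucket still has its own buffer.
    running = min(mappers, cores)
    return ShuffleCost(running * reducers, running * reducers * buf)


def shared_mode(mappers: int, reducers: int, cores: int, buf: int) -> ShuffleCost:
    # Mode described in the abstract: one cache file per Mapper, and all
    # buckets of a Mapper share a single memory buffer.
    running = min(mappers, cores)
    return ShuffleCost(mappers, running * buf)


if __name__ == "__main__":
    # Hypothetical workload: 1000 Mappers, 200 Reducers, 16 cores, 32 KiB buffers.
    args = (1000, 200, 16, 32 * 1024)
    for mode in (default_mode, consolidated_mode, shared_mode):
        print(mode.__name__, mode(*args))
```

Under these assumptions, consolidating files lowers the file count while leaving the buffer footprint unchanged, which matches the abstract's remark that the mode still has a high memory cost. Sharing one buffer per Mapper is what reduces the memory term. The model says nothing about the reported 42.9% I/O time reduction, which comes from the authors' experiments.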