计算机工程与应用 (Computer Engineering and Applications), 2016, Vol. 52, Issue (13): 1-7, 7. DOI: 10.3778/j.issn.1002-8331.1601-0299
Research on Fault Tolerance for Heterogeneous Computing Applications Based on the Charm++ Runtime Environment
Charm++ RTS based fault tolerance mechanism for heterogeneous computing.
Abstract
Fault tolerance is an unavoidable concern for long-running, large-scale applications, and heterogeneous devices make the reliability problem more prominent. Focusing on hardware failures, this paper presents a fault-tolerance mechanism for heterogeneous clusters. The mechanism is implemented in WIGEON, a large-scale parallel cosmological fluid simulation code, using the parallel pattern of Charm++ RTS and CUDA. It is combined with dynamic load balancing so that the application can adapt to the unbalanced compute-node configuration left after a failure. Experiments and analysis on Mole 8.5 validate the efficiency and feasibility of the algorithm, with recovery taking only 1 to 4 seconds. The paper also uses distributed redundant data to improve the Double-in-Memory checkpoint algorithm: the new fault-tolerant algorithm reduces the memory footprint by up to 50% without extra performance loss.
Keywords
fault tolerance / heterogeneous / in-memory (diskless) checkpoint / Charm++ / load balancing / distributed redundancy
Classification
Information technology and security science
Cite this article
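Meng Chen, Cao Zongyan, Wang Long, Chi Xuebin. Charm++ RTS based fault tolerance mechanism for heterogeneous computing [J]. Computer Engineering and Applications, 2016, 52(13): 1-7, 7.
Funding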
National High Technology Research and Development Program of China (863 Program) (No. 2014AA01A302)
Key Deployment Project of the Chinese Academy of Sciences (No. KJZD-EW-TZ-G90).