Research on Fault Tolerance for Heterogeneous Computing Applications Based on the Charm++ Runtime Environment

孟晨 曹宗雁 王龙 迟学斌

计算机工程与应用 (Computer Engineering and Applications), 2016, Vol. 52, Issue 13: 1-7. DOI: 10.3778/j.issn.1002-8331.1601-0299

Charm++ RTS based fault tolerance mechanism of heterogeneous computing

孟晨 (1), 曹宗雁 (2), 王龙 (1), 迟学斌 (1)

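Author information

  • 1. Supercomputing Center, Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190
  • 2. University of Chinese Academy of Sciences, Beijing 100049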

Abstract

Fault tolerance is an unavoidable issue for long-running, large-scale applications, and heterogeneous devices make the reliability problem more prominent. Focusing on hardware failures, this paper presents a fault-tolerance mechanism for heterogeneous clusters. The mechanism is implemented in WIGEON, a large-scale parallel cosmological fluid simulation code, based on the parallel pattern of the Charm++ RTS and CUDA. It is combined with dynamic load balancing to adapt to the unbalanced compute-node configuration that remains after a failure. Experiments and analysis on Mole 8.5 validate the efficiency and feasibility of the algorithm, with recovery taking only 1 to 4 seconds. The mechanism also uses distributed redundant data to improve the Double-in-Memory checkpoint algorithm. The new fault-tolerant algorithm reduces the memory footprint by up to 50% without extra performance loss.

Key words

fault tolerance/heterogeneous/in-memory checkpoint/Charm++/load balancing/distributed redundancy

Classification

Information Technology and Security Science

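Cite this article

孟晨, 曹宗雁, 王龙, 迟学斌. Research on Fault Tolerance for Heterogeneous Computing Applications Based on the Charm++ Runtime Environment [J]. 计算机工程与应用 (Computer Engineering and Applications), 2016, 52(13): 1-7.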

Funding

National High Technology Research and Development Program of China (863 Program) (No. 2014AA01A302)

Key Deployment Project of the Chinese Academy of Sciences (No. KJZD-EW-TZ-G90)

计算机工程与应用 (Computer Engineering and Applications)

Indexed in: OA, 北大核心 (Peking University Core Journals), CSCD, CSTPCD

ISSN: 1002-8331
