
Design of a Fault-Tolerant Environment for the Automatic Scheduling of Parallel Tasks

Liu Jie, Zhang Yitian, Chi Lihua, Xu Han, Jiang Jie, Hu Qingfeng

Computer Engineering & Science, 2009, Vol. 31, Issue 11: 87-90 (4 pages). DOI: 10.3969/j.issn.1007-130X.2009.11.023

Author information

  • 1. School of Computer Science, National University of Defense Technology, Changsha 410073, Hunan, China (all authors)

Abstract

Large-scale scientific and engineering computing involves numerical simulations of unprecedented complexity and the processing of very large data sets, so a fault-tolerant environment that automatically reloads failed parallel tasks is needed. Building on parallel jobs and the system-provided checkpoint/restart function, we design a user-level, fault-tolerant environment for automatic job scheduling. It covers automatic detection (auto-perception) in the scheduling of fault-tolerant parallel programs, automatic reloading, and assurance of data integrity. The experimental results show that the environment meets the design requirements for parallel program scheduling: failed applications are reloaded automatically, and the correctness and completeness of the checkpoint data are ensured.
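The abstract names two mechanisms, automatic reloading and checkpoint data integrity, without giving details. The following is only an illustrative sketch of how such a user-level loop could look, not the authors' implementation. All names (`write_checkpoint`, `latest_valid_checkpoint`, `run_with_auto_reload`), the `ckpt_*.bin` file layout, and the use of a SHA-256 sidecar file are assumptions made for this example. The `launch` callable stands in for whatever submits the parallel job through the system's checkpoint/restart facility.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable, Optional


def write_checkpoint(ckpt_dir: Path, step: int, payload: bytes) -> Path:
    """Write a checkpoint and a sidecar digest (hypothetical file layout)."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final = ckpt_dir / f"ckpt_{step:08d}.bin"
    tmp = final.with_suffix(".tmp")
    tmp.write_bytes(payload)
    # Rename after the write so a reader never sees a half-written .bin file.
    os.replace(tmp, final)
    # The digest is written last; a checkpoint without it counts as invalid.
    final.with_suffix(".sha256").write_text(hashlib.sha256(payload).hexdigest())
    return final


def latest_valid_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the newest checkpoint whose contents match its recorded digest."""
    if not ckpt_dir.is_dir():
        return None
    # Zero-padded step numbers make lexicographic order equal to step order.
    for ckpt in sorted(ckpt_dir.glob("ckpt_*.bin"), reverse=True):
        sidecar = ckpt.with_suffix(".sha256")
        if not sidecar.exists():
            continue
        actual = hashlib.sha256(ckpt.read_bytes()).hexdigest()
        if sidecar.read_text().strip() == actual:
            return ckpt
    return None


def run_with_auto_reload(
    launch: Callable[[Optional[Path]], int],
    ckpt_dir: Path,
    max_restarts: int = 3,
) -> bool:
    """Run the job; after a failure, relaunch it from the latest valid checkpoint.

    `launch` receives the checkpoint to resume from (or None for a fresh
    start) and returns the job's exit code, where 0 means success.
    """
    for _ in range(max_restarts + 1):
        resume_from = latest_valid_checkpoint(ckpt_dir)
        if launch(resume_from) == 0:
            return True
    return False
```

In this sketch a corrupted or incomplete checkpoint is skipped and the next older valid one is used, which is one simple way to keep a restart from loading damaged data.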

Key words

high performance computing; fault tolerance; checkpoint/restart; parallel programming

Classification

Information Technology and Security Science

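Cite this article

Liu Jie, Zhang Yitian, Chi Lihua, Xu Han, Jiang Jie, Hu Qingfeng. Design of a Fault-Tolerant Environment for the Automatic Scheduling of Parallel Tasks [J]. Computer Engineering & Science, 2009, 31(11): 87-90.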

Funding

National Natural Science Foundation of China (60673150, 60603061)

National High Technology Research and Development Program of China (863 Program) (2008AA01Z137)

Journal: Computer Engineering & Science (ISSN 1007-130X). Indexing: OA, PKU Core (北大核心), CSCD, CSTPCD
