Computer Engineering & Science (计算机工程与科学), 2009, Vol. 31, Issue 11: 87-90 (4 pages). DOI: 10.3969/j.issn.1007-130X.2009.11.023
并行作业容错自动调度环境设计
Design of a Fault-Tolerant Environment for the Automatic Scheduling of Parallel Tasks
Abstract
Large-scale scientific and engineering computing involves numerical simulations of unprecedented complexity and the processing of huge volumes of data, so a fault-tolerant environment that automatically reloads failed parallel tasks is needed. Building on parallel jobs and the system-provided checkpoint/restart function, we design a user-level fault-tolerant environment for automatic job scheduling. It covers automatic perception in the scheduling of fault-tolerant parallel programs, automatic reloading, and assurance of data integrity. The experimental results show that the environment meets the design requirements for parallel program scheduling: it automatically reloads failed applications and ensures the correctness and completeness of the checkpoint data.
Keywords
high performance computing; fault tolerance; checkpoint/restart; parallel programming
Classification
Information Technology and Security Science
Cite this article
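LIU Jie, ZHANG Yitian, CHI Lihua, XU Han, JIANG Jie, HU Qingfeng. Design of a Fault-Tolerant Environment for the Automatic Scheduling of Parallel Tasks [J]. Computer Engineering & Science, 2009, 31(11): 87-90.
Funding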
National Natural Science Foundation of China (60673150, 60603061)
National High Technology Research and Development Program of China (863 Program) (2008AA01Z137)