Automatic Fault Tolerance Based on Job Checkpointing in Resource Management Systems

Cao Hongjia (曹宏嘉), Lu Yutong (卢宇彤), Xie Min (谢旻)

Computer Engineering & Science, 2009, Vol. 31, Issue 11, pp. 66-68, 109 (4 pages). DOI: 10.3969/j.issn.1007-130X.2009.11.017

Automatic Fault-Tolerance Support in Resource Management System Based on Job Checkpoint/Restart

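Cao Hongjia¹, Lu Yutong¹, Xie Min¹

Author information

  • 1. School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073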

Abstract

An automatic fault-tolerance method based on job checkpoint/restart in resource management systems is proposed. The key technologies are presented, including the separation of job checkpoint and task checkpoint, management of checkpoint image files, and automatic job restart. Automatic job checkpoint/restart with BLCR is implemented in SLURM, and the challenges are discussed. Analysis and experiments show that checkpoint and restart work correctly and that the time to complete large-scale jobs is reduced effectively.

Key words

fault tolerance; job checkpoint/restart; resource management

Classification

Information Technology and Security Science

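Cite this article

Cao Hongjia, Lu Yutong, Xie Min. Automatic Fault-Tolerance Support in Resource Management System Based on Job Checkpoint/Restart [J]. Computer Engineering & Science, 2009, 31(11): 66-68, 109.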

Computer Engineering & Science

OA · PKU Core (北大核心) · CSCD · CSTPCD

ISSN 1007-130X
