Computer Engineering & Science (计算机工程与科学), 2009, Vol. 31, Issue 11: 66-68, 109, 4. DOI: 10.3969/j.issn.1007-130X.2009.11.017
资源管理系统中基于作业检查点的自动容错
Automatic Fault-Tolerance Support in Resource Management System Based on Job Checkpoint/Restart
曹宏嘉 (Cao Hongjia) 1, 卢宇彤 (Lu Yutong) 1, 谢旻 (Xie Min) 1
Author information
- 1. College of Computer, National University of Defense Technology, Changsha, Hunan 410073
Abstract
An automatic fault-tolerance method based on job checkpoint/restart in resource management systems is proposed. The key technologies are presented, including the separation of job checkpoint and task checkpoint, the management of checkpoint image files, and automatic job restart. Automatic job checkpoint/restart with BLCR is implemented in SLURM, and the challenges are discussed. Analysis and experiments show that checkpoint and restart work correctly and that the time to complete large-scale jobs is effectively reduced.
Key words
fault-tolerance/job checkpoint/restart/resource management
Classification
Information technology and security science
Cite this article
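Cao Hongjia, Lu Yutong, Xie Min. Automatic Fault-Tolerance Support in Resource Management System Based on Job Checkpoint/Restart [J]. Computer Engineering & Science, 2009, 31(11): 66-68, 109, 4.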