Computer Engineering & Science (计算机工程与科学), 2009, Vol. 31, Issue (z1): 237-240, 4. DOI: 10.3969/j.issn.1007-130X.2009.A1.068
大规模计算系统故障特征及容错机制分析
Survey on the Dependability and the Fault Tolerance Mechanism for Large Scale Computing Systems
Abstract
This paper discusses the running stability of several large scale computing systems. First, we summarize the main fault models and features based on publicly available fault data. Second, drawing on a survey of system fault tolerance research, we introduce the challenges and likely mechanisms for fault tolerance in even larger scale computing systems.
Keywords
Large scale computing system / Fault / Fault tolerance / Checkpoint restart
Classification
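Information Technology and Security Science
Cite this article
WU Linping, LUO Hongbing, LIU Yongpeng. Survey on the Dependability and the Fault Tolerance Mechanism for Large Scale Computing Systems [J]. Computer Engineering & Science, 2009, 31(z1): 237-240, 4.
Funding
National Natural Science Foundation of China (60803045)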