
SIC: Incremental Checkpointing for Large Language Model Training


计算机应用研究 (Application Research of Computers), 2025, 42(11): 3397-3404, 8. DOI: 10.19734/j.issn.1001-3695.2025.04.0081

SIC: Incremental Checkpointing for Large Language Model Training

王志强 1, 朱文喆 1, 闫超美 1, 李永坤 2

Author information

  • 1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
  • 2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; Anhui Province Key Laboratory of High Performance Computing, University of Science and Technology of China, Hefei 230027, China

Abstract

Large language model training frequently encounters software and hardware failures, prolonging runtimes and wasting resources. Checkpointing is a key fault-tolerance mechanism, yet conventional full checkpointing limits checkpoint frequency and incurs high storage costs. This paper proposed significance-aware incremental checkpointing (SIC), which employed a layer-wise online filtering algorithm to identify critical parameter updates in each network layer, and used a dynamic threshold adjustment mechanism to adaptively preserve essential changes without losing vital information. A theoretical analysis bounded SIC's impact on convergence. Empirical results show that saving just 2% of parameters per iteration maintains both accuracy and convergence. Under identical overhead constraints, SIC raises checkpoint frequency by 9~17 times compared with state-of-the-art full checkpointing while reducing storage overhead to 3%. Thus, SIC achieves both high efficiency and minimal storage cost.
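The abstract describes the mechanism only at a high level. The following is a minimal NumPy sketch of what layer-wise significance filtering with a per-layer threshold could look like: for each layer, updates since the last checkpoint are ranked by magnitude and only the top fraction is stored as a sparse delta. The function names (`incremental_checkpoint`, `restore`) and the top-k thresholding rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def incremental_checkpoint(params, last_saved, keep_fraction=0.02):
    """Sketch: keep only the most significant parameter updates per layer.

    For each layer, computes the delta against the last checkpoint,
    derives a per-layer magnitude threshold (the k-th largest |delta|),
    and stores only the indices and values above it.
    """
    delta_ckpt = {}
    for name, p in params.items():
        delta = (p - last_saved[name]).ravel()
        k = max(1, int(keep_fraction * delta.size))
        # Per-layer threshold: magnitude of the k-th largest update.
        threshold = np.partition(np.abs(delta), -k)[-k]
        idx = np.flatnonzero(np.abs(delta) >= threshold)
        delta_ckpt[name] = (idx, delta[idx])
    return delta_ckpt

def restore(last_saved, delta_ckpt):
    """Rebuild parameters from the previous state plus the sparse delta."""
    params = {}
    for name, base in last_saved.items():
        flat = base.copy().ravel()
        idx, vals = delta_ckpt[name]
        flat[idx] += vals  # unsaved positions keep their old values
        params[name] = flat.reshape(base.shape)
    return params
```

Restoring from such a checkpoint is exact only at the saved positions; all other parameters fall back to the previous checkpoint, which is why the paper's convergence analysis matters.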


Key words

large language model / fault-tolerant training / checkpointing / incremental checkpointing

Category

Computer and Automation

Cite this article

王志强, 朱文喆, 闫超美, 李永坤. SIC: Incremental Checkpointing for Large Language Model Training [J]. 计算机应用研究, 2025, 42(11): 3397-3404, 8.

Funding

National Natural Science Foundation of China, General Program (62472392)

计算机应用研究 (Application Research of Computers), ISSN 1001-3695, open access, 北大核心 (Peking University Core Journal)
