计算机应用研究 (Application Research of Computers), 2025, Vol. 42, Issue 11: 3397-3404. DOI: 10.19734/j.issn.1001-3695.2025.04.0081
SIC: Incremental Checkpointing for Large Language Model Training
Abstract
Large language model training frequently encounters software and hardware failures, prolonging runtimes and wasting resources. Checkpointing is a key fault-tolerance mechanism, yet conventional full checkpointing limits checkpoint frequency and incurs high storage costs. This paper proposes significance-aware incremental checkpointing (SIC), which employs a layer-wise online filtering algorithm to identify critical parameter updates in each network layer and a dynamic threshold adjustment mechanism to adaptively preserve essential changes without losing vital information. A theoretical analysis bounds SIC's impact on convergence. Empirical results show that saving just 2% of parameters per iteration maintains both accuracy and convergence. Under identical overhead constraints, SIC raises checkpoint frequency by 9 to 17 times compared with state-of-the-art full checkpointing while reducing storage overhead to 3%. Thus, SIC achieves both high efficiency and minimal storage cost.
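The abstract describes the mechanism only at a high level, so the following is a minimal sketch of significance-aware incremental saving, not the paper's published algorithm. It assumes a PyTorch model; the class name SICCheckpointer, the budget and adjust parameters, and the multiplicative threshold-update rule are all illustrative assumptions.

```python
import torch

class SICCheckpointer:
    """Sketch of significance-aware incremental checkpointing.

    For each layer, only parameter entries whose drift since the last
    saved value exceeds a per-layer threshold are written out. The
    threshold is nudged up or down so the saved fraction tracks a
    target budget (e.g. 2% of parameters per iteration).
    """

    def __init__(self, model, budget=0.02, adjust=1.1):
        self.budget = budget   # target fraction of parameters saved
        self.adjust = adjust   # multiplicative threshold step
        # Last checkpointed copy of every parameter (the baseline).
        self.baseline = {n: p.detach().clone()
                         for n, p in model.named_parameters()}
        # One adaptive significance threshold per layer.
        self.threshold = {n: 1e-3 for n in self.baseline}

    def checkpoint(self, model):
        delta = {}
        for name, p in model.named_parameters():
            drift = (p.detach() - self.baseline[name]).abs()
            mask = drift > self.threshold[name]
            frac = mask.float().mean().item()
            # Dynamic threshold adjustment: keep the saved
            # fraction of this layer near the global budget.
            if frac > self.budget:
                self.threshold[name] *= self.adjust
            else:
                self.threshold[name] /= self.adjust
            idx = mask.nonzero(as_tuple=False)
            if idx.numel():
                vals = p.detach()[mask].clone()
                delta[name] = (idx, vals)
                # Only the saved entries become the new baseline.
                self.baseline[name][mask] = vals
        # Sparse increment; persist with torch.save(delta, path).
        return delta
```

Under this reading, recovery would replay the sparse increments on top of the most recent full checkpoint; the paper's actual filtering and threshold rules may differ from this sketch.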
Keywords: large language model / fault-tolerant training / checkpointing / incremental checkpointing
Classification: Computer and Automation
Citation: 王志强, 朱文喆, 闫超美, 李永坤. SIC: Incremental Checkpointing for Large Language Model Training [J]. 计算机应用研究, 2025, 42(11): 3397-3404.
Funding: National Natural Science Foundation of China, General Program (62472392)