数据与计算发展前沿 (Frontiers of Data and Computing), 2025, Vol.7, Issue(1): 135-151, 17. DOI: 10.11871/jfdc.issn.2096-742X.2025.01.010
Research on Checkpoint-Based Elastic Training Methods for Large Language Models
Abstract
[Objective] Given the enormous computing resources required to train large language models, the elastic training capability of a distributed training framework is crucial: it allows training to adjust when resources change so that the job proceeds smoothly. [Methods] To meet this need, this article proposes an elastic training mechanism for 3D parallel training and memory optimization, and introduces a checkpoint-based elastic training method into the deep learning framework Megatron-DeepSpeed. [Results] The proposed checkpoint-based elastic training method is applicable to model training under different resource allocations. By implementing the 3D parallel training and memory-optimized elastic training strategy on the LLaMA1 model, 8 groups of experiments are set up for comparison, and indicators such as the change in loss value, job completion time, and memory allocation are compared after elastic resource changes. [Conclusions] The experimental results show that the model loss is similar under different degrees of parallelism after elastic resource changes, which confirms the scalability of the model under different resource configurations. The model maintains training continuity under resource constraints, and with sufficient resources, training speed and performance can be significantly improved by increasing computing resources and the degree of parallelism.

Key words: large language model; distributed training; elastic training; checkpoint
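The core idea summarized in the abstract can be illustrated with a minimal sketch: training state is saved to a parallelism-agnostic checkpoint, and on restart the parameters are re-partitioned across however many workers are then available. This is a hypothetical illustration only, not the paper's actual Megatron-DeepSpeed implementation; the function names and the JSON checkpoint format are assumptions made for the example.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, params):
    """Persist a consolidated (unsharded) view of the training state,
    so that it does not depend on the current degree of parallelism."""
    with open(path, "w") as f:
        json.dump({"step": step, "params": params}, f)

def load_and_reshard(path, num_workers):
    """Resume from the checkpoint and split the parameters for the
    new world size chosen after an elastic resource change."""
    with open(path) as f:
        state = json.load(f)
    params = state["params"]
    # Round-robin partition: worker i receives elements i, i+n, i+2n, ...
    shards = [params[i::num_workers] for i in range(num_workers)]
    return state["step"], shards

# Save while running at data-parallel degree 4, then elastically
# resume with only 2 workers available.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(ckpt, step=100, params=list(range(8)))
step, shards = load_and_reshard(ckpt, num_workers=2)
```

The resumed job continues from the recorded step, and every parameter appears in exactly one shard regardless of the new worker count, which is the continuity property the experiments evaluate.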
Citation: 王子健, 李凯, 曹荣强, 周纯葆. Research on Checkpoint-Based Elastic Training Methods for Large Language Models [J]. Frontiers of Data and Computing, 2025, 7(1): 135-151, 17.
Funding: Science and Technology Project of the Headquarters of State Grid Corporation of China (5700-202358842A-4-3-WL)