数据与计算发展前沿 (Frontiers of Data and Computing), 2025, Vol.7, Issue(1): 135-151, 17. DOI: 10.11871/jfdc.issn.2096-742X.2025.01.010
Research on Checkpoint-Based Elastic Training Methods for Large Language Models
Abstract
[Objective] Given the enormous computing resources required to train large language models, the elastic training capability of a distributed training framework is crucial: it allows training to adjust when resources change so that the job proceeds smoothly. [Methods] To meet this need, this article proposes an elastic training mechanism for 3D parallel training and memory optimization, and introduces a checkpoint-based elastic training method into the deep learning framework Megatron-DeepSpeed. [Results] The proposed checkpoint-based elastic training method is applicable to model training under different resource allocations. By implementing the 3D parallel training and memory-optimized elastic training strategy on the LLaMA1 model, 8 groups of experiments are set up for comparison, and indicators such as the change in loss value, job completion time, and memory allocation are compared after elastic resource changes. [Conclusions] The experimental results show that the model loss is similar under different degrees of parallelism after elastic resource changes, which confirms the scalability of the model under different resource configurations. The model maintains training continuity under resource constraints, and with sufficient resources, training speed and performance can be significantly improved by increasing computing resources and the degree of parallelism.

Key words: large language model; distributed training; elastic training; checkpoint
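The core idea summarized in the abstract can be illustrated with a minimal sketch: training state is saved to a parallelism-agnostic checkpoint, and on restart the parameters are re-partitioned across however many workers are then available. This is a hypothetical illustration only, not the paper's actual Megatron-DeepSpeed implementation; the function names and the JSON checkpoint format are assumptions made for the example.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, params):
    """Persist a consolidated (unsharded) view of the training state,
    so that it does not depend on the current degree of parallelism."""
    with open(path, "w") as f:
        json.dump({"step": step, "params": params}, f)

def load_and_reshard(path, num_workers):
    """Resume from the checkpoint and split the parameters for the
    new world size chosen after an elastic resource change."""
    with open(path) as f:
        state = json.load(f)
    params = state["params"]
    # Round-robin partition: worker i receives elements i, i+n, i+2n, ...
    shards = [params[i::num_workers] for i in range(num_workers)]
    return state["step"], shards

# Save while running at data-parallel degree 4, then elastically
# resume with only 2 workers available.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(ckpt, step=100, params=list(range(8)))
step, shards = load_and_reshard(ckpt, num_workers=2)
```

The resumed job continues from the recorded step, and every parameter appears in exactly one shard regardless of the new worker count, which is the continuity property the experiments evaluate.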
Citation: 王子健, 李凯, 曹荣强, 周纯葆. Research on Checkpoint-Based Elastic Training Methods for Large Language Models [J]. Frontiers of Data and Computing, 2025, 7(1): 135-151, 17.
Funding: Science and Technology Project of the Headquarters of State Grid Corporation of China (5700-202358842A-4-3-WL)