大数据, 2026, Vol. 12, Issue 2: 75-84, 10. DOI: 10.11959/j.issn.2096-0271.2026029
深度学习模型训练过程检查点访问性能优化方法
Checkpoint accessing performance optimization method for the deep learning model training process
Abstract
As LLMs become more widely used and their scale continues to expand, LLM training faces issues such as high error rates and poor checkpoint access performance. This paper reviews the strengths and weaknesses of existing methods for optimizing checkpoint access performance and introduces a novel method. The method builds on an observation about data patterns in checkpoints: model weights change very little between adjacent checkpoints, which makes them well suited to delta compression. The proposed method implements delta compression across multiple interconnected training nodes and is evaluated on real checkpoints generated during deep learning model training. The results show that, during model training, delta compression achieves a good compression effect for most checkpoints. Furthermore, the paper introduces dynamic intervals in delta compression to balance compression ratio and storage overhead, and it analyzes the characteristics of momentum data. The analysis of existing methods and the optimization of checkpoint access performance offer insights for accelerating LLM training.
Keywords
LLM / checkpoint / data compression / performance improvement
Classification
Information Technology and Security Science
Cite this article
滕云, 张广艳, 孙大为, 田海东, 常锐. 深度学习模型训练过程检查点访问性能优化方法[J]. 大数据, 2026, 12(2): 75-84, 10.
Funding
The National Natural Science Foundation of China (No. 62025203)
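
Illustrative sketch
The abstract's central idea, that adjacent checkpoints differ only slightly and therefore compress well as deltas, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a byte-wise XOR delta compressed with zlib, it runs on a single machine, and it does not model the multi-node setup, the dynamic intervals, or the momentum data mentioned above. All function names are hypothetical, and the simulated checkpoints change only a small subset of weights, which is an idealization of real training.

```python
import random
import zlib
from array import array


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    n = len(a)
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return x.to_bytes(n, "little")


def encode_delta(base: bytes, current: bytes) -> bytes:
    """Store `current` as a zlib-compressed XOR delta against `base`."""
    return zlib.compress(xor_bytes(base, current))


def decode_delta(base: bytes, blob: bytes) -> bytes:
    """Rebuild the checkpoint from `base` and a blob made by encode_delta."""
    return xor_bytes(base, zlib.decompress(blob))


def simulate_adjacent_checkpoints(n=10_000, changed=100, seed=0):
    """Hypothetical stand-in for two adjacent checkpoints (float32 weights).

    Assumption: only `changed` of the `n` weights move between checkpoints.
    """
    rng = random.Random(seed)
    base = array("f", [rng.random() for _ in range(n)])
    current = array("f", base)
    for i in rng.sample(range(n), changed):
        current[i] += 0.001
    return base.tobytes(), current.tobytes()


if __name__ == "__main__":
    base, current = simulate_adjacent_checkpoints()
    blob = encode_delta(base, current)
    # The delta must restore the newer checkpoint exactly.
    assert decode_delta(base, blob) == current
    print("full checkpoint, zlib only:", len(zlib.compress(current)), "bytes")
    print("XOR delta + zlib:", len(blob), "bytes")
```

Because unchanged weights XOR to zero bytes, the delta is dominated by long zero runs that a general-purpose compressor handles well, whereas the raw float data is close to incompressible. Restoring a checkpoint requires the base it was encoded against, which is the storage and recovery trade-off that an interval between full checkpoints has to balance.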