高技术通讯, 2026, Vol. 36, Issue 3: 230-243 (14 pages). DOI: 10.3772/j.issn.1002-0470.2026.03.002
Moonlight:基于数据处理器的大模型训练在网检查点结构
Moonlight: DPU-based in-network checkpoint structure for large model training
Abstract
Large models have a vast number of parameters, and training them requires thousands of neural network accelerators working together for days or even months. Because parallel training algorithms synchronize data frequently, an error in a single training node can cause the loss of training progress and a significant waste of computing resources. Existing training clusters typically use checkpoints to recover from errors. However, current checkpoint strategies incur heavy host central processing unit (CPU) usage and memory overhead, and the efficiency of checkpoint operations is limited by the low bandwidth of remote storage. To address these issues, this paper proposes Moonlight, a new in-network checkpoint strategy based on the data processing unit (DPU). Moonlight offloads the checkpoint workload from the host to network devices, provides checkpoint management and control functions through hardware structure designs, and implements hierarchical checkpoint storage using network device storage, offering an efficient checkpoint structure for large model training. Experimental results show that: (1) Moonlight effectively offloads the checkpoint workload from hosts, with no host CPU usage on the checkpoint data plane and negligible host memory overhead; (2) Moonlight provides efficient checkpoint saving, with a backend storage bandwidth 4.10 times that of existing commercial solutions and a checkpoint packet sending efficiency 1.96 times that of the baseline solution.
Key words
checkpoint / remote direct memory access / data processing unit / large model training
Cite this article
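赵巍岳, 吴婧雅, 卢文岩, 李华伟, 李晓维, 鄢贵海. Moonlight:基于数据处理器的大模型训练在网检查点结构[J]. 高技术通讯, 2026, 36(3): 230-243.
Funding
Supported by the National Natural Science Foundation of China (62090024, 92373206, 62002340, 62090020) and the Strategic Priority Research Program of the Chinese Academy of Sciences, Category B (XDB0660100).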