
Moonlight: DPU-based in-network checkpoint structure for large model training

Zhao Weiyue, Wu Jingya, Lu Wenyan, Li Huawei, Li Xiaowei, Yan Guihai

Chinese High Technology Letters (高技术通讯), 2026, Vol. 36, Issue 3: 230-243 (14 pages). DOI: 10.3772/j.issn.1002-0470.2026.03.002

Moonlight: DPU-based in-network checkpoint structure for large model training

Zhao Weiyue (1), Wu Jingya (2), Lu Wenyan (2), Li Huawei (2), Li Xiaowei (2), Yan Guihai (2)

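Author information

  • 1. State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049
  • 2. State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190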

Abstract

Large models have a vast number of parameters, and training them requires thousands of neural network accelerators to work together for days or even months. Because parallel training algorithms synchronize data frequently, an error in a single training node can cause loss of training progress and significant waste of computing resources. Existing training clusters typically use checkpoints to recover from errors. However, current checkpoint strategies incur heavy host central processing unit (CPU) usage and memory overhead, and the efficiency of checkpoint operations is limited by the low bandwidth of remote storage. To address these issues, this paper proposes Moonlight, a new in-network checkpoint strategy based on the data processing unit (DPU). Moonlight offloads the checkpoint workload from the host to network devices, provides checkpoint management and control functions through hardware structure designs, and implements hierarchical checkpoint storage using network device storage, offering an efficient checkpoint structure for large model training. Experimental results show that: (1) Moonlight effectively offloads the checkpoint workload from hosts, with no host CPU usage on the checkpoint data plane and negligible host memory overhead; (2) Moonlight provides efficient checkpoint saving, with backend storage bandwidth 4.10 times that of existing commercial solutions and checkpoint packet sending efficiency 1.96 times that of the baseline solution.
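The hierarchical storage idea in the abstract can be illustrated with a generic two-tier sketch. This is not Moonlight's design or interface: the class `TieredCheckpointStore`, its method names, and the file naming scheme are all hypothetical, and plain directories stand in for network-device storage (fast tier) and remote storage (slow tier). It only shows the general pattern of committing a checkpoint to a nearby fast tier and flushing it to a slower tier in the background, so the saver does not wait on remote bandwidth.

```python
from __future__ import annotations

import os
import shutil
import threading
from pathlib import Path


class TieredCheckpointStore:
    """Hypothetical two-tier store: fast tier first, slow tier in background.

    Directories are stand-ins (an assumption of this sketch) for
    network-device storage and remote storage.
    """

    def __init__(self, fast_dir: str, slow_dir: str) -> None:
        self.fast = Path(fast_dir)
        self.slow = Path(slow_dir)
        self.fast.mkdir(parents=True, exist_ok=True)
        self.slow.mkdir(parents=True, exist_ok=True)
        self._pending: list[threading.Thread] = []

    def save(self, step: int, payload: bytes) -> None:
        """Commit to the fast tier, then start an asynchronous flush."""
        name = f"ckpt_{step:08d}.bin"
        tmp = self.fast / (name + ".tmp")
        tmp.write_bytes(payload)
        # Atomic rename: readers never observe a half-written checkpoint.
        os.replace(tmp, self.fast / name)
        worker = threading.Thread(target=self._flush, args=(name,))
        worker.start()
        self._pending.append(worker)

    def _flush(self, name: str) -> None:
        """Copy one committed checkpoint from the fast tier to the slow tier."""
        tmp = self.slow / (name + ".tmp")
        shutil.copyfile(self.fast / name, tmp)
        os.replace(tmp, self.slow / name)

    def wait(self) -> None:
        """Block until every background flush has finished."""
        for worker in self._pending:
            worker.join()
        self._pending.clear()

    def load_latest(self) -> tuple[int, bytes]:
        """Return (step, payload) of the newest checkpoint, fast tier first."""
        for tier in (self.fast, self.slow):
            # Zero-padded step numbers make lexical order match numeric order.
            files = sorted(tier.glob("ckpt_*.bin"))
            if files:
                latest = files[-1]
                step = int(latest.stem.split("_")[1])
                return step, latest.read_bytes()
        raise FileNotFoundError("no checkpoint in either tier")
```

In this sketch, `save` returns as soon as the fast-tier write is committed, and recovery falls back to the slow tier only when the fast tier holds no checkpoint. A real system would also need eviction from the fast tier and error handling for failed flushes, both of which are omitted here.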

Keywords

checkpoint / remote direct memory access / data processing unit / large model training

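Cite this article

Zhao Weiyue, Wu Jingya, Lu Wenyan, Li Huawei, Li Xiaowei, Yan Guihai. Moonlight: DPU-based in-network checkpoint structure for large model training [J]. Chinese High Technology Letters (高技术通讯), 2026, 36(3): 230-243.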

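Funding

Supported by the National Natural Science Foundation of China (62090024, 92373206, 62002340, 62090020) and the Strategic Priority Research Program of the Chinese Academy of Sciences, Category B (XDB0660100).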

Chinese High Technology Letters (高技术通讯)

ISSN 1002-0470
