Big Data Research, 2025, Vol. 11, Issue 3: 17-32, 16. DOI: 10.11959/j.issn.2096-0271.2025040
SpanTrain: a cross-domain distributed model training system for cloud-edge-end heterogeneous devices
Abstract
Beyond cloud computing centers, edge and end environments, represented by the Internet of Things and fixed or mobile edge computing nodes, now host large numbers of intelligent computing devices. Extending deep neural network (DNN) training from cloud computing centers to the edge and end offers significant advantages in supporting new application patterns, protecting data privacy, and controlling training costs. However, most existing distributed training systems are designed for homogeneous devices and adapt poorly to heterogeneous cloud-edge-end computing environments. To address this, SpanTrain, a cross-domain distributed training system for heterogeneous cloud, edge, and end devices, is designed. Through a novel hybrid pipeline parallel mechanism, it enables efficient DNN model training across collaborating heterogeneous cloud, edge, and end devices. Experiments in typical cloud-edge-end heterogeneous environments demonstrate that SpanTrain achieves 1.17x to 3.15x higher training throughput than state-of-the-art systems, while improving the resource utilization of heterogeneous devices by 39%. These results validate the efficiency of SpanTrain for DNN training in cloud-edge-end heterogeneous environments.
Keywords
cloud-edge-end heterogeneous device / distributed computing / DNN training / parallel training strategy
Classification
Computer and Automation
Cite this article
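王锦权, 唐熙程, 刘旭昭, 廖晓坚, 肖利民, 霍志胜, 索珈顺, 李云潼, 沈润楠, 谢喜龙. SpanTrain: a cross-domain distributed model training system for cloud-edge-end heterogeneous devices[J]. Big Data Research, 2025, 11(3): 17-32, 16.
Funding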
The National Key Research and Development Program of China (No. 2023YFB4503100)