
SpanTrain: a cross-domain distributed model training system for cloud-edge-end heterogeneous devices

王锦权 唐熙程 刘旭昭 廖晓坚 肖利民 霍志胜 索珈顺 李云潼 沈润楠 谢喜龙

大数据 (Big Data Research), 2025, Vol. 11, Issue 3: 17-32, 16. DOI: 10.11959/j.issn.2096-0271.2025040

SpanTrain: a cross-domain distributed model training system for cloud-edge-end heterogeneous devices

王锦权¹, 唐熙程¹, 刘旭昭¹, 廖晓坚¹, 肖利民¹, 霍志胜¹, 索珈顺¹, 李云潼¹, 沈润楠², 谢喜龙¹

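Author information

  • 1. School of Computer Science and Engineering, Beihang University, Beijing 100191; State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing 100191
  • 2. School of Computer Science and Engineering, Beihang University, Beijing 100191; State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing 100191; Shen Yuan Honors College, Beihang University, Beijing 100191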

Abstract

Beyond cloud computing centers, edge and end environments, represented by the Internet of Things and by fixed or mobile edge computing nodes, now host large numbers of intelligent computing devices. Extending deep neural network (DNN) training from cloud computing centers to the edge and end offers significant advantages: it supports new application patterns, protects data privacy, and helps control training costs. However, most existing distributed training systems are designed for homogeneous devices and adapt poorly to heterogeneous cloud-edge-end computing environments. To address this, SpanTrain, a cross-domain distributed training system built on heterogeneous cloud, edge, and end devices, is designed. Through a novel hybrid pipeline parallel mechanism, it enables efficient DNN model training in which heterogeneous cloud, edge, and end devices collaborate. Experiments in typical cloud-edge-end heterogeneous environments show that SpanTrain achieves 1.17x to 3.15x higher training throughput than state-of-the-art systems while improving the resource utilization of heterogeneous devices by 39%. These results validate the efficiency of SpanTrain for DNN training in cloud-edge-end heterogeneous environments.

Keywords

cloud-edge-end heterogeneous device / distributed computing / DNN training / parallel training strategy

Classification

Computers and Automation

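Cite this article

王锦权, 唐熙程, 刘旭昭, 廖晓坚, 肖利民, 霍志胜, 索珈顺, 李云潼, 沈润楠, 谢喜龙. SpanTrain: a cross-domain distributed model training system for cloud-edge-end heterogeneous devices [J]. 大数据 (Big Data Research), 2025, 11(3): 17-32, 16.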

Funding

The National Key Research and Development Program of China (No. 2023YFB4503100)

大数据 (Big Data Research), ISSN 2096-0271
