Cloudless-Training: An Efficient Serverless-Based Framework for Geo-Distributed ML Training

Chinese High Technology Letters (高技术通讯), 2024, Vol. 34, Issue 3: 219-232 (14 pages). DOI: 10.3772/j.issn.1002-0470.2024.03.001

Cloudless-Training:a framework to improve efficiency of geo-distributed ML training based on serverless

谭文婷 (1), 吕存驰 (1), 史骁 (2), 赵晓芳 (3)

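Author information

  • 1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049
  • 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; Nanjing Institute of InforSuperBahn (中科南京信息高铁研究院), Nanjing 211135
  • 3. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; Suzhou Institute of Intelligent Computing Technology (中科苏州智能计算技术研究院), Suzhou 215028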

Abstract

Geo-distributed machine learning (ML) training can benefit many emerging ML scenarios (e.g., large model training, federated learning) by using multi-regional cloud resources connected over wide area networks. However, its efficiency is limited by two challenges. First, efficient elastic scheduling of multi-regional cloud resources is usually missing, which hurts resource utilization and training performance. Second, training communication over the wide area network (WAN) remains the main overhead and is vulnerable to the WAN's low bandwidth and high fluctuations. This paper proposes Cloudless-Training, a framework that realizes efficient geo-distributed ML training in three aspects. First, it uses a two-layer architecture with a control plane and a physical training plane to support elastic scheduling and communication for multi-regional clouds in a serverless manner. Second, it provides an elastic scheduling strategy that deploys training workflows adaptively according to the heterogeneity of available cloud resources and the distribution of pre-existing training datasets. Third, it provides two new synchronization strategies for training partitions among clouds: asynchronous stochastic gradient descent with gradient accumulation (ASGD-GA) and inter-parameter server (PS) model averaging (MA). It is implemented with OpenFaaS and evaluated on Tencent Cloud. Experimental results show that Cloudless-Training can support general ML training in a geo-distributed way and greatly improves resource utilization (e.g., 9.2%-24.0% training cost reduction) and synchronization efficiency (e.g., up to 1.7 times training speedup over the baseline) with model correctness guarantees.
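
The abstract names the two synchronization strategies but does not spell out how they work. The sketch below illustrates the two generic ideas behind the names, under stated assumptions: gradients are accumulated locally for a fixed number of steps before a single update is sent across the WAN, and models held by different parameter servers are merged by element-wise averaging. The class and function names (`GradientAccumulator`, `apply_update`, `average_models`) and the fixed accumulation period are hypothetical and are not taken from the paper; the actual ASGD-GA and MA strategies may differ.

```python
from typing import List, Optional

Vector = List[float]


class GradientAccumulator:
    """Hypothetical helper: sums local gradients and releases the sum
    once every `period` steps, so fewer messages cross the WAN."""

    def __init__(self, dim: int, period: int) -> None:
        if period < 1:
            raise ValueError("period must be at least 1")
        self.period = period
        self.buffer: Vector = [0.0] * dim
        self.count = 0

    def add(self, grad: Vector) -> Optional[Vector]:
        # Accumulate the new gradient into the local buffer.
        for i, g in enumerate(grad):
            self.buffer[i] += g
        self.count += 1
        if self.count < self.period:
            return None  # nothing to send yet
        # Release the accumulated sum and reset the buffer.
        out = self.buffer
        self.buffer = [0.0] * len(out)
        self.count = 0
        return out


def apply_update(weights: Vector, grad_sum: Vector, lr: float) -> Vector:
    """Plain SGD step using an accumulated gradient (assumed update rule)."""
    return [w - lr * g for w, g in zip(weights, grad_sum)]


def average_models(models: List[Vector]) -> Vector:
    """Element-wise mean of the models held by several parameter servers."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]
```

With a period of 2, for example, a worker sends one accumulated update for every two local gradient computations, which halves the number of cross-cloud messages in this toy setting.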

Keywords

geo-distributed machine learning (ML) training / cross-cloud ML training / distributed training framework / serverless / cross-cloud model synchronization

Cite this article

谭文婷, 吕存驰, 史骁, 赵晓芳. Cloudless-Training: An Efficient Serverless-Based Framework for Geo-Distributed ML Training [J]. Chinese High Technology Letters (高技术通讯), 2024, 34(3): 219-232.

Funding

Supported by the National Key Research and Development Program of China (2021YFF0703800) and the Guanghe Fund, Category B (光合基金B类) (202302028357).

Journal: Chinese High Technology Letters (高技术通讯)

Indexing: OA, Peking University Core Journals (北大核心), CSTPCD

ISSN: 1002-0470
