Research on Heterogeneous Computing Scheduling Strategy for Kubeflow
Indexing: Open Access (OA); Peking University Core Journals (北大核心); CSTPCD
Kubeflow combines machine learning with cloud computing technology, integrating a large number of machine learning tools and providing a feasible path to production-grade machine learning platforms. Machine learning typically relies on dedicated processors such as Graphics Processing Units (GPUs) to accelerate training and inference. As cloud computing clusters are resized dynamically, nodes with different computing architectures can flexibly join or leave the cluster, and the traditional round-robin scheduling strategy can no longer handle heterogeneous computing power scheduling under such dynamic adjustment. To optimize the allocation of heterogeneous computing power on the Kubeflow platform, improve resource utilization, and achieve load balancing, a cloud-based Central Processing Unit (CPU)-GPU heterogeneous computing power scheduling strategy is proposed. The strategy uses two quantified decision indicators, load-balance degree and priority, allocates GPU memory at fine granularity, and mounts the computing resources to the corresponding Pods, thereby achieving fine-grained scheduling of computing power. A resource weight matrix is designed from the computing resources of each cluster node, and an improved genetic algorithm is used to obtain the optimal Pod deployment plan, guaranteeing the execution of multiple tasks. Experimental results show that the proposed strategy supports parallel tasks well; when resource requests overflow, tasks are scheduled and executed by priority while an optimal load distribution is achieved. Compared with the platform's native strategy, resource granularity is refined by an order of magnitude, and cluster load balancing is also significantly improved.
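The mechanism summarized above (a per-node resource weight matrix and an improved genetic algorithm that searches for a Pod placement balancing load and respecting priority) is described here only at the abstract level. The sketch below is a minimal, self-contained illustration of that general approach, not the paper's implementation: the node capacities, Pod requests, fitness weighting, and GA operators (truncation selection, one-point crossover, random-reset mutation) are all assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation): a genetic algorithm that
# assigns Pods to heterogeneous nodes so that CPU/GPU-memory utilization stays
# balanced. Node capacities, Pod requests, and fitness weights are illustrative.
import random

# Per-node capacities: (CPU cores, GPU memory in MiB); heterogeneous cluster (assumed).
NODES = [(16, 0), (32, 16384), (8, 32768)]
# Per-Pod requests: (CPU cores, GPU memory in MiB, priority 1..10) (assumed).
PODS = [(2, 2048, 5), (4, 4096, 8), (1, 1024, 3), (8, 8192, 9), (2, 0, 1)]

POP_SIZE, GENERATIONS, MUTATION_RATE = 40, 200, 0.1


def fitness(assignment):
    """Higher is better: feasible, balanced placement of high-priority Pods."""
    cpu_used = [0.0] * len(NODES)
    gpu_used = [0.0] * len(NODES)
    score = 0.0
    for (cpu, gpu, prio), node in zip(PODS, assignment):
        cpu_cap, gpu_cap = NODES[node]
        if cpu_used[node] + cpu > cpu_cap or gpu_used[node] + gpu > gpu_cap:
            score -= prio          # infeasible placement: penalty weighted by priority
            continue
        cpu_used[node] += cpu
        gpu_used[node] += gpu
        score += prio              # reward placing high-priority Pods
    # Load-balance term: penalize the spread of CPU utilization across nodes.
    utils = [u / c for u, (c, _) in zip(cpu_used, NODES)]
    mean = sum(utils) / len(utils)
    variance = sum((u - mean) ** 2 for u in utils) / len(utils)
    return score - 10.0 * variance


def evolve():
    # Chromosome: list mapping each Pod index to a node index.
    pop = [[random.randrange(len(NODES)) for _ in PODS] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: POP_SIZE // 2]               # truncation selection
        children = []
        while len(survivors) + len(children) < POP_SIZE:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(PODS))       # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < MUTATION_RATE:        # random-reset mutation
                child[random.randrange(len(PODS))] = random.randrange(len(NODES))
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print("Pod -> node assignment:", best, "fitness:", fitness(best))
```

In this toy fitness function, the priority terms reward feasible placement of high-priority Pods and the variance term penalizes uneven CPU utilization, which is one simple way to quantify a load-balance degree; the paper's actual indicators and resource weight matrix may differ.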
孙毅;王会梅;鲜明;向航
College of Electronic Science and Technology, National University of Defense Technology, Changsha 410000, Hunan, China
Computer and Automation
cloud computing; machine learning; heterogeneous computing power; resource scheduling; genetic algorithm
《计算机工程》 (Computer Engineering), 2024, No. 2
pp. 25-32 (8 pages)
Supported by a national ministry-level fund.