Key techniques and practice on managing multi-site HPC clusters for university campus

Zhang Tianyang (张天阳), Chi Chengyue (池成悦), Guo Wu (郭武), Gao Yiqin (高亦沁), Wen Minhua (文敏华), Wei Jianwen (韦建文)

Computer Engineering & Science (计算机工程与科学), 2023, Vol. 45, Issue 12: 2135-2145 (11 pages). DOI: 10.3969/j.issn.1007-130X.2023.12.005

Author information

  • 1. Network & Information Center, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

With the growth and expansion of high-performance computing services, external factors such as data center space and power supply capacity often constrain cluster expansion and upgrades, creating a need for multi-site high-performance computing (HPC) clusters. A multi-site HPC cluster can break through the geographical limitations of a single cluster and provide more computing resources. Based on the practice of the SJTU-computing platform, this paper summarizes unified management methods for infrastructure and system software, as well as a high-availability design for remote disaster tolerance of the cluster, including: adapting the Slurm job scheduling system and the Open OnDemand visual portal, extending high-availability capabilities for LDAP and other basic services, and building a hierarchical aggregation monitoring system. Finally, this paper demonstrates the effectiveness of the remote supercomputing cluster solution along three dimensions: data transmission, user experience, and platform high availability.

Key words

high performance computing/multi-site cluster/remote disaster recovery/multi-level federation monitor

Classification

Information Technology and Security Science

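Cite this article

Zhang Tianyang, Chi Chengyue, Guo Wu, Gao Yiqin, Wen Minhua, Wei Jianwen. Key techniques and practice on managing multi-site HPC clusters for university campus [J]. Computer Engineering & Science, 2023, 45(12): 2135-2145.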

Funding

National Key Basic Research Development Program of China (2018YFA0404600, 2018YFA0404603)

Journal: Computer Engineering & Science (计算机工程与科学)

Indexing: OA/Peking University Core Journals/CSCD/CSTPCD

ISSN: 1007-130X
