Computer Engineering & Science, 2023, Vol. 45, Issue 12: 2135-2145, 11. DOI: 10.3969/j.issn.1007-130X.2023.12.005
Key techniques and practice on managing multi-site HPC clusters for university campus
Abstract
As demand for high-performance computing (HPC) services grows, external factors such as data center space and power supply capacity often constrain cluster expansion and upgrades, creating the need to build multi-site HPC clusters. A multi-site HPC cluster can break through the geographical limitations of a single cluster and provide more computing resources. Based on the practice of the SJTU-computing platform, this paper summarizes methods for the unified management of infrastructure and system software, as well as the high-availability design for remote disaster tolerance of the cluster, including: an adapted Slurm job scheduling system and Open OnDemand visual portal, extended high-availability capabilities for LDAP and other basic services, and a hierarchical aggregation monitoring system. Finally, the paper demonstrates the effectiveness of the multi-site supercomputing cluster solution along three dimensions: data transmission, user experience, and platform high availability.
Keywords
high performance computing / multi-site cluster / remote disaster recovery / multi-level federation monitor
Classification
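Information Technology and Security Science
Cite this article
ZHANG Tianyang, CHI Chengyue, GUO Wu, GAO Yiqin, WEN Minhua, WEI Jianwen. Key techniques and practice on managing multi-site HPC clusters for university campus[J]. Computer Engineering & Science, 2023, 45(12): 2135-2145, 11.
Funding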
National Key Basic Research Program of China (2018YFA0404600, 2018YFA0404603)