Frontiers of Data and Computing (数据与计算发展前沿), 2023, Vol. 5, Issue 6: 94-103 (10 pages). DOI: 10.11871/jfdc.issn.2096-742X.2023.06.009
Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis (基于特征分析的HPC失败作业的检测和根因分析)
Wei Ting (危婷) 1, Peng Liang (彭亮) 1, Niu Tie (牛铁) 1, Zhang Honghai (张宏海) 1
Author information
- 1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083
Abstract
[Background] In high-performance computing systems, detecting computing job failures and their causes earlier and faster helps users shorten error correction time and use expensive computing resources more effectively. [Objective] To provide early warning of computing job anomalies, quickly locate the root causes of job failures, and improve user experience, [Methods] this paper analyzes the relationship between running features and the success or failure of computing jobs for specific applications, based on the monitoring data of a very large supercomputing cluster. The Isolation Forest algorithm is used to detect anomalies in the running state of the computing node where a job is running and to predict job failure. Through feature analysis, logs, and other fault data, a root cause map of HPC job failure can be constructed. [Results] Numerical analysis shows that the Isolation Forest algorithm can predict job failures accurately. The root cause map constructed from application feature association analysis integrates all the factors that influence job execution and resource usage, and shows the relationships among them. [Conclusions] The research in this paper can help managers and users of high-performance computing systems, especially ultra-large supercomputing systems, find job anomalies as early as possible and quickly locate problems, which is of great significance for reducing the waste of computing resources and improving computing efficiency.
Key words
HPC/feature analysis/machine learning/Isolation Forest/detection/root cause analysis
Cite this article
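Wei Ting, Peng Liang, Niu Tie, Zhang Honghai. Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis [J]. Frontiers of Data and Computing, 2023, 5(6): 94-103 (10 pages).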
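
The abstract names Isolation Forest as the method for spotting abnormal compute-node states. The following is a minimal pure-Python sketch of that general technique, not the authors' implementation. The two features (CPU and memory utilization), the synthetic data, the tree count, and the subsample size of 256 are all assumptions made for illustration; the abstract does not state which monitoring features or parameters the paper uses.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Split rows on a random feature at a random threshold until isolated."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    if low == high:  # simplification: stop if the chosen feature is constant
        return {"size": len(rows)}
    split = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(row, node, depth=0):
    """Depth at which row lands, plus c(size) for the unsplit leaf."""
    if "size" in node:
        return depth + average_path_length(node["size"])
    branch = "left" if row[node["feature"]] < node["split"] else "right"
    return path_length(row, node[branch], depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Grow n_trees isolation trees, each on a random subsample of rows."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(max(sample_size, 2)))
    trees = [
        build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, sample_size


def anomaly_score(row, trees, sample_size):
    """Score in (0, 1); shorter average paths give scores closer to 1."""
    mean_depth = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / average_path_length(sample_size))


# Synthetic node metrics (assumed): (CPU utilization %, memory utilization %).
data_rng = random.Random(42)
rows = [(data_rng.gauss(80, 5), data_rng.gauss(50, 5)) for _ in range(200)]
rows.append((5.0, 98.0))  # hypothetical stalled node: idle CPU, memory nearly full

trees, sample_size = fit_forest(rows)
scores = [anomaly_score(r, trees, sample_size) for r in rows]
suspect = max(range(len(rows)), key=scores.__getitem__)
print("most anomalous row:", suspect, rows[suspect])
```

A point that random splits isolate after only a few cuts gets a short average path and therefore a score near 1. In a monitoring setting, node samples whose score exceeds a chosen threshold would be flagged for follow-up; the threshold itself would need tuning on real data.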