
Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis

危婷 彭亮 牛铁 张宏海

Frontiers of Data & Computing (数据与计算发展前沿), 2023, Vol. 5, Issue (6): 94-103, 10. DOI: 10.11871/jfdc.issn.2096-742X.2023.06.009

Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis

危婷¹, 彭亮¹, 牛铁¹, 张宏海¹

Author information

  • 1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083

Abstract

[Background] In high-performance computing systems, detecting computing job failures and their causes earlier and faster helps users shorten error-correction time and use expensive computing resources more effectively. [Objective] This paper aims to provide early warning of computing job anomalies, quickly locate the root causes of job failures, and improve user experience. [Methods] Based on monitoring data from a very large supercomputing cluster, the paper analyzes the relationship between the running features of computing jobs and their success or failure for specific applications. The Isolation Forest algorithm is used to detect anomalies in the running state of the compute nodes on which a job runs and thereby predict job failure. Through feature analysis, logs, and other fault data, a root cause map of HPC job failures can be constructed. [Results] Numerical analysis of the algorithm shows that Isolation Forest can predict job failure accurately. The root cause map, built on application feature association analysis, integrates all the factors that influence job execution and resource usage and shows the relationships among them. [Conclusions] This research can help managers and users of high-performance computing systems, especially ultra-large supercomputing systems, find job anomalies as early as possible and locate problems quickly, which is of great significance for reducing wasted computing resources and improving computing efficiency.
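To make the detection step more concrete, the following is a minimal pure-Python sketch of the Isolation Forest idea applied to node-level monitoring samples. It is an illustration only and not the authors' implementation: the three metrics (CPU utilization, memory utilization, I/O wait), the synthetic data, and all function names and parameters (`fit_forest`, `anomaly_score`, `n_trees`, `sample_size`) are assumptions introduced here. The sketch also assumes at least two training rows.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Split rows on a random feature at a random threshold until isolated."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # a constant feature cannot be split
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(row, node):
    """Depth at which row lands, plus a correction for leaves left unsplit."""
    depth = 0
    while "feature" in node:
        go_left = row[node["feature"]] < node["split"]
        node = node["left"] if go_left else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of rows."""
    rng = random.Random(seed)
    n = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(n))
    forest = [build_tree(rng.sample(rows, n), 0, max_depth, rng)
              for _ in range(n_trees)]
    return forest, n


def anomaly_score(row, forest, n):
    """Score in (0, 1); values near 1 mean the row is isolated quickly."""
    mean_depth = sum(path_length(row, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_depth / c_factor(n))


# Synthetic node samples (hypothetical metrics): cpu_util, mem_util, io_wait.
data_rng = random.Random(1)
samples = [(data_rng.gauss(0.70, 0.05),
            data_rng.gauss(0.50, 0.05),
            data_rng.gauss(0.05, 0.01)) for _ in range(200)]
samples.append((0.05, 0.98, 0.90))  # one synthetic "unhealthy" sample

forest, n = fit_forest(samples)
scores = [anomaly_score(s, forest, n) for s in samples]
worst = max(range(len(samples)), key=scores.__getitem__)
print("most anomalous sample index:", worst, "score:", round(scores[worst], 3))
```

Samples that are separated from the rest after only a few random splits receive scores close to 1. In a setting like the one the abstract describes, such scores could be thresholded per node to raise an early warning for the job running there; the threshold and the choice of features would have to be tuned on real monitoring data.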

Key words

HPC/feature analysis/machine learning/Isolation Forest/detection/root cause analysis

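Cite this article

危婷, 彭亮, 牛铁, 张宏海. Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis [J]. Frontiers of Data & Computing (数据与计算发展前沿), 2023, 5(6): 94-103, 10.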

Frontiers of Data & Computing (数据与计算发展前沿)

OA / CSCD / CSTPCD

ISSN: 2096-742X
