Computer Engineering & Science (计算机工程与科学), 2025, Vol. 47, Issue 9: 1535-1543, 9. DOI: 10.3969/j.issn.1007-130X.2025.09.002
基于多源日志语义分析的异构超算平台作业故障识别方法
A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs
Abstract
This paper presents a method for detecting job anomalies on large-scale, distributed, heterogeneous HPC platforms. Analyzing job runtime logs is vital for detecting anomalies, but the sheer volume of logs exceeds what humans can review manually. To address this, we introduce a multi-source log semantic analysis approach that uses latent Dirichlet allocation (LDA) to analyze logs from various sources. By modeling how topics evolve over time and matching that evolution against the patterns of historical faulty jobs, the method predicts anomalies. Experiments on a domestic HPC platform show 95.2% precision. The method improves predictive capability and helps users and administrators diagnose issues quickly, thereby improving the availability and efficiency of the HPC environment.
Keywords
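data processing / fault identification / hybrid heterogeneity / semantic analysis / latent Dirichlet allocation (LDA)
Classification
Information technology and security science
Cite this article
Hu He (胡鹤), Zhao Yi (赵毅), Gu Beibei (顾蓓蓓), Zhao Yunqing (赵芸卿). A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs [J]. Computer Engineering & Science, 2025, 47(9): 1535-1543, 9.
Funding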
National Natural Science Foundation of China (62372428)
2024 Chinese Academy of Sciences Computing Power Infrastructure Operation, Maintenance and Service Project (CAS-WX2024YW-0102)