计算机应用研究 (Application Research of Computers), 2024, Vol. 41, Issue 4: 1015-1021, 7. DOI: 10.19734/j.issn.1001-3695.2023.08.0395
Lockup fault monitoring and diagnosis based on Sunway NMII
Abstract
The non-maskable inter-processor interrupt (NMII) of the domestic Sunway processor must be initiated by another core, which makes it difficult to apply the general lockup fault monitoring algorithm of Linux. In severe cases, this jeopardizes data processing in critical areas. To solve this problem, this paper designs a lockup fault monitoring and diagnosis system for the Sunway architecture. The system sends NMII requests through a chain structure and combines timer events with a kernel thread to check lockup timestamps, achieving soft lockup and hard lockup monitoring of a single core. Building on a fault tolerance mechanism, it adopts a master-slave structure to monitor the state of all cores; when the master core fails, the system applies fault tolerance measures and migrates the master core, achieving multi-core lockup monitoring. It also designs an NMII-based task model that outputs diagnostic information for the faulty cores and extends the application scenarios of NMII. Test results show that the proposed algorithm accurately detects lockup faults and diagnoses them in real time under both low and high fault risk, meeting the reliability and real-time requirements of lockup fault monitoring and diagnosis on the Sunway platform.
Keywords
Sunway processor / non-maskable interrupt (NMI) / operating system / lockup / fault diagnosis / watchdog
Classification
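Information Technology and Security Science
Cite this article
郜晨 (Gao Chen), 何升 (He Sheng), 杭骁骞 (Hang Xiaoqian). Lockup fault monitoring and diagnosis based on Sunway NMII [J]. 计算机应用研究 (Application Research of Computers), 2024, 41(4): 1015-1021, 7.
Funding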
Key Support Project of the Ministry of Science and Technology (GG20210701)