
面向SW26010P的异形矩阵乘法众核并行优化技术研究


Computer Engineering and Applications (计算机工程与应用), 2025, Vol. 61, Issue 6: 150-163 (14 pages). DOI: 10.3778/j.issn.1002-8331.2405-0142


Performance Optimization Techniques of Irregular-Shaped Matrix Multiplication on SW26010P

胡怡 (Hu Yi) 1, 陈道琨 (Chen Daokun) 1, 杨超 (Yang Chao) 2

Author information

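  • 1. School of Mathematical Sciences, Peking University, Beijing 100871; Advanced Computing Research Center, Changsha Institute for Computing and Digital Economy, Peking University, Changsha 410205; Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190
  • 2. School of Mathematical Sciences, Peking University, Beijing 100871; Advanced Computing Research Center, Changsha Institute for Computing and Digital Economy, Peking University, Changsha 410205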

Abstract

Matrix multiplication is widely used in scientific and engineering computing and is the most important optimization target in BLAS. With the development of artificial neural networks, computational fluid dynamics, and other fields, irregular-shaped matrix multiplication is rapidly gaining attention. This paper proposes parallelization techniques for irregular-shaped matrix multiplication on SW26010P, a domestic many-core processor deployed in the new-generation Sunway supercomputer. Specifically, a parallel algorithm with diversified task partition mapping is designed, based on the hardware characteristics and the data layout of the matrix elements, to improve memory access bandwidth utilization. In addition, based on the hardware pipelines and the vectorized computation and data access instructions, the key computations are abstracted and the corresponding low-level compilation optimizations are performed to improve computational efficiency. A data-sharing strategy based on the RMA point-to-point communication mechanism is adopted to further reduce the overhead of data access and transmission, and nested double buffering is used to further improve performance. A series of experiments on SW26010P is also conducted to determine the optimal number of blocks for the parallel computation of different kinds of functions, so as to make full use of the hardware platform. The experimental results show that the irregular-shaped matrix multiplication optimized in this paper reaches up to 93% of the theoretical performance upper bound. Compared with the massive GEMM algorithm implementation, the average speedup of the irregular-shaped matrix multiplication is 5.43 times, and the best speedup reaches 51.5 times.
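The idea of shape-dependent task partitioning mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `choose_grid`, the default of 64 workers, and the cost model (each worker loads one row panel of A and one column panel of B) are all assumptions made purely for illustration.

```python
def choose_grid(m, n, k, workers=64):
    """Pick a (rows, cols) worker grid for C[m, n] = A[m, k] @ B[k, n].

    Hypothetical cost model (an assumption, not taken from the paper):
    each worker loads a (m / rows) x k panel of A and a k x (n / cols)
    panel of B, and we minimize that per-worker load.
    """
    best = None
    for rows in range(1, workers + 1):
        if workers % rows:
            continue  # only consider grids that use every worker
        cols = workers // rows
        if rows > m or cols > n:
            continue  # this grid would leave some workers without data
        tile_m = -(-m // rows)  # ceiling division
        tile_n = -(-n // cols)
        traffic = (tile_m + tile_n) * k  # elements loaded per worker
        if best is None or traffic < best[0]:
            best = (traffic, rows, cols)
    if best is None:
        raise ValueError("matrix is too small for this worker count")
    return best[1], best[2]


if __name__ == "__main__":
    print(choose_grid(4096, 4096, 4096))  # square case
    print(choose_grid(65536, 16, 1024))   # tall-and-skinny case
    print(choose_grid(16, 65536, 1024))   # short-and-wide case
```

Under this toy model a square problem favors a balanced grid, while tall-and-skinny or short-and-wide problems favor a one-dimensional split along the long dimension, which is one reason a single fixed partition tends to suit irregular shapes poorly.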

Key words

irregular-shaped matrix multiplication/Sunway 26010P many-core processor/diversified task partition mapping/RMA point-to-point mechanism/nested double buffering techniques

Classification

Information Technology and Security Science

Cite this article

胡怡, 陈道琨, 杨超. 面向SW26010P的异形矩阵乘法众核并行优化技术研究[J]. 计算机工程与应用, 2025, 61(6): 150-163.

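Computer Engineering and Applications (计算机工程与应用)

Open access; listed in the Peking University Core Journals index (北大核心)

ISSN 1002-8331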
