
Design and Optimization of Small-Batch Matrix Multiplication Based on Matrix Core
(基于Matrix Core的小尺寸批量矩阵乘法设计与优化)

陆璐¹ 赵容¹ 梁志宏² 索思亮²

华南理工大学学报(自然科学版) [Journal of South China University of Technology (Natural Science Edition)], 2025, 53(9): 48-58, 11. DOI: 10.12141/j.issn.1000-565X.240498

Author Information

  • 1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  • 2. China Southern Power Grid Science Research Institute Co., Ltd. / Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510623, Guangdong, China

Abstract

General Matrix Multiplication (GEMM) is one of the most important operations in linear algebra, serving as the backbone for numerous applications in machine learning, scientific computing, and signal processing. In particular, FP16 batch GEMM has become a core operation in deep learning frameworks due to its efficiency in training and inference. However, current implementations on AMD GPUs (e.g., CDNA/MI200 architectures with Matrix Cores) suffer from suboptimal memory access and low compute utilization, limiting performance in high-throughput scenarios. Therefore, this paper proposes a GPU optimization scheme for half-precision batch GEMM (HGEMM). In terms of blocking strategy, it allocates equal memory-access and computational loads to threads based on input matrix sizes, while enabling each thread to compute multiple matrix multiplications to improve arithmetic-unit utilization. For memory-access optimization, it trades redundant data reads for uniform memory-access patterns per thread to facilitate compiler optimization, ensuring that memory time overlaps with computation time. For extremely small-batch HGEMM with matrix dimensions smaller than 16, the proposed method employs a 4×4×4 Matrix Core operation and a corresponding tiling scheme to enhance memory performance while reducing computational resource wastage, and provides the option of whether to use shared memory to achieve the highest performance. This paper compares the performance of this scheme with two operators of rocBLAS on the AMD GPU MI210 platform. The results show that the average performance of this scheme on the AMD GPU MI210 is 4.14 times that of rocBLAS HGEMMBatched and 4.96 times that of rocBLAS GEMMExBatched. For extremely small-batch HGEMM, the average performance is 18.60 times that of rocBLAS HGEMMBatched and 14.02 times that of rocBLAS GEMMExBatched.
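Since the abstract hinges on the 4×4×4 Matrix Core operation, a brief illustration may help. On CDNA GPUs such as the MI210, that operation is exposed to HIP code through the compiler intrinsic __builtin_amdgcn_mfma_f32_4x4x4f16: a single issue per 64-lane wavefront computes 16 independent 4×4×4 products, one per 4-lane MFMA block, which is what makes it a natural fit for extremely small-batch HGEMM. The sketch below is not the authors' kernel but a minimal sketch of that idea; the kernel name is hypothetical, shared memory is deliberately not used (one of the two configurations the paper evaluates), and the lane-to-element mapping is assumed from the common MFMA block layout and should be verified against the CDNA2 ISA reference.

#include <hip/hip_runtime.h>

typedef __fp16 half4_t __attribute__((ext_vector_type(4)));
typedef float  floatx4 __attribute__((ext_vector_type(4)));

// One 64-lane wavefront issues a single v_mfma_f32_4x4x4f16, computing 16
// independent 4x4 (K = 4) FP16 products, one per MFMA block of 4 lanes.
// Matrices are row-major, 16 halves apart; batch is assumed to be a
// multiple of 16 so every wavefront is fully populated.
__global__ void hgemm_batched_4x4(const __fp16* A, const __fp16* B,
                                  __fp16* C, int batch) {
  const int lane = threadIdx.x & 63;                  // lane within wavefront
  const int wave = (blockIdx.x * blockDim.x + threadIdx.x) >> 6;
  const int blk  = lane >> 2;                         // MFMA block 0..15
  const int t    = lane & 3;                          // row of A / column of B
  const int mat  = wave * 16 + blk;                   // matrix owned by this block
  if (wave * 16 >= batch) return;                     // whole wave exits together

  const __fp16* a = A + mat * 16;
  const __fp16* b = B + mat * 16;

  // Assumed input mapping: each lane supplies row t of A (contiguous in K)
  // and column t of B (strided in K).
  half4_t va, vb;
  for (int k = 0; k < 4; ++k) {
    va[k] = a[t * 4 + k];
    vb[k] = b[k * 4 + t];
  }

  floatx4 acc = {0.f, 0.f, 0.f, 0.f};
  acc = __builtin_amdgcn_mfma_f32_4x4x4f16(va, vb, acc, 0, 0, 0);

  // Assumed output mapping: accumulator register i holds C[i][t] of this
  // lane's block. MFMA accumulates in FP32, so narrow to FP16 on store.
  __fp16* c = C + mat * 16;
  for (int i = 0; i < 4; ++i)
    c[i * 4 + t] = (__fp16)acc[i];
}

With 256-thread blocks (four wavefronts each), a launch such as hgemm_batched_4x4<<<batch / 64, 256>>>(dA, dB, dC, batch) covers a batch that is a multiple of 64; compile with hipcc --offload-arch=gfx90a for the MI210. The paper's actual design goes further: it sizes the tiling so each thread computes several matrix products and selects whether staging operands through shared memory pays off.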

Key words

graphics processing unit / Matrix Core / matrix multiplication / memory access optimization

Classification

Information Technology and Security Science

Cite This Article

陆璐, 赵容, 梁志宏, 索思亮. 基于Matrix Core的小尺寸批量矩阵乘法设计与优化 (Design and Optimization of Small-Batch Matrix Multiplication Based on Matrix Core) [J]. 华南理工大学学报(自然科学版), 2025, 53(9): 48-58, 11.

Funding

Supported by the Natural Science Foundation of Guangdong Province (2024A1515010204) and a China Southern Power Grid Science Research Institute project (1500002024030103XA00063)

华南理工大学学报(自然科学版) · Open Access · Peking University Core Journal · ISSN 1000-565X