Modern Electronics Technique (现代电子技术), 2013, Vol. 36, Issue 4: 80-84, 5.
Performance optimization of matrix multiplication and FFT in GPU
Abstract
Performance optimization techniques for GPU programs are investigated in order to derive general methods for designing high-performance many-core GPU programs. The authors' experience in improving the performance of two key algorithms using CUDA is discussed: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. For SGEMM, a peak speed of 393 Gflops was achieved on an NVIDIA GeForce GTX280 GPU, about 5% faster than the CUBLAS 2.0 library. Better FFT performance was obtained for a range of dimensions. Some common principles for the design and implementation of many-core algorithms are discussed.
Keywords
GPU programming/matrix multiplication/FFT/performance optimization technique
Classification
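Information technology and security science
Cite this article
Li Xiaowen, Cui Xiang. Performance optimization of matrix multiplication and FFT in GPU [J]. Modern Electronics Technique, 2013, 36(4): 80-84, 5.
Funding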
National High Technology Research and Development Program of China (863 Program) (2012AA010902)
National Natural Science Foundation of China (61240045, 10571178)