Dense linear solver on many-core CPUs: Characterization and optimization

Open access; Peking University Core Journal; CSTPCD

Dense linear solvers play a vital role in high-performance computing and machine learning. Typical parallel implementations are built on either the fork-join or the task-based programming model. Although mainstream dense linear algebra libraries that adopt the fork-join model shift most of the computation to highly tuned, high-performance BLAS 3 routines, the rigid execution flow of fork-join prevents them from using the compute resources of many-core CPUs efficiently. Open-source libraries that adopt the task-based model allow more flexible and better load-balanced algorithms and therefore deliver noticeably higher performance, yet they still leave considerable room for optimization on many-core CPU platforms, especially for medium-sized matrices. This paper presents a comprehensive performance characterization of the dense linear solver to locate its bottlenecks and proposes two optimization strategies. First, the LU factorization is overlapped with the subsequent lower triangular solve, which reduces synchronization overhead and thread idling and thereby improves parallelism. Second, redundant matrix packing operations are reduced to lower the memory access overhead. Performance is evaluated on two mainstream many-core CPU platforms, Intel® Xeon® Gold 6252N (48 cores) and HiSilicon Kunpeng 920 (64 cores). The results show that the optimized solver outperforms the best open-source implementation by up to 10.05% (Xeon) and 13.63% (Kunpeng 920) on the two platforms, respectively.
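The idea of merging the LU factorization with the following lower triangular solve can be illustrated algebraically with a small sequential sketch. This is not the paper's task-based, tiled implementation: the function name `solve_fused` and the augmented-matrix layout are assumptions made only for illustration, and no parallel scheduling or matrix packing is shown. The right-hand side is carried as an extra column, so the forward substitution is completed inside the same elimination sweep instead of waiting for the factorization to finish.

```python
def solve_fused(a, b):
    """Solve a x = b, fusing forward substitution into the LU sweep.

    Illustrative sketch only (hypothetical helper, sequential, unblocked).
    """
    n = len(a)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for k in range(n):
        # Partial pivoting: choose the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]

        for i in range(k + 1, n):
            l_ik = m[i][k] / m[k][k]
            # One pass updates the trailing matrix columns and the
            # right-hand-side column (index n): the lower triangular
            # solve advances together with the factorization.
            for j in range(k + 1, n + 1):
                m[i][j] -= l_ik * m[k][j]
            m[i][k] = l_ik  # keep the multiplier (entry of L) in place

    # Back substitution with the upper triangular factor U.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


if __name__ == "__main__":
    print(solve_fused([[2, 1, 1], [4, -6, 0], [-2, 7, 2]], [5, -2, 9]))
```

In a task-based setting, the same dependency structure would presumably let the right-hand-side update for a given step start as soon as that step's panel is factorized, rather than after a global synchronization point; how the paper schedules these tasks is not stated in the abstract.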

FU Xiao (付晓); SU Xing (苏醒); DONG Dezun (董德尊); QIAN Chengdong (钱程东)

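College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073
Phytium Technology Co., Ltd., Tianjin 300459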

Subject: Computer science and automation

Keywords: dense linear solver; LU factorization; fork-join model; task-based model; many-core CPU

Computer Engineering & Science (计算机工程与科学), 2024, No. 6

Pages 984-992 (9 pages)

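Funding: National Key Research and Development Program of China (2022YFB4501702); Hunan Provincial Natural Science Foundation, Distinguished Young Scholars Fund (2021JJ10050)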

DOI: 10.3969/j.issn.1007-130X.2024.06.005
