环上多项式乘法在GPU上的优化实现OA北大核心CSTPCD
Optimal Implementation of Multiplication over Polynomial Rings on GPUs
作为格密码算法的核心组件,环上多项式乘法的效率和准确性对于格密码方案的实用性和安全性至关重要.NTT及KNTT等现有的环上多项式乘法算法具有较高的并行性,其在CPU上运行时很难完全发挥优势.这也意味着,很多基于CPU实现的环上多项式乘法算法的效率仍有很大的提升空间.针对这一问题,本文基于Zhu等人提出的KNTT算法,利用GPU的众核特性以及强大的并行计算能力,实现了高效的环上多项式乘法运算.同时,将GPU线程模型中的线程块与KNTT算法中拆分出的小次数多项式一一对应,使得每个线程块负责一个多项式的NTT并行运算.由于GPU中的多个线程块可以被同时调度开始计算任务,因此多项式之间也可以实现并行处理,这进一步提高了 KNTT算法在GPU上的实现效率.实验结果显示,GPU上实现的KNTT算法相较于NTT算法的GPU版本以及原始的CPU版本增速明显.在模多项式次数N为16384时,相对于原始C版本代码可以达到93.78%的增速.相较于GPU版本的NTT算法,在N=2048时,也可以达到40.62%的增速.
As a core component of lattice-based cryptography,the efficiency and accuracy of polyno-mial multiplication over polynomial rings are crucial for the practicality and security of lattice-based cryptographic schemes.Existing polynomial multiplication algorithms,such as NTT and KNTT,have high parallelism,however they are difficult to fully leverage on CPUs.This means that there is still much room for improvement in the efficiency of ring polynomial multiplication algorithms based on CPU implementations.To address this issue,this paper proposes an efficient implementation of ring polynomial multiplication based on the KNTT algorithm proposed by Zhu et al.This implementation leverages the massive parallel nature and powerful computing capabilities of GPUs.Specifically,each thread block in the GPU thread model is mapped to one of the small-degree polynomials generated by the KNTT algorithm,with each block responsible for the NTT parallel operation of a single polyno-mial.As multiple thread blocks in the GPU can be scheduled to start computing tasks simultaneously,parallel processing of multiple polynomials is enabled,which further improves the efficiency of the KNTT algorithm on the GPU.Experimental results show that the GPU implementation of the KNTT algorithm is significantly faster than both the GPU version of the NTT algorithm and the original CPU version.For example,when the modular polynomial length N is 16 384,the GPU implementation achieves a speedup of 93.78%compared to the original C version.Compared with the GPU version of the NTT algorithm,the GPU implementation of the KNTT algorithm can achieve a speedup of 40.62%when N=2048.
赵新颖;袁峰;赵臻;王保仓
西安电子科技大学空天地一体化综合业务网全国重点实验室,西安 710071中国航天科工集团第二研究院706所,北京 100854西安电子科技大学空天地一体化综合业务网全国重点实验室,西安 710071||河南省网络密码技术重点实验室,郑州 453499
计算机与自动化
格密码多项式乘法NTTKNTT
lattice-based cryptographypolynomial multiplicationNTTKNTT
《密码学报(中英文)》 2024 (004)
830-844 / 15
国家自然科学基金(62272362,U19B2021,62102299);河南省网络密码技术重点实验室研究课题(LNCT2022-A05);陕西高校青年创新团队National Natural Science Foundation of China(62272362,U19B2021,62102299);Open Fund of Henan Key Laboratory of Network Cryptography Technology(LNCT2022-A05);The Youth Innovation Team of Higher Edu-cations of Shaanxi Province
评论