
Accelerated method for training large models based on a 2D tensor parallel strategy


智能系统学报, 2025, Vol. 20, Issue (5): 1256-1265. DOI: 10.11992/tis.202411023


朱仕通¹, 董琦¹

Author information

1. 中国电子科学研究院, Beijing 100043, China

Abstract

Recent advancements in language modeling have shown that large pretrained models based on the Transformer architecture exhibit exceptional performance in natural language processing applications. However, training large language models (LLMs) poses a considerable challenge because of the limited memory capacity of GPUs. Traditional tensor parallelism requires a single GPU to store all activation values, making it difficult to overcome memory bottlenecks. To relieve the GPU memory constraint on LLM training and improve training efficiency, this paper proposes a two-dimensional tensor parallelism method (TP2D). TP2D partitions both the input data and the parameter matrices across multiple GPUs and leverages distributed communication to provide high-speed data exchange between GPUs. This approach enables true distributed parallel training and alleviates memory constraints. GPT-2 was used as a benchmark model to evaluate the soft scaling efficiency and training efficiency of the two training methods. Experimental results show that, with four GPUs, the training speed of 2D tensor parallelism is 1.84 times that of tensor parallelism, with a soft scaling efficiency of 86% and reduced memory consumption.
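
As an informal illustration of the partitioning scheme summarized above, the following single-process NumPy sketch simulates a 2x2 logical GPU grid: the input and weight matrices of one linear layer are split into blocks, each block conceptually resides on one GPU, and the per-block partial products are accumulated in the SUMMA-style pattern that 2D tensor parallelism is built on. Everything here (the helper split_blocks, the 2x2 grid, the toy shapes) is an assumption made for illustration, not the paper's implementation; in an actual multi-GPU run the accumulation is carried out with collective communication (row/column broadcasts and reductions) rather than local indexing.

import numpy as np

# Minimal single-process sketch of 2D tensor parallelism for one linear layer
# Y = X @ W on a 2x2 logical device grid. NumPy arrays stand in for the
# per-GPU shards; plain indexing emulates the row/column broadcasts and the
# reduction that a real implementation performs with collective communication.

P = 2                                    # the grid has P x P "devices"
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))          # input activations, shape (batch, d_in)
W = rng.standard_normal((6, 4))          # weight matrix, shape (d_in, d_out)

def split_blocks(M, p):
    # Split matrix M into a p x p grid of blocks; block (i, j) would live on GPU (i, j).
    rows = np.array_split(M, p, axis=0)
    return [np.array_split(r, p, axis=1) for r in rows]

X_blk = split_blocks(X, P)               # X_blk[i][k] is held by device (i, k)
W_blk = split_blocks(W, P)               # W_blk[k][j] is held by device (k, j)

# SUMMA-style multiply: device (i, j) accumulates sum_k X_blk[i][k] @ W_blk[k][j].
# At step k, X_blk[i][k] would be broadcast along grid row i and W_blk[k][j]
# along grid column j; here local indexing plays the role of those broadcasts.
Y_blk = [[sum(X_blk[i][k] @ W_blk[k][j] for k in range(P)) for j in range(P)]
         for i in range(P)]

Y = np.block(Y_blk)                      # gather the output shards
assert np.allclose(Y, X @ W)             # matches the unpartitioned computation
print("2D-partitioned result shape:", Y.shape)

Because each device holds only one block of the weights and one block of the activations, peak per-GPU memory shrinks with the grid size, which is the property the abstract contrasts with one-dimensional tensor parallelism, where a single GPU must keep all activation values.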

Keywords

Transformer / tensor parallelism / attention mechanism / natural language processing / artificial intelligence / pretraining / distributed training / distributed communication

Classification

Information technology and security science

Cite this article

朱仕通, 董琦. Accelerated method for training large models based on a 2D tensor parallel strategy[J]. 智能系统学报, 2025, 20(5): 1256-1265.

智能系统学报 · Open access (OA) · 北大核心 (Peking University Core Journal) · ISSN 1673-4785
