Application of Reconfigurable OCS Technology for Pre-training Large Language Models (Invited)
Open access; indexed in PKU Core (北大核心) and CSTPCD
[Objective] Compared with Electronic Packet Switching (EPS), all-optical circuit switching (OCS) offers advantages in latency, power consumption, cost, and stability. By analyzing the parallel partitioning strategies, collective communication requirements, traffic patterns, and current network architectures of large model pre-training, this paper discusses feasible ways to apply OCS in training networks so that training tasks can take full advantage of it. [Methods] For fast fault recovery, the paper proposes a redundancy protection mechanism for network devices built from multiple small-port-count OCS devices, which switches over quickly when a Top-of-Rack (ToR) switch fails, without interrupting the training task. It further proposes that OCS serve only Data Parallelism (DP) traffic and be configured only before a task starts. [Results] The paper presents several feasible opto-electronic networking architectures together with their specific configurations under different AllReduce algorithms, jointly optimizing the collective communication algorithm and the architecture design to achieve better bandwidth utilization. [Conclusion] Provided the traffic model of the training task is fully taken into account, OCS integrates well into existing EPS network architectures and improves large model pre-training in terms of cost, power consumption, latency, and stability.
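As an illustration of what "configuring the OCS once before the task, for DP traffic only" could look like under a ring AllReduce, the following minimal sketch builds a static port mapping that chains the DP ranks into a single ring. It is not taken from the paper: the function names, the one-port-per-rank model, and the port numbering are all assumptions made for illustration. The per-rank traffic formula is the standard ring AllReduce figure, 2(n-1)/n times the message size.

```python
def ring_crossconnects(num_ranks: int) -> dict[int, int]:
    """Hypothetical static OCS mapping: input port i (rank i's transmitter)
    is connected to output port (i + 1) % n (the next rank's receiver)."""
    if num_ranks < 2:
        raise ValueError("a ring needs at least two ranks")
    return {i: (i + 1) % num_ranks for i in range(num_ranks)}


def is_single_ring(mapping: dict[int, int]) -> bool:
    """Follow the circuits from port 0 and check that every port is
    visited exactly once before returning to port 0."""
    seen, port = set(), 0
    while port not in seen:
        seen.add(port)
        port = mapping[port]
    return port == 0 and len(seen) == len(mapping)


def ring_allreduce_bytes_per_rank(message_bytes: int, num_ranks: int) -> float:
    """Bytes each rank sends in a standard ring AllReduce
    (reduce-scatter followed by all-gather): 2 * (n - 1) / n * S."""
    return 2 * (num_ranks - 1) / num_ranks * message_bytes


if __name__ == "__main__":
    mapping = ring_crossconnects(4)
    print(mapping)
    print(is_single_ring(mapping))
    print(ring_allreduce_bytes_per_rank(800, 4))
```

Because the mapping is fixed for the lifetime of the job, the OCS reconfiguration time does not sit on the training critical path in this sketch; other AllReduce algorithms would need a different static mapping.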
Zhu Chen (朱宸); Zhou Xu (周谞); Wang Peilong (王佩龙)
System Department, Baidu Online Network Technology Co., Ltd., Beijing 100085, China
Electronic Information Engineering
OCS; reconfigurable; opto-electronic hybrid network architecture; large language model pre-training; collective communication; parallel training
Study on Optical Communications (《光通信研究》), 2024, No. 5
pp. 25-34 (10 pages)