Abstract
[Objective] Compared with Electronic Packet Switching (EPS), Optical Circuit Switching (OCS) offers advantages in latency, power consumption, cost, and stability. This study explores feasible applications of OCS in networking for training tasks by analyzing the parallel partitioning strategies, collective communication requirements, traffic patterns, and current network architectures of large model pretraining, so as to fully leverage the benefits of OCS. [Methods] We propose a redundancy protection mechanism for network devices that uses multiple small-port OCS devices, enabling rapid switchover without interrupting training tasks when a Top-of-Rack (ToR) switch fails. In addition, we advocate dedicating OCS exclusively to data parallelism, so that it needs to be configured only at the start of a task. [Results] We present several feasible opto-electronic networking architectures and their specific configurations under different AllReduce algorithms, including joint optimization of collective communication algorithms and architectural design to achieve optimal bandwidth. [Conclusion] By adequately incorporating the traffic models of training tasks, OCS can be integrated seamlessly into existing EPS network architectures and can improve large model pretraining in terms of cost, power consumption, latency, and stability.
Keywords
optical circuit switching (OCS) / reconfigurable / opto-electronic hybrid network architecture / large model pretraining / collective communication / parallel training
Classification
Information Technology and Security Science