Abstract
With the explosive growth in the parameter counts of machine learning models and in the scale of training datasets, a single computing node can no longer meet the computational demands of large artificial intelligence (AI) models. Distributed machine learning systems have become the primary platform for model training, shortening training time through parallel training across tens of thousands of devices. Data parallelism is a widely used parallel framework for distributed training: it partitions the training data across computing nodes, which coordinate the training task through periodic parameter synchronization. Because the nodes must transmit a large amount of data to complete parameter synchronization before each iteration, communication becomes the key factor affecting computational efficiency. Classical parameter synchronization strategies suffer from either a large number of communication rounds or congestion on the receiver's link, while strategies based on in-network aggregation are constrained by the limited computing and storage capabilities of switches and by congestion at server output ports. To address these problems, this paper proposes a hybrid parameter synchronization strategy named PASSING (hybrid Parameter Synchronization Strategy with In-host and In-network Aggregation). PASSING first synchronizes model parameters locally, within a server or within a rack, and then sends the locally aggregated parameters to programmable switches, which complete the global parameter synchronization. This approach both ensures efficient communication among the small number of computing nodes within a host and reduces the computational and communication load on the switch side. We built a testbed from multi-GPU (Graphics Processing Unit) servers and programmable switches and deployed PASSING on it. Experimental results show that, compared with the traditional parameter server approach, PASSING improves training performance by up to 65.25%, effectively accelerating distributed training.
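The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`elementwise_sum`, `hybrid_sync`, `flat_sync`) and the toy gradients are hypothetical, the "switch" is simulated by an ordinary Python function, and plain averaging of gradients is assumed as the synchronization operation. The sketch only shows that aggregating inside each host first yields the same average as a flat aggregation, while the global stage receives one stream per host instead of one per worker.

```python
from typing import List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum equally sized vectors element by element."""
    return [sum(column) for column in zip(*vectors)]


def hybrid_sync(hosts: List[List[Vector]]) -> Vector:
    """Average gradients in two stages (hypothetical illustration)."""
    # Stage 1 (in-host): each host reduces its own workers' gradients.
    partial_sums = [elementwise_sum(workers) for workers in hosts]
    # Stage 2 (stand-in for the switch): combine one partial sum per host.
    global_sum = elementwise_sum(partial_sums)
    worker_count = sum(len(workers) for workers in hosts)
    return [value / worker_count for value in global_sum]


def flat_sync(hosts: List[List[Vector]]) -> Vector:
    """Baseline: every worker's gradient is aggregated in a single stage."""
    all_workers = [grad for workers in hosts for grad in workers]
    total = elementwise_sum(all_workers)
    return [value / len(all_workers) for value in total]


# Toy input: 2 hosts, 2 workers per host, gradients of length 2.
hosts = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

hybrid_result = hybrid_sync(hosts)
flat_result = flat_sync(hosts)

# Streams reaching the global stage: one per host vs. one per worker.
streams_hybrid = len(hosts)
streams_flat = sum(len(workers) for workers in hosts)
```

In this toy setup the global stage handles 2 incoming streams under the hybrid scheme and 4 under the flat one; how that reduction translates into training speed depends on hardware and workload details that the abstract does not specify.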
Authors
YU Xiao-shan (余晓杉)
GU Hua-xi (顾华玺)
ZHOU Zhao-xing (周肇星)
WANG Jia-kun (王佳昆)
School of Telecommunications Engineering, Xidian University, Xi’an, Shaanxi 710000, China
Source
Acta Electronica Sinica (《电子学报》)
PKU Core Journal (北大核心)
2025, No. 8, pp. 2636-2648 (13 pages)
Funding
National Key Research and Development Program of China (No. 2018YFE0202800).
Keywords
distributed training
data parallelism
parameter synchronization
in-network aggregation
hybrid parameter synchronization strategy