Abstract
In the context of Deep Learning Training (DLT) within Alibaba Cloud's production environment, we observe severe communication contention among multiple DLT jobs, which significantly degrades the overall GPU utilization of the training cluster. To address this fundamental issue, we develop Crux, a communication scheduling system designed to maximize GPU utilization by mitigating inter-job communication contention. The core idea of Crux is to reformulate the objective of maximizing GPU utilization into the problem of meeting each DLT job's demand for GPU compute intensity. Accordingly, Crux employs a flow scheduling algorithm that prioritizes DLT jobs with higher GPU intensity, thereby minimizing potential communication contention. Extensive experiments show that, compared with schedulers such as Sincronia, CASSINI, and TACCL, Crux improves GPU utilization by up to 23%, significantly outperforming these counterparts. Crux has been deployed in industrial-scale model training clusters for DLT job scheduling.
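The abstract's core idea, giving network priority to flows from jobs with higher GPU intensity so that fewer GPU-seconds are wasted waiting on communication, can be illustrated with a toy sketch. This is not the actual Crux algorithm: the `Flow` fields, the single shared link, and the sequential-transfer cost model are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    job: str
    size_gb: float          # data to transfer, in gigabytes
    gpu_intensity: float    # GPUs that sit idle while this flow is in flight

def schedule(flows):
    """Order flows so that higher-GPU-intensity jobs transmit first,
    reducing the time their (more numerous) GPUs wait on communication."""
    return sorted(flows, key=lambda f: f.gpu_intensity, reverse=True)

def gpu_idle_cost(order, bandwidth_gbps=100.0):
    """GPU-seconds wasted if flows share one link back to back:
    each job's GPUs idle until its flow completes."""
    t, cost = 0.0, 0.0
    for f in order:
        t += f.size_gb * 8 / bandwidth_gbps   # transfer time on the shared link
        cost += f.gpu_intensity * t           # job's GPUs idle until time t
    return cost

# A small job (8 GPUs) and a large job (512 GPUs) with equal-sized flows:
flows = [Flow("small", 10.0, 8.0), Flow("large", 10.0, 512.0)]
# FIFO order wastes far more GPU-seconds than intensity-priority order.
print(gpu_idle_cost(flows), gpu_idle_cost(schedule(flows)))
```

Under this toy model, letting the 512-GPU job's flow go first roughly halves the total GPU idle time, which is the intuition behind prioritizing by GPU intensity rather than, say, by flow size alone.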
Authors
Cao Jiamin; Guan Yu; Zhai Ennan (Alibaba Cloud Computing Co., Ltd., Hangzhou 310000, China)
Source
Information and Communications Technologies (《信息通信技术》), 2025, Issue 5, pp. 17-23 (7 pages)
Keywords
Deep Learning
Model Training
Communication Contention
Communication Scheduling
Utilization Optimization
Collective Communication
Path Selection