
Network Scheduling Techniques for Multi-Job Training
Abstract: In the context of Deep Learning Training (DLT) workloads in Alibaba Cloud's production environment, severe communication contention is observed among multiple concurrent DLT jobs, which significantly degrades the overall GPU utilization of the training cluster. To address this fundamental issue, Crux is developed: a communication scheduling method designed to maximize GPU utilization by mitigating inter-job communication contention. The core idea of Crux is to reformulate the objective of maximizing GPU utilization into the problem of meeting each DLT job's demand for GPU compute intensity. Accordingly, Crux employs a flow scheduling algorithm that prioritizes DLT jobs with higher GPU intensity, thereby minimizing potential communication contention. Large-scale experiments show that, compared to schedulers such as Sincronia, CASSINI, and TACCL, Crux improves GPU utilization by up to 23%, significantly outperforming these counterparts. Crux has been deployed in an industrial-scale large-model training cluster, where it performs DLT job scheduling.
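The prioritization idea described in the abstract can be illustrated with a minimal sketch: order the flows contending on each link by the owning job's GPU intensity, so that high-intensity jobs finish their communication (and resume computing on their GPUs) first. This is an illustrative toy model only; the `Job` and `schedule` names, the scalar `gpu_intensity` metric, and the strict per-link FIFO priority are assumptions for demonstration, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    gpu_intensity: float          # assumed metric: GPUs idled per unit of communication time
    flows: list = field(default_factory=list)  # list of (link_id, size_bytes)

def schedule(jobs, bandwidth_bps):
    """Serve each contended link's flows in descending order of job GPU
    intensity, and return each flow's completion time in seconds."""
    per_link = {}
    # Enqueue flows link by link, highest-intensity job first.
    for job in sorted(jobs, key=lambda j: j.gpu_intensity, reverse=True):
        for link, size in job.flows:
            per_link.setdefault(link, []).append((job.name, size))
    finish = {}
    # Under strict priority, a flow completes after all flows ahead of it.
    for link, flows in per_link.items():
        t = 0.0
        for name, size in flows:
            t += size * 8 / bandwidth_bps
            finish[(name, link)] = t
    return finish
```

With two jobs each sending 1 Gb over a shared 1 Gbps link, the higher-intensity job's flow completes at 1.0 s and the lower-intensity job's at 2.0 s, so the GPUs of the high-intensity job stall for the shortest time.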
Authors: Cao Jiamin; Guan Yu; Zhai Ennan (Alibaba Cloud Computing Co., Ltd., Hangzhou 310000, China)
Source: Information and Communications Technologies, 2025, No. 5, pp. 17-23 (7 pages)
Keywords: Deep Learning; Model Training; Communication Contention; Communication Scheduling; Utilization Optimization; Collective Communication; Path Selection