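As the parameter counts of large language models (LLMs) continue to grow, fine-tuning models with tens of billions of parameters places extreme demands on compute and storage resources. Conventional distributed training schemes typically rely on large numbers of high-end GPUs and high-speed interconnects, which makes training very expensive. Existing single-GPU training schemes relieve GPU memory pressure through tensor offloading, but they still suffer from low I/O transfer efficiency and insufficient device utilization. Conventional kernel-mode I/O introduces frequent system calls and context switches during large-scale tensor migration and becomes a key performance bottleneck; meanwhile, optimizer computation cannot fully exploit the parallelism of multi-core CPUs and is hard to overlap effectively with GPU computation, which further limits system performance. To address these problems, this paper proposes HiTrain, a heterogeneous memory offloading and I/O optimization scheme for large-model training. First, it builds a high-performance tensor storage module based on the Storage Performance Development Kit (SPDK), which manages tensor data in user space to avoid the overhead of the kernel I/O stack, thereby improving the concurrency and throughput of tensor offloading. Second, it designs and implements a storage-compute pipeline scheduling module based on an asynchronous optimizer, which reorders optimizer execution to reduce GPU waiting time and improve overall training efficiency. Experimental results show that, on a server equipped with a single GPU and a non-volatile memory express solid state drive (NVMe SSD), the proposed scheme makes full use of the system's storage and compute resources: tensor offloading and loading efficiency during training improves by 32.7%, and overall training throughput rises to 1.49 times that of existing schemes, offering a practical path to low-cost large-model training.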
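The overlap between optimizer updates and backward computation described above can be illustrated with a small scheduling sketch. This is not HiTrain's implementation: `fake_backward`, `sgd_step`, and `train_step_pipelined` are hypothetical names, plain SGD stands in for the real optimizer, and Python lists stand in for tensors. It only shows the structure of handing each layer's update to a CPU worker pool while the next layer's backward pass proceeds.

```python
from concurrent.futures import ThreadPoolExecutor


def sgd_step(params, grads, lr):
    """CPU-side update for one layer (plain SGD, chosen only for illustration)."""
    return [p - lr * g for p, g in zip(params, grads)]


def fake_backward(layer_params):
    """Stand-in for the GPU backward pass: the gradient of sum(p^2)/2 is p."""
    return list(layer_params)


def train_step_pipelined(layers, lr=0.1, workers=2):
    """Walk the layers in backward order and submit each layer's update to a
    worker pool, so the update of one layer can overlap with the backward
    pass of the layer before it."""
    pending = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for idx in reversed(range(len(layers))):
            grads = fake_backward(layers[idx])
            # Do not wait here: the loop moves on to the next layer at once.
            pending[idx] = pool.submit(sgd_step, layers[idx], grads, lr)
        # Collect the updated parameters only once every backward call is issued.
        return [pending[i].result() for i in range(len(layers))]


layers = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
updated = train_step_pipelined(layers)
sequential = [sgd_step(layer, fake_backward(layer), 0.1) for layer in layers]
```

The pipelined result matches the sequential one because each layer's update depends only on that layer's own gradient. In pure Python the global interpreter lock prevents real CPU parallelism, so the sketch demonstrates ordering rather than speedup.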
In erasure-coded storage systems, updating data requires parity maintenance, which often leads to significant I/O amplification due to "write-after-read" operations. Furthermore, scattered parity placement increases disk seek overhead during repair, resulting in degraded system performance. To address these challenges, this paper proposes a Cognitive Update and Repair Method (CURM) that leverages machine learning to classify files into write-only, read-only, and read-write categories, enabling tailored update and repair strategies. For write-only and read-write files, CURM employs a data-difference mechanism combined with fine-grained I/O scheduling to minimize redundant read operations and mitigate I/O amplification. For read-write files, CURM further reserves adjacent disk space near parity blocks, supporting parallel reads and reducing disk seek overhead during repair. We implement CURM in a prototype system, Cognitive Update and Repair File System (CURFS), and conduct extensive experiments using real-world Network File System (NFS) and Microsoft Research (MSR) workloads on a 25-node cluster. Experimental results demonstrate that CURM improves data update throughput by up to 82.52%, reduces recovery time by up to 47.47%, and decreases long-term storage overhead by more than 15% compared to state-of-the-art methods including Full Logging (FL), Parity Logging (PL), Parity Logging with Reserved space (PLR), and PARIX. These results validate the effectiveness of CURM in enhancing both update and repair performance, providing a scalable and efficient solution for large-scale erasure-coded storage systems.
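To make the cost of parity maintenance concrete, the sketch below shows the general delta-update idea for a single XOR parity block: the new parity is derived from the old parity and the difference between the old and new data, so the other data blocks in the stripe do not have to be read. This is a simplified illustration, not CURM's actual data-difference mechanism; the function names are hypothetical, and production erasure codes such as Reed-Solomon additionally scale the delta by a Galois-field coefficient.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks):
    """Recompute parity from scratch by XOR-ing every data block in the stripe."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = xor_bytes(parity, block)
    return parity


def delta_update(parity: bytes, old_block: bytes, new_block: bytes) -> bytes:
    """Update parity from the data difference alone:
    new_parity = old_parity XOR (old_block XOR new_block)."""
    delta = xor_bytes(old_block, new_block)
    return xor_bytes(parity, delta)


blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = full_parity(blocks)

# Overwrite the second block and patch the parity using only the delta.
new_block = b"\xaa\x55"
updated_parity = delta_update(parity, blocks[1], new_block)
blocks[1] = new_block
```

The delta path still reads the old data block and the old parity before writing, which is the "write-after-read" pattern the abstract identifies as the source of I/O amplification.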
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62362019), the Natural Science Foundation of Hainan Province (Grant No. 624RC482), and the Hainan Provincial Higher Education Teaching Reform Research Project (Grant No. Hnjg2024-27).