Abstract: As the parameter scale of large language models (LLMs) continues to grow, fine-tuning models with tens of billions of parameters places extreme demands on compute and storage resources. Conventional distributed training typically relies on large numbers of high-end GPUs and high-speed interconnects, making training prohibitively expensive. Existing single-GPU training schemes alleviate GPU memory pressure through tensor offloading, but still suffer from low I/O transfer efficiency and poor device utilization. Traditional kernel-space I/O introduces frequent system calls and context switches during large-scale tensor migration, becoming a key performance bottleneck; meanwhile, optimizer computation cannot fully exploit the parallelism of multi-core CPUs and is difficult to overlap effectively with GPU computation, further limiting system performance. To address these problems, this paper proposes HiTrain, a heterogeneous memory offloading and I/O optimization scheme for large-model training. First, it builds a high-performance tensor storage module based on the Storage Performance Development Kit (SPDK), managing tensor data in user space to avoid kernel I/O stack overhead and thereby improve the concurrency and throughput of tensor offloading. Second, it designs and implements a storage-compute pipeline scheduling module based on an asynchronous optimizer, reordering optimizer execution to reduce GPU idle time and improve overall training efficiency. Experimental results show that on a server equipped with a single GPU and a non-volatile memory express solid state drive (NVMe SSD), the proposed scheme fully utilizes the system's compute and storage resources, improving tensor offload/load efficiency by 32.7% and raising overall training throughput to 1.49× that of existing schemes, providing a practical path toward low-cost large-model training.
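The storage-compute overlap described above can be sketched in miniature: while tensor offload I/O is in flight on background threads, per-layer optimizer updates proceed on the CPU. This is an illustrative stand-in, assuming hypothetical `offload` and `optimizer_step` functions; it is not HiTrain's actual API or its SPDK data path.

```python
from concurrent.futures import ThreadPoolExecutor

def offload(layer):
    # Stand-in for a user-space tensor write (SPDK in the real system).
    return f"offloaded:{layer}"

def optimizer_step(layer):
    # Stand-in for a CPU-side optimizer update for one layer.
    return f"updated:{layer}"

def pipelined_step(layers, workers=4):
    """Overlap offload I/O with optimizer compute, in the spirit of
    the storage-compute pipeline sketched in the abstract."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Kick off all offload I/O asynchronously ...
        io_futures = [pool.submit(offload, l) for l in layers]
        # ... and run optimizer updates on the CPU while I/O is in flight.
        for l in layers:
            results.append(optimizer_step(l))
        results.extend(f.result() for f in io_futures)
    return results
```

In the real system the overlap window is what hides GPU idle time; the thread pool here merely illustrates the scheduling shape.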
Funding: Supported by the Tianjin Manufacturing High Quality Development Special Foundation (No. 20232185) and the Roycom Foundation (No. 70306901).
Abstract: Hard disk drives (HDDs) serve as the primary storage devices in modern data centers. Once a failure occurs, it often leads to severe data loss, significantly degrading the reliability of storage systems. Numerous studies have proposed machine learning-based HDD failure prediction models. However, Self-Monitoring, Analysis, and Reporting Technology (SMART) attributes differ across HDD manufacturers. We define hard drives of the same brand and model as homogeneous HDD groups, and those from different brands or models as heterogeneous HDD groups. In practical engineering scenarios, a data center is often composed of a heterogeneous population of HDDs spanning multiple vendors and models. Existing research predominantly focuses on homogeneous datasets, ignoring a model's generalization capability across heterogeneous HDDs. As a result, HDD models with limited samples often suffer from poor training effectiveness and prediction performance. To address this issue, we investigate generalizable SMART predictors across heterogeneous HDD groups. By extracting time-series features within a fixed sliding time window, we propose the Heterogeneous Disk Failure Prediction Method based on Time Series Features (HDFPM) framework. This method is adaptable to HDD models with limited sample sizes, thereby enhancing its applicability and robustness across diverse drive populations. Experimental results show that the proposed model achieves an F1-score of 0.9518 when applied to two different Seagate HDD models, while maintaining the False Positive Rate (FPR) below 1%. After incorporating the Complexity-Ratio Dynamic Time Warping (CDTW) based feature enhancement method, the best prediction model achieves a True Positive Rate (TPR) of up to 0.93 between the two models. For next-day failure prediction across various Seagate models, the model achieves an F1-score of up to 0.8792. Moreover, the experimental results also show that, within the same brand, the higher the proportion of shared SMART attributes across different models, the better the prediction performance. In addition, HDFPM demonstrates the best stability and most significant performance in heterogeneous environments.
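The fixed-sliding-window feature extraction mentioned above can be sketched as follows. This is a minimal illustration of the general idea (summary statistics and a trend over each window of a SMART attribute sequence); the actual HDFPM feature set is not specified here, and `window_features` is a hypothetical name.

```python
import statistics

def window_features(values, window=7):
    """Slide a fixed window over one SMART attribute's daily values and
    emit simple time-series features per window."""
    feats = []
    for i in range(len(values) - window + 1):
        w = values[i:i + window]
        feats.append({
            "mean": statistics.mean(w),         # central tendency
            "std": statistics.pstdev(w),        # variability
            "slope": (w[-1] - w[0]) / (window - 1),  # crude linear trend
        })
    return feats
```

Because such features are computed the same way regardless of vendor-specific raw attribute scales, they are one natural ingredient for predictors meant to transfer across heterogeneous drive models.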
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62362019), the Natural Science Foundation of Hainan Province (Grant No. 624RC482), and the Hainan Provincial Higher Education Teaching Reform Research Project (Grant No. Hnjg2024-27).
Abstract: In erasure-coded storage systems, updating data requires parity maintenance, which often leads to significant I/O amplification due to "write-after-read" operations. Furthermore, scattered parity placement increases disk seek overhead during repair, resulting in degraded system performance. To address these challenges, this paper proposes a Cognitive Update and Repair Method (CURM) that leverages machine learning to classify files into write-only, read-only, and read-write categories, enabling tailored update and repair strategies. For write-only and read-write files, CURM employs a data-difference mechanism combined with fine-grained I/O scheduling to minimize redundant read operations and mitigate I/O amplification. For read-write files, CURM further reserves adjacent disk space near parity blocks, supporting parallel reads and reducing disk seek overhead during repair. We implement CURM in a prototype system, the Cognitive Update and Repair File System (CURFS), and conduct extensive experiments using real-world Network File System (NFS) and Microsoft Research (MSR) workloads on a 25-node cluster. Experimental results demonstrate that CURM improves data update throughput by up to 82.52%, reduces recovery time by up to 47.47%, and decreases long-term storage overhead by more than 15% compared to state-of-the-art methods including Full Logging (FL), Parity Logging (PL), Parity Logging with Reserved space (PLR), and PARIX. These results validate the effectiveness of CURM in enhancing both update and repair performance, providing a scalable and efficient solution for large-scale erasure-coded storage systems.
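The "data-difference" idea builds on the classic parity-delta update for XOR-based parity: instead of re-reading every data block in the stripe, only the updated block and its parity are touched, and the parity absorbs the XOR difference. The sketch below shows that standard technique; CURM's actual mechanism and scheduling are more elaborate, and `delta_update` is an illustrative name.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def delta_update(old_data, new_data, old_parity):
    """Classic parity-delta update: new_parity = old_parity XOR
    (old_data XOR new_data). Sibling data blocks in the stripe
    are never read, avoiding write-after-read amplification."""
    delta = xor_bytes(old_data, new_data)
    return xor_bytes(old_parity, delta)
```

The correctness check is that the delta-updated parity equals the parity recomputed from scratch over the updated stripe.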
Funding: Supported in part by the National Key R&D Program of China (2022YFB4501200), the National Natural Science Foundation of China (62332018), and the Science and Technology Program (2024NSFTD0031, 2024YFHZ0339, and 2025ZNSFSC0497).
Abstract: The performance of data restore is one of the key indicators of user experience for backup storage systems. Compared to the traditional offline restore process, online restore reduces downtime during backup restoration, allowing users to operate on already restored files while other files are still being restored. This approach improves availability during restoration tasks but suffers from a critical limitation: inconsistencies between the access sequence and the restore sequence. In many cases, the file a user needs to access at a given moment may not yet be restored, resulting in significant delays and a poor user experience. To this end, we present Histore, which builds on the user's historical access sequence to schedule the restore sequence, in order to reduce users' access delay. Histore includes three restore approaches: (i) the frequency-based approach, which restores files based on historical file access frequencies and prioritizes ensuring the availability of frequently accessed files; (ii) the graph-based approach, which preferentially restores frequently accessed files as well as their correlated files based on historical access patterns; and (iii) the trie-based approach, which uses both users' real-time and historical access patterns to deduce and restore the files likely to be accessed in the near future. We implement a prototype of Histore and evaluate its performance from multiple perspectives. Trace-driven experiments on two datasets show that Histore significantly reduces users' delay time by 4-700× with only 1.0%-14.5% additional performance overhead.
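The simplest of the three approaches, frequency-based restore scheduling, can be sketched as ordering the restore queue by historical access counts, most-accessed first. This is a minimal illustration of the idea only; `restore_order` is a hypothetical name, and Histore's graph- and trie-based approaches add correlation and prefix structure on top of this.

```python
from collections import Counter

def restore_order(access_log, all_files):
    """Order files for restore by historical access frequency
    (descending), breaking ties by file name for determinism.
    Files never seen in the log sort last."""
    freq = Counter(access_log)   # missing keys count as 0
    return sorted(all_files, key=lambda f: (-freq[f], f))
```

Under this policy a frequently accessed file becomes available early in the restore window, which is exactly what shrinks the gap between the access sequence and the restore sequence.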