Funding: National Key Research and Development Program of China, Grant 2023YFB3001504.
Abstract: Efficient program execution on massively parallel clusters is critical for fields such as scientific computing and artificial intelligence. However, traditional task scheduling algorithms do not fully leverage platform characteristics, resulting in inefficiency and long task execution times. We propose KANETAS, a reinforcement learning-based elastic scheduling algorithm for DAG (Directed Acyclic Graph) tasks, designed to adapt to DAG workloads of various scales and structures. KANETAS applies the Kolmogorov-Arnold Network (KAN) to the DAG scheduling problem. It improves the utilization of heterogeneous hardware by combining Graph Convolutional Networks (GCN) with the Advantage Actor-Critic algorithm (A2C), recognizing hardware features and assigning tasks to appropriate computing units. We conducted extensive experiments comparing the proposed solution with four strong baseline algorithms, including a state-of-the-art heuristic method and several deep reinforcement learning-based algorithms. The experimental results suggest that KANETAS reduces the average makespan of the best baseline algorithm by up to 13.1%. Furthermore, the KAN version outperforms an otherwise identical MLP version, and the proposed model demonstrates a clear advantage in load balancing.
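The abstract's key architectural choice is replacing MLP layers with KAN layers. The distinction can be sketched as follows: a KAN layer puts a learnable univariate function on every input-output edge rather than a fixed activation after a weighted sum. This is an illustrative reconstruction, not the paper's implementation; real KAN layers use B-spline bases, whereas this sketch uses a simple polynomial basis per edge.

```python
import numpy as np

class KANLayer:
    """Minimal sketch of a Kolmogorov-Arnold Network layer.

    Unlike an MLP layer (learnable weights, fixed activation), a KAN layer
    places a learnable univariate function on every input-output edge and
    sums the edge outputs at each output node. Here each edge function is a
    small polynomial expansion; actual KAN implementations use B-splines.
    """

    def __init__(self, in_dim, out_dim, degree=3, seed=0):
        rng = np.random.default_rng(seed)
        # One coefficient vector per (input, output) edge.
        self.coeffs = rng.normal(0.0, 0.1, size=(in_dim, out_dim, degree + 1))
        self.degree = degree

    def forward(self, x):
        # x: (batch, in_dim); basis: powers x^0 .. x^degree per input feature.
        powers = np.stack([x ** k for k in range(self.degree + 1)], axis=-1)
        # phi[b, i, o] = sum_k coeffs[i, o, k] * x[b, i]^k  (edge functions)
        phi = np.einsum('bik,iok->bio', powers, self.coeffs)
        # Each output node sums its incoming edge functions.
        return phi.sum(axis=1)

layer = KANLayer(in_dim=4, out_dim=2)
out = layer.forward(np.ones((3, 4)))
print(out.shape)  # (3, 2)
```

In the scheduler described above, such layers would sit on top of GCN-produced task embeddings inside the A2C policy and value heads.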
Funding: Supported by the National Key Research and Development Program of China under Grant 2016YFB0200902.
Abstract: The energy consumption of large-scale heterogeneous computing systems has become a critical concern on both financial and environmental fronts. Current systems employ hand-crafted heuristics and ignore changes in system and workload characteristics. Moreover, high-dimensional state and action spaces cannot be handled efficiently by traditional reinforcement learning methods in large-scale heterogeneous settings. This paper therefore proposes energy-aware task scheduling with deep reinforcement learning (DRL). First, based on the real-world SPECpower dataset, we design a high-precision energy consumption model that is convenient for environment simulation. Next, reflecting actual production conditions, we propose a partition-based task-scheduling algorithm that applies proximal policy optimization to heterogeneous resources. Simultaneously, an auto-encoder compresses the high-dimensional state space to speed up DRL convergence. Finally, to fully validate the algorithm, we simulate three scheduling scenarios covering large-, medium-, and small-scale heterogeneous environments. Experiments show that, compared with heuristic and DRL-based methods, our algorithm reduces system energy consumption more effectively and preserves quality of service without significantly increasing waiting time.
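A SPECpower-derived energy model of the kind mentioned above is commonly built by interpolating between the benchmark's measured power draws at fixed utilization levels. The sketch below shows that idea; the wattage values are illustrative placeholders, not actual SPECpower_ssj2008 results, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical SPECpower-style measurements: average power (W) of one server
# at utilization levels 0%, 10%, ..., 100%. Real values would come from a
# SPECpower_ssj2008 result; these numbers are illustrative only.
UTIL_LEVELS = np.linspace(0.0, 1.0, 11)
POWER_WATTS = np.array([58, 98, 109, 118, 128, 140, 153, 170, 189, 205, 222],
                       dtype=float)

def server_power(utilization):
    """Estimate instantaneous power draw by piecewise-linear interpolation
    between the measured utilization levels, as many energy models built on
    SPECpower data do."""
    u = np.clip(utilization, 0.0, 1.0)
    return np.interp(u, UTIL_LEVELS, POWER_WATTS)

def energy_joules(util_trace, dt_seconds=1.0):
    """Integrate power over a utilization trace sampled every dt seconds."""
    return float(np.sum(server_power(np.asarray(util_trace))) * dt_seconds)

print(server_power(0.0))   # 58.0 (idle power)
print(server_power(0.55))  # 146.5, midway between the 50% and 60% readings
```

A DRL scheduler can then use such a model as the simulated environment's reward signal: lower integrated energy for the same completed work yields higher reward.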
Abstract: Supercomputing technology has supported the solution of cutting-edge scientific and complex engineering problems since its inception, serving in each era as a comprehensive representation of the most advanced computer hardware and software technologies. Over nearly 80 years of development, supercomputing has progressed from being oriented toward computationally intensive tasks to a hybrid of computationally intensive and data-intensive tasks. Driven by the continuous development of high-performance data analytics (HPDA) applications, such as big data, deep learning, and other intelligent workloads, supercomputing storage systems face challenges including a surge in the data volume of computational tasks, increased and diversified computing power, and higher reliability and availability requirements. Against this background, data-intensive supercomputing, deeply integrated with data centers and intelligent computing centers, aims to solve the storage-system problems of complex data-type optimization, mixed-load optimization, multi-protocol support, and interoperability, and is thus becoming the main protagonist of research and development today and for some time to come. This paper first introduces key concepts in HPDA and data-intensive computing, then illustrates how well existing platforms support data-intensive applications by analyzing today's most representative supercomputing platforms (Fugaku, Summit, Sunway TaihuLight, and Tianhe-2A). We then illustrate the actual demand for data-intensive applications in mainstream scientific and industrial communities from both scientific and commercial perspectives. Next, we provide an outlook on future trends and the potential challenges data-intensive supercomputing faces. In short, this paper gives researchers and practitioners a quick overview of the key concepts and developments in supercomputing, and captures the current and future data-intensive supercomputing research hotspots and key issues that need to be addressed.
Funding: National Natural Science Foundation of China (No. 62172327).
Abstract: To ensure the reliability and availability of data, distributed storage systems require redundancy strategies. Erasure coding, one representative redundancy strategy, has the advantage of low storage overhead, which facilitates its use in distributed storage systems. Among the various erasure coding schemes, XOR-based erasure codes are becoming popular due to their high computing speed. When a single node fails under such a scheme, a data recovery process retrieves the failed node's lost data from the surviving nodes; however, the data transmission this requires usually takes a considerable amount of time. Prior research has focused mainly on reducing the amount of data read during recovery so as to shorten transmission time, but it has encountered problems such as high complexity and local optima. In this paper, we propose a random search recovery algorithm, named SA-RSR, to speed up single-node failure recovery of XOR-based erasure codes. SA-RSR uses simulated annealing to search for an optimal recovery solution that reads and transmits a minimum amount of data, and this search completes in polynomial time. We evaluate SA-RSR with a variety of XOR-based erasure codes in simulations and in a real storage system, Ceph. Experimental results in Ceph show that SA-RSR reduces the amount of data required for recovery by up to 30.0% and improves data recovery performance by up to 20.36% compared to the conventional recovery method.
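The core idea in the abstract above, using simulated annealing to pick, for each lost symbol, one XOR recovery equation so that the union of symbols read is smallest, can be sketched in a few lines. This is an illustrative reconstruction under assumed inputs, not the authors' SA-RSR implementation; `sa_recovery_search` and its toy equations are hypothetical.

```python
import math
import random

def sa_recovery_search(equations, iters=2000, t0=2.0, cooling=0.995, seed=42):
    """Sketch of an SA-RSR-style search.

    equations[i] lists the alternative XOR recovery equations for the i-th
    lost symbol; each equation is a frozenset of surviving symbols it must
    read. A state picks one equation per lost symbol; the cost is the number
    of distinct symbols read overall, which we want to minimize.
    """
    rng = random.Random(seed)
    state = [rng.randrange(len(eqs)) for eqs in equations]

    def cost(s):
        read = set()
        for i, choice in enumerate(s):
            read |= equations[i][choice]
        return len(read)

    best, best_cost, temp = list(state), cost(state), t0
    for _ in range(iters):
        # Neighbor: re-pick the equation for one randomly chosen lost symbol.
        nbr = list(state)
        i = rng.randrange(len(nbr))
        nbr[i] = rng.randrange(len(equations[i]))
        delta = cost(nbr) - cost(state)
        # Accept improvements always; worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state = nbr
            if cost(state) < best_cost:
                best, best_cost = list(state), cost(state)
        temp *= cooling
    return best, best_cost

# Toy example: two lost symbols, each with two candidate XOR equations.
eqs = [
    [frozenset({'a', 'b'}), frozenset({'c', 'd', 'e'})],
    [frozenset({'a', 'c'}), frozenset({'b', 'a'})],
]
choices, n_reads = sa_recovery_search(eqs)
print(n_reads)  # 2: picking {'a','b'} and {'a','b'} reads only two symbols
```

Because each iteration does constant work per lost symbol, the search cost grows polynomially with the stripe size, consistent with the polynomial-time claim in the abstract.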