Journal Articles
4 articles found
1. KANETAS: an elastic scheduler for heterogeneous many-core systems
Authors: Zhao Mao, Xingjun Zhang, Longxiang Wang. CCF Transactions on High Performance Computing, 2025, No. 3, pp. 179-193 (15 pages).
Efficient program execution on massively parallel clusters is critical for fields like scientific computing and artificial intelligence. However, traditional task scheduling algorithms do not fully leverage platform characteristics, resulting in inefficiency and long task execution times. We propose KANETAS, a reinforcement learning-based DAG (Directed Acyclic Graph) elastic task scheduling algorithm designed to adapt to DAG tasks of various scales and structures. It applies the Kolmogorov-Arnold Network (KAN) to the DAG scheduling problem, and improves the efficiency of heterogeneous hardware by using Graph Convolutional Networks (GCN) and the Advantage Actor-Critic algorithm (A2C) to recognize hardware features and assign tasks to appropriate computing units. We conducted extensive experiments comparing the proposed solution against four strong baselines, including a state-of-the-art heuristic and a variety of deep reinforcement learning-based algorithms. The results show that KANETAS reduces the average makespan of the best baseline by up to 13.1%. Furthermore, the KAN version outperforms the MLP version, and the proposed model demonstrates a clear advantage in load balancing.
Keywords: Reinforcement learning; Task scheduling algorithm; Graph neural network; Heterogeneous computing; Kolmogorov-Arnold network
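The abstract describes scheduling a DAG of tasks onto heterogeneous compute units so as to minimize makespan. The sketch below shows only the environment side of that problem, i.e. how a given task-to-unit assignment yields a makespan on units of different speeds; the names, DAG encoding, and cost model are illustrative assumptions, and the paper's learning components (KAN, GCN, A2C) are not reproduced.

```python
# Toy makespan evaluator for the heterogeneous DAG scheduling setting
# (illustrative only; not the KANETAS implementation).
from collections import deque

def makespan(succ, work, speed, assign):
    """succ: task -> list of successor tasks (a DAG);
    work: task -> abstract work units;
    speed: unit -> work units processed per second;
    assign: task -> unit chosen by the scheduling policy."""
    # Count predecessors so tasks are visited in topological order.
    preds = {t: 0 for t in succ}
    for t, successors in succ.items():
        for s in successors:
            preds[s] += 1
    ready = deque(t for t, c in preds.items() if c == 0)
    unit_free = {u: 0.0 for u in speed}  # when each unit becomes idle
    task_done = {}                       # finish time of each task
    while ready:
        t = ready.popleft()
        u = assign[t]
        # A task starts once its unit is idle and all predecessors finished.
        start = max([unit_free[u]] +
                    [task_done[p] for p, ss in succ.items() if t in ss])
        task_done[t] = start + work[t] / speed[u]
        unit_free[u] = task_done[t]
        for s in succ[t]:
            preds[s] -= 1
            if preds[s] == 0:
                ready.append(s)
    return max(task_done.values())
```

A scheduler (learned or heuristic) would search over `assign`; the evaluator above is what turns each candidate assignment into the makespan objective the abstract refers to.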
2. Energy-aware task scheduling optimization with deep reinforcement learning for large-scale heterogeneous systems
Authors: Jingbo Li, Xingjun Zhang, Zheng Wei, Jia Wei, Zeyu Ji. CCF Transactions on High Performance Computing, 2021, No. 4, pp. 383-392 (10 pages).
The energy consumption of large-scale heterogeneous computing systems has become a critical concern on both financial and environmental fronts. Current systems employ hand-crafted heuristics and ignore changes in system and workload characteristics. Moreover, high-dimensional state and action problems cannot be solved efficiently by traditional reinforcement learning methods in large-scale heterogeneous settings. This paper therefore proposes energy-aware task scheduling with deep reinforcement learning (DRL). First, based on the real SPECpower data set, a high-precision energy consumption model convenient for environment simulation is designed. Then, reflecting actual production conditions, a partition-based task scheduling algorithm using proximal policy optimization on heterogeneous resources is proposed, with an auto-encoder used to compress the high-dimensional state space and speed up DRL convergence. Finally, to fully verify the algorithm, three scheduling scenarios covering large, medium, and small-scale heterogeneous environments are simulated. Experiments show that, compared with heuristic and DRL-based methods, the algorithm more effectively reduces system energy consumption and ensures quality of service without significantly increasing waiting time.
Keywords: Task scheduling; Large-scale heterogeneous systems; Deep reinforcement learning; Resource management; Cloud computing
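The abstract mentions an energy model built from SPECpower measurements. SPECpower_ssj2008 reports average power at stepped utilization levels, so one common modeling approach, sketched below, interpolates linearly between the measured points. The wattage figures and function names are made-up assumptions, not values from the paper.

```python
# Hedged sketch of a SPECpower-style power model: linear interpolation
# between measured (utilization, watts) points. Illustrative only.

def make_power_model(points):
    """points: list of (utilization in [0, 1], watts), sorted by utilization."""
    def power(u):
        # Clamp to the measured range.
        u = min(max(u, points[0][0]), points[-1][0])
        for (u0, p0), (u1, p1) in zip(points, points[1:]):
            if u0 <= u <= u1:
                # Linear interpolation within the bracketing segment.
                return p0 + (p1 - p0) * (u - u0) / (u1 - u0)
        return points[-1][1]
    return power

# Example with made-up measurements: idle 60 W, 50% load 120 W, full load 200 W.
power = make_power_model([(0.0, 60.0), (0.5, 120.0), (1.0, 200.0)])
energy_joules = power(0.25) * 3600  # 25% utilization held for one hour
```

A DRL scheduler can then score candidate placements by integrating such per-node power curves over the simulated schedule, which is the role the abstract's energy model plays in environment simulation.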
3. Status, challenges and trends of data-intensive supercomputing
Authors: Jia Wei, Mo Chen, Longxiang Wang, Pei Ren, Yujia Lei, Yuqi Qu, Qiyu Jiang, Xiaoshe Dong, Weiguo Wu, Qiang Wang, Kaili Zhang, Xingjun Zhang. CCF Transactions on High Performance Computing, 2022, No. 2, pp. 211-230 (20 pages).
Supercomputing technology has supported the solution of cutting-edge scientific and complex engineering problems since its inception, serving as a comprehensive representation of the most advanced computer hardware and software technologies of its time. Over the course of nearly 80 years of development, supercomputing has progressed from being oriented towards computationally intensive tasks to a hybrid of computationally and data-intensive tasks. Driven by the continuous development of high performance data analytics (HPDA) applications such as big data, deep learning, and other intelligent workloads, supercomputing storage systems face challenges including a sudden increase in data volume for computational processing tasks, increased and diversified computing power, and higher reliability and availability requirements. Against this background, data-intensive supercomputing, deeply integrated with data centers and smart computing centers, aims to solve the problems of complex data type optimization, mixed-load optimization, multi-protocol support, and interoperability in the storage system, and has become a main focus of research and development today and for some time to come. This paper first introduces key concepts in HPDA and data-intensive computing, and then illustrates the extent to which existing platforms support data-intensive applications by analyzing the most representative supercomputing platforms of today (Fugaku, Summit, Sunway TaihuLight, and Tianhe-2A). This is followed by an illustration of the actual demand for data-intensive applications in mainstream scientific and industrial communities, from the perspectives of both scientific and commercial applications. Next, we provide an outlook on future trends and potential challenges facing data-intensive supercomputing. In short, this paper gives researchers and practitioners a quick overview of the key concepts and developments in supercomputing, and captures the current and future data-intensive supercomputing research hotspots and the key issues that need to be addressed.
Keywords: Data-intensive supercomputing; I/O-intensive supercomputing; High performance data analytics; Parallel processing systems; Supercomputing storage
4. SA-RSR: a read-optimal data recovery strategy for XOR-coded distributed storage systems (cited: 1)
Authors: Xingjun Zhang, Ningjing Liang, Yunfei Liu, Changjiang Zhang, Yang Li. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2022, No. 6, pp. 858-875 (18 pages).
To ensure the reliability and availability of data, redundancy strategies are always required in distributed storage systems. Erasure coding, one of the representative redundancy strategies, has the advantage of low storage overhead, which facilitates its employment in distributed storage systems. Among the various erasure coding schemes, XOR-based erasure codes are becoming popular due to their high computing speed. When a single-node failure occurs in such coding schemes, a data recovery process retrieves the failed node's lost data from the surviving nodes. However, data transmission during recovery usually requires a considerable amount of time. Existing research has focused mainly on reducing the amount of data read for recovery so as to shorten data transmission, but it has encountered problems such as high complexity and local optima. In this paper, we propose a random search recovery algorithm, named SA-RSR, to speed up single-node failure recovery of XOR-based erasure codes. SA-RSR uses simulated annealing to search for an optimal recovery solution that reads and transmits a minimum amount of data, and this search completes in polynomial time. We evaluate SA-RSR with a variety of XOR-based erasure codes both in simulations and in a real storage system, Ceph. Experimental results in Ceph show that SA-RSR reduces the amount of data required for recovery by up to 30.0% and improves data recovery performance by up to 20.36% compared with the conventional recovery method.
Keywords: Distributed storage system; Data reliability and availability; XOR-based erasure codes; Single-node failure; Data recovery
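SA-RSR's core idea, per the abstract, is a simulated annealing search over candidate recovery solutions whose cost is the amount of data read. The skeleton below shows the generic annealing loop; the cost function and neighbour move are placeholders (a toy bit-vector standing in for "symbols read"), since the paper's actual recovery-equation encoding is not reproduced here.

```python
# Minimal simulated-annealing skeleton of the kind SA-RSR builds on.
# Cost and neighbour functions are illustrative placeholders.
import math
import random

def anneal(cost, neighbour, x0, t0=10.0, cooling=0.95, steps=500, seed=0):
    rng = random.Random(seed)
    x, best = x0, x0
    t = t0
    for _ in range(steps):
        y = neighbour(x, rng)
        delta = cost(y) - cost(x)
        # Always accept improvements; accept worse moves with
        # Boltzmann probability exp(-delta / t), shrinking as t cools.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = y
        if cost(x) < cost(best):
            best = x
        t *= cooling
    return best

# Toy usage: minimise the number of 1-bits in a vector (a stand-in for
# "amount of data read"); each move flips one random position.
def cost(v):
    return sum(v)

def neighbour(v, rng):
    i = rng.randrange(len(v))
    w = list(v)
    w[i] ^= 1
    return w

best = anneal(cost, neighbour, [1] * 8)
```

In the real algorithm the state would encode which surviving symbols participate in each recovery equation, and the annealing temperature schedule trades exploration against convergence, which is how the local-optima problem of greedy searches is avoided.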