Abstract
GPU clusters have become an important platform for high-performance computing. For compute-intensive applications in particular, they offer low cost, high performance, and low power consumption. To address task load balancing in a running GPU cluster, this paper proposes a scheduling method for heterogeneous GPU clusters aimed at compute-intensive applications. The method automatically discovers compute nodes, dynamically estimates each node's computing capability, and allocates computing resources across the heterogeneous cluster according to node capability and each task's computational intensity and priority. The system is also fault tolerant: it handles the unexpected exit of compute nodes, recovers the tasks that were running on them, and adapts dynamically to the changing scale of the system. Experimental results show that the strategy achieves its intended purpose.
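The allocation and recovery ideas summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it substitutes a simple greedy earliest-finish-time heuristic, and every name in it (`Task`, `Node`, `schedule`, `handle_node_exit`) is hypothetical. Node capabilities are assumed to be given as numbers; automatic node discovery and dynamic capability estimation are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    intensity: float   # assumed estimate of the work the task requires
    priority: int      # assumed convention: larger means more urgent


@dataclass
class Node:
    name: str
    capability: float  # assumed estimate of work processed per unit time
    tasks: list = field(default_factory=list)

    def finish_time(self, extra=0.0):
        # Projected time to drain the queued work plus `extra` new work.
        load = sum(t.intensity for t in self.tasks) + extra
        return load / self.capability


def schedule(tasks, nodes):
    # Place urgent tasks first; among equal priority, heavier tasks first.
    for task in sorted(tasks, key=lambda t: (-t.priority, -t.intensity)):
        # Greedy choice: the node that would finish this task soonest.
        best = min(nodes, key=lambda n: n.finish_time(task.intensity))
        best.tasks.append(task)


def handle_node_exit(failed, nodes):
    # Drop the failed node and reschedule its tasks on the survivors.
    survivors = [n for n in nodes if n is not failed]
    orphaned, failed.tasks = failed.tasks, []
    schedule(orphaned, survivors)
    return survivors
```

Because each placement uses the projected finish time, a faster node naturally receives more work than a slower one, and calling `handle_node_exit` redistributes a departed node's tasks under the same rule.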
Source
《计算机技术与发展》
2012, Issue 5, pp. 32-36 (5 pages)
Computer Technology and Development
Keywords
load balancing
heterogeneous GPU cluster
task scheduling
dynamic adaptation