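Abstract: Simulation point (SimPoint) is a representative sampling technique widely used in pre-silicon processor performance evaluation. SimPoint determines the number of simulation points for each program under evaluation according to the Bayesian information criterion. However, the programs in a standard benchmark suite differ in behavioral complexity and need different numbers of simulation points to characterize their behavior accurately. SimPoint cannot recognize these differences in complexity: given a fixed total number of simulation points, it cannot allocate more points to programs with complex behavior to reduce their performance evaluation error, nor fewer points to programs with simple behavior without losing evaluation accuracy. Because simulation points are not allocated sensibly across the suite, SimPoint yields a fairly accurate average performance evaluation error, but the error for some benchmarks with complex behavior remains large. To address this problem, this paper optimizes SimPoint's local allocation of simulation points and proposes a global greedy allocation method, the greedy point (GreedyPoint) method. The method abstracts simulation point allocation as a constrained optimization problem, computes the representation error from microarchitecture-independent features, and solves the optimization problem with a global greedy algorithm. Experimental data show that, at the same simulation cost, GreedyPoint reduces the average performance evaluation error on the SPEC CPU 2017 suite from 3.23% to 2.08% compared with SimPoint, and reduces the maximum performance evaluation error from 21.22% to 7.01%.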
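The abstract does not spell out the greedy procedure, so the following is only a minimal sketch of one plausible reading: a marginal-gain greedy that hands out a fixed budget of simulation points across programs. The function name `greedy_allocate`, the `error_curves` input (a precomputed representation error for each program at 1, 2, ... points), and the rule "give the next point to the program whose error drops the most" are all assumptions made for illustration, not details taken from the paper.

```python
import heapq


def greedy_allocate(error_curves, budget):
    """Distribute `budget` simulation points across programs (illustrative).

    error_curves[name][k - 1] is the assumed, precomputed representation
    error of program `name` when it is given k simulation points.
    Returns a dict mapping each program name to its number of points.
    """
    if budget < len(error_curves):
        raise ValueError("budget must allow at least one point per program")

    # Constraint assumed here: every program receives at least one point.
    alloc = {name: 1 for name in error_curves}

    # Max-heap (via negated gains) of the error reduction obtained by
    # giving each program one more point.
    heap = []
    for name, curve in error_curves.items():
        if len(curve) > 1:
            heapq.heappush(heap, (-(curve[0] - curve[1]), name))

    remaining = budget - len(alloc)
    while remaining > 0 and heap:
        _, name = heapq.heappop(heap)
        alloc[name] += 1
        remaining -= 1
        k = alloc[name]
        curve = error_curves[name]
        if k < len(curve):
            # Gain of moving this program from k to k + 1 points.
            heapq.heappush(heap, (-(curve[k - 1] - curve[k]), name))
    return alloc


if __name__ == "__main__":
    # Hypothetical error curves: one simple program, one complex program.
    curves = {
        "simple": [0.05, 0.04, 0.04],
        "complex": [0.30, 0.15, 0.08, 0.05],
    }
    print(greedy_allocate(curves, budget=5))
```

With these made-up curves the complex program absorbs the extra points, which mirrors the abstract's goal of spending the budget where behavior is hardest to characterize. A per-program rule would not make that trade across programs.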
Funding: Jun Li was supported by the National Natural Science Foundation of China (No. U2441215). Lisheng Liu and Xin Lai were supported by the National Natural Science Foundation of China (No. 52494933).
Abstract: Peridynamics (PD) demonstrates unique advantages in addressing fracture problems; however, its nonlocality and meshfree discretization result in high computational and storage costs. Moreover, in engineering applications, the computational scale of classical GPU parallel schemes is often limited by the finite graphics memory of GPU devices. In the present study, we develop an efficient particle information management strategy based on the cell-linked list method and, on this basis, propose a subdomain-based GPU parallel scheme that exhibits outstanding acceleration performance in specific compute kernels while significantly reducing graphics memory usage. Compared to the classical parallel scheme, the cell-linked list method facilitates efficient management of particle information within subdomains, enabling the proposed scheme to reduce graphics memory usage by optimizing the size and number of subdomains while significantly improving the speed of the neighbor search. As demonstrated in PD examples, the proposed scheme dramatically improves neighbor search efficiency and achieves a significant speedup relative to serial programs. For instance, excluding data transmission time, it achieves a speedup of nearly 1076.8× in one test case, owing to its computational efficiency in the neighbor search. Additionally, for 2D and 3D PD models with tens of millions of particles, graphics memory usage can be reduced by up to 83.6% and 85.9%, respectively. This subdomain-based GPU parallel scheme therefore avoids graphics memory shortages while significantly improving computational efficiency, providing new insights for studying more complex large-scale problems.
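To make the cell-linked list idea concrete, here is a minimal serial 2D sketch in Python. It is not the authors' GPU implementation: the function names `build_cells` and `neighbor_lists` and the choice of a cell size equal to the PD horizon are assumptions for illustration. Particles are binned into square cells, and each particle is compared only against particles in its own cell and the eight adjacent cells, not against every other particle.

```python
import math
from collections import defaultdict


def build_cells(points, cell_size):
    """Bin particle indices by the integer coordinates of their cell."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append(i)
    return cells


def neighbor_lists(points, horizon):
    """Return, for each particle, the indices of particles within `horizon`.

    Assumption: the cell size equals the horizon, so every neighbor of a
    particle lies in its own cell or in one of the 8 adjacent cells.
    """
    cells = build_cells(points, horizon)
    h2 = horizon * horizon
    neighbors = [[] for _ in points]
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty cells while iterating.
                candidates = cells.get((cx + dx, cy + dy))
                if not candidates:
                    continue
                for i in members:
                    xi, yi = points[i]
                    for j in candidates:
                        if i == j:
                            continue
                        xj, yj = points[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= h2:
                            neighbors[i].append(j)
    return neighbors


if __name__ == "__main__":
    # Hypothetical 6 x 6 grid of particles with spacing 0.1.
    pts = [(0.1 * i, 0.1 * j) for i in range(6) for j in range(6)]
    nbrs = neighbor_lists(pts, horizon=0.25)
    print(len(nbrs[0]), "neighbors for the corner particle")
```

Because each cell touches only its immediate surroundings, a group of cells can in principle be processed with just its own particles plus a one-cell halo resident in memory. This is one way to picture how a subdomain decomposition could bound graphics memory usage, though the abstract does not describe the actual partitioning.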