Abstract: Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques improve on-chip data reuse and optimize memory access performance. This paper describes an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS, and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
Funding: This work is supported by the National Key Research and Development Plan program of the Ministry of Science and Technology of China (No. 2016YFB0201100). Additionally, this work is supported by the National Laboratory for Marine Science and Technology (Qingdao) Major Project of the Aoshan Science and Technology Innovation Program (No. 2018ASKJ01-04) and the Open Foundation of Key Laboratory of Marine Science and Numerical Simulation, Ministry of Natural Resources (No. 2021-YB-02).
Abstract: In this paper, a typical experiment is carried out based on a high-resolution air-sea coupled model, namely, the coupled ocean-atmosphere-wave-sediment transport (COAWST) model, on both heterogeneous many-core (SW) and homogeneous multicore (Intel) supercomputing platforms. We construct a hindcast of Typhoon Lekima on both the SW and Intel platforms, compare the simulation results between these two platforms, and compare the key elements of the atmospheric and ocean modules to reanalysis data. The comparative experiment in this typhoon case indicates that the domestic many-core computing platform and the general cluster yield almost no differences in the simulated typhoon path and intensity, and the differences in surface pressure (PSFC) in the WRF model and sea surface temperature (SST) in the short-range forecast are very small, whereas a major difference can be identified at high latitudes after the first 10 days. Further heat budget analysis verifies that the differences in SST after 10 days are mainly caused by shortwave radiation variations, as influenced by subsequently generated typhoons in the system. These typhoons, generated in the hindcast after the first 10 days, follow clearly different trajectories on the two platforms.
Abstract: Cloud computing has taken over the high-performance distributed computing area, and it currently provides on-demand services and resource pooling over the web. As a result of constantly changing user service demand, the task scheduling problem has emerged as a critical analytical topic in cloud computing. The primary goal of scheduling tasks is to distribute tasks to available processors to construct the shortest possible schedule without breaching precedence restrictions. Assignments and schedules of tasks substantially influence system operation in a heterogeneous multiprocessor system. The diverse processes inside a heuristic-based task scheduling method will result in varying makespan in the heterogeneous computing system. As a result, an intelligent scheduling algorithm should efficiently determine the priority of every subtask based on the resources necessary to lower the makespan. This research introduces a novel, efficient task scheduling method for cloud computing systems, based on the cooperation search algorithm, to tackle the task scheduling problem in heterogeneous cloud computing. The basic idea of this method is to use the advantages of meta-heuristic algorithms to get the optimal solution. We assess our algorithm's performance by running it through three scenarios with varying numbers of tasks. The findings demonstrate that the suggested technique beats the existing methods New Genetic Algorithm (NGA), Genetic Algorithm (GA), Whale Optimization Algorithm (WOA), Gravitational Search Algorithm (GSA), and Hybrid Heuristic and Genetic (HHG) by 7.9%, 2.1%, 8.8%, 7.7%, and 3.4%, respectively, in terms of makespan.
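The makespan objective named in this abstract can be illustrated with a small sketch. It is not the cooperation search algorithm from the paper; it only evaluates one candidate schedule, which is the step any such meta-heuristic would repeat. The function name `makespan`, the execution-time table, and the assumption that communication costs are zero are all illustrative choices, not details from the source.

```python
from typing import Dict, List, Sequence


def makespan(order: Sequence[int],
             assignment: Dict[int, int],
             exec_time: List[List[float]],
             preds: Dict[int, List[int]]) -> float:
    """Return the finish time of the last task for one candidate schedule.

    order      -- tasks listed in a precedence-respecting (topological) order
    assignment -- task -> index of the processor it runs on
    exec_time  -- exec_time[task][proc]: run time of the task on that processor
    preds      -- task -> tasks that must finish before it may start
    Assumption: data transfer between processors costs nothing.
    """
    proc_free: Dict[int, float] = {}   # processor -> time it becomes idle
    finish: Dict[int, float] = {}      # task -> finish time
    for t in order:
        p = assignment[t]
        # A task may start once all predecessors are done and its processor is idle.
        ready = max((finish[q] for q in preds.get(t, [])), default=0.0)
        start = max(ready, proc_free.get(p, 0.0))
        finish[t] = start + exec_time[t][p]
        proc_free[p] = finish[t]
    return max(finish.values(), default=0.0)


# Hypothetical 4-task, 2-processor instance: task 0 precedes 1 and 2, which precede 3.
EXEC = [[2, 3], [3, 1], [4, 2], [1, 1]]
PREDS = {1: [0], 2: [0], 3: [1, 2]}
plan_a = makespan([0, 1, 2, 3], {0: 0, 1: 1, 2: 0, 3: 1}, EXEC, PREDS)
plan_b = makespan([0, 1, 2, 3], {0: 0, 1: 1, 2: 1, 3: 1}, EXEC, PREDS)
```

A search procedure would compare values such as `plan_a` and `plan_b` and keep the assignment with the smaller makespan.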
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2021YFB0300101) and the National Natural Science Foundation of China (Grant Nos. 62032023, 61902411, 12002382).
Abstract: As security awareness improves, more advanced and secure encryption algorithms are applied to Microsoft Office to guarantee information security, and people set more complex encryption passwords. However, once the initial password is forgotten, the encrypted information needs to be retrieved. Conventional brute-force cracking methods and password recovery programs can hardly meet actual deciphering needs. To this end, we develop a distributed parallel password recovery program (MT-Office) for Microsoft Office on the domestic heterogeneous multi-core processor (MT-3000). MT-Office takes full advantage of the multi-core and heterogeneous features of MT-3000, and is optimized in both vectorization and global computing. At the same time, MT-Office provides multiple recovery strategies in password generation to improve recovery efficiency. Compared with other platforms (e.g., Intel platforms and FT platforms), the MT-3000 heterogeneous platform achieves a speedup of 60× to 218×. For Office 2010, we perform a strong scalability test on the new-generation supercomputer at the National Supercomputer Center in Tianjin. MT-Office scales to 65,536 acceleration clusters on this system and achieves almost linear speedup. For Office 2007, compared with other password recovery programs, MT-Office achieves a speedup of 2.5× to 131.1×. These results show that MT-Office exploits the advantages of MT-3000 well: it has good scalability and parallelism, deciphers faster, and can be applied in practical engineering applications.
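Abstract: Current research on energy-efficient scheduling of synchronized tasks in multi-core real-time systems mainly targets homogeneous multi-core processor platforms, although heterogeneous multi-core architectures can exploit system performance more effectively. Applying existing approaches directly to heterogeneous multi-core systems leads to higher energy consumption when schedulability is guaranteed. To address this, this work uses dynamic voltage and frequency scaling (DVFS) to study energy-efficient scheduling under task synchronization in heterogeneous multi-core real-time systems, and proposes the Synchronization Aware-Largest Energy Saved First (SA-LESF) algorithm. The algorithm iteratively optimizes the speed configuration of all tasks until every task reaches the speed configuration that saves the most energy. In addition, the Synchronization Aware-Largest Energy Saved First with Dynamic Reclamation (SA-LESF-DR) algorithm, based on dynamic slack reclamation, is proposed. While guaranteeing that real-time tasks remain schedulable, it applies a corresponding reclamation strategy to reduce system energy consumption further. Experimental results show that SA-LESF and SA-LESF-DR have an advantage in energy consumption, saving up to 30% of the energy consumed by other algorithms on the same task sets.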
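The iterative "largest energy saved first" idea can be sketched in a heavily simplified setting. The code below is not SA-LESF: it ignores task synchronization and heterogeneity entirely and assumes a single core, periodic tasks with implicit deadlines, EDF feasibility (total utilization at most 1), and dynamic energy per job proportional to cycles times frequency squared. The function name `configure_speeds` and all parameters are hypothetical.

```python
from typing import List, Tuple


def configure_speeds(tasks: List[Tuple[float, float]],
                     levels: List[float]) -> List[float]:
    """Greedily lower task speeds, largest energy saving first.

    tasks  -- (cycles, period) per task; run time at speed f is cycles / f
    levels -- available normalized frequencies, e.g. [0.5, 1.0]
    Assumed model: one core, EDF test sum(cycles / (f * period)) <= 1,
    and average power of a task proportional to cycles * f**2 / period.
    """
    levels = sorted(levels)
    idx = [len(levels) - 1] * len(tasks)      # every task starts at full speed

    def utilization(ix: List[int]) -> float:
        return sum(c / (levels[k] * T) for (c, T), k in zip(tasks, ix))

    def power(c: float, T: float, f: float) -> float:
        return c * f * f / T

    while True:
        best, best_saving = None, 0.0
        for i, (c, T) in enumerate(tasks):
            if idx[i] == 0:
                continue                       # already at the slowest level
            trial = idx.copy()
            trial[i] -= 1
            if utilization(trial) > 1.0 + 1e-9:
                continue                       # slowing this task breaks feasibility
            saving = power(c, T, levels[idx[i]]) - power(c, T, levels[trial[i]])
            if saving > best_saving:
                best, best_saving = i, saving
        if best is None:                       # no feasible slowdown remains
            return [levels[k] for k in idx]
        idx[best] -= 1


light = configure_speeds([(1.0, 10.0), (2.0, 10.0)], [0.5, 1.0])
heavy = configure_speeds([(4.0, 10.0), (4.0, 10.0)], [0.5, 1.0])
```

The loop stops when no single-step slowdown keeps the task set feasible, which mirrors the abstract's description of iterating until every task reaches its most energy-saving speed configuration.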
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61732014, 61802412, 61671151), the Beijing Natural Science Foundation (No. 4172031), and the SenseTime Young Scholars Research Fund.
Abstract: Heterogeneous processors integrate very distinct compute resources, such as CPUs and GPUs, into the same chip, and can thus exploit the advantages and avoid the disadvantages of those compute units. In this work, we evaluate and analyze eight sparse matrix and graph kernels on an AMD CPU-GPU heterogeneous processor using 956 sparse matrices. Five characteristics, i.e., load balancing, indirect addressing, memory reallocation, atomic operations, and dynamic characteristics, are our major considerations. The experimental results show that although the CPU and GPU parts access the same DRAM, very different performance behaviors are observed. For example, though the GPU part in general outperforms the CPU part, it does not outperform the CPU part in every case. Moreover, the bandwidth utilization of atomic operations on heterogeneous processors can be much higher than that of a high-end discrete GPU.
Funding: Supported by the National Key R&D Program of China (Grant No. 2017YFB02-02004), the Project of Manned Space Engineering Technology (2018-14) "Large-scale parallel computation of aerodynamic problems of irregular spacecraft reentry covering various flow regimes", and the National Natural Science Foundation of China (91530319).
Abstract: OpenACC has become a popular programming interface for many-core application programming. Internationally, a lot of research has been done on OpenACC for CPU+GPU heterogeneous many-core architectures. Among the available compilers, the PGI OpenACC compiler developed by NVIDIA is the most advanced one. However, there is little research on OpenACC for the Home Grown Heterogeneous Many-Core (HGHM) architecture, which is different from a GPU. This paper proposes an automatic mapping technique, based on the OpenACC compiler, for mapping OpenACC kernel code to a heterogeneous and deeply fused many-core architecture. Our approach uses the static analysis and feedback-based dynamic analysis of the compiler to automatically map the parallel kernel code of a program to many-core devices, and it greatly improves the transformation quality of the compiler. Experimental results show that this technique can greatly improve the efficiency of using OpenACC to port applications to heterogeneous and fused many-core systems without impacting program acceleration performance.
Funding: Partially funded by the National Key Research and Development Program of China under Grant No. 2018YFB0204301, the National Natural Science Foundation of China under Grant agreements 61972408, 61602501, and 61872294, and a UK Royal Society International Collaboration Grant.
Abstract: Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high performance, such potential can only be unlocked if the application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey of parallel programming models for heterogeneous many-core architectures and review the compiling techniques for improving programmability and portability. We examine various software optimization techniques for minimizing the communication overhead between heterogeneous computing devices. We provide a road map for a wide variety of different research areas. We conclude with a discussion on open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.
Funding: Supported by the National Natural Science Foundation of China (Nos. 60633060, 60606008, and 60576031), the National Key Basic Research and Development (973) Program of China (Nos. 2005CB321604 and 2005CB321605), and the fund of the Chinese Academy of Sciences (No. 20074010) due to the President Scholarship.
Abstract: As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve the yield of N-core processors by providing M spare cores. In such an architecture, topology is an important factor because it greatly affects the processors' performance. The concept of logical topology and a topology reconfiguration problem are introduced, with the goal of transparently providing the target topology with the lowest performance degradation in the presence of faulty cores on chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that RRCS can give solutions with an average degradation of 13.8% in negligible computing time.
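Why M spare cores raise yield can be shown with a short calculation. The model below is an assumption for illustration, not the paper's: each core is fault-free independently with the same probability, the interconnect never fails, and the chip is usable whenever at least N of the N+M cores work. The function name `chip_yield` and the sample numbers are hypothetical.

```python
from math import comb


def chip_yield(n: int, m: int, core_yield: float) -> float:
    """Probability that at least n of the n + m cores are fault-free.

    Assumes independent, identically distributed core faults (binomial model).
    """
    total = n + m
    return sum(
        comb(total, k) * core_yield ** k * (1.0 - core_yield) ** (total - k)
        for k in range(n, total + 1)
    )


# Hypothetical comparison: a 16-core chip with and without 2 spare cores.
without_spares = chip_yield(16, 0, 0.95)
with_spares = chip_yield(16, 2, 0.95)
```

Under this model, adding spares can only increase the probability of obtaining N working cores; the topology reconfiguration problem in the abstract then concerns how to arrange the surviving cores.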
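Abstract: To study the potential of heterogeneous multi-processor systems on chip (MPSoC) in intensive parallel computing tasks, this paper designs and implements a dynamic scheduling coprocessor for heterogeneous multi-core systems, suited to coarse-grained data and targeting task-level parallel applications. It adopts mechanisms such as on-chip caching, multi-level write-back management of task outputs, automatic task mapping, and out-of-order execution of communication tasks. Experimental results show that the dynamic scheduling coprocessor not only achieves basic design goals such as task-level out-of-order execution, but also has very low scheduling overhead: compared with a scheduler based on the dynamic scoreboard algorithm, the time to run multiple sub-aperture range compression algorithms is reduced by up to 17.13%. The results demonstrate that the proposed coprocessor effectively optimizes task scheduling in the target scenario.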
Abstract: Purpose: The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach: For distributed genetic algorithm (GA) models, the paper proposes a method where an island's ID number is added to the header of data transferred by this island for use in fault detection. Findings: The paper shows that the processing time of the proposed idea is practically negligible in applications, that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value: The study described in this paper is a new approach to increasing the sustainability of application programs using distributed GA on GPUs and MCPs.
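The header-based fault detection described under Design/methodology/approach might look roughly like the sketch below. The abstract states only that the sending island's ID is added to the header of transferred data; the `Migration` type, the `send` and `accept` helpers, and the specific receiver-side checks (ID in range, ID equal to the expected neighbor) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Migration:
    island_id: int                  # header: ID of the island that sent the data
    individuals: List[List[int]]    # payload: migrating individuals (bit strings)


def send(island_id: int, individuals: List[List[int]]) -> Migration:
    """Package migrants with the sender's island ID in the header."""
    return Migration(island_id, [ind[:] for ind in individuals])


def accept(msg: Migration, expected_sender: int, num_islands: int) -> List[List[int]]:
    """Return the migrants if the header looks valid, else discard them.

    Assumed policy: a header that is out of range or does not match the
    expected neighbor is treated as a sign of a faulty transfer.
    """
    if not 0 <= msg.island_id < num_islands:
        return []
    if msg.island_id != expected_sender:
        return []
    return msg.individuals


good = accept(send(2, [[1, 0, 1]]), expected_sender=2, num_islands=4)
corrupted = accept(Migration(7, [[1, 0, 1]]), expected_sender=2, num_islands=4)
```

Discarding suspect migrants lets the receiving island continue evolving with its own population, which fits the abstract's finding that a solution can still be reached despite a single fault.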