With the improvement of security awareness,in order to guarantee information security,more advanced and secure encryption algorithms are applied to Microsoft Office.People also set more complex encryption passwords.Ho...With the improvement of security awareness,in order to guarantee information security,more advanced and secure encryption algorithms are applied to Microsoft Office.People also set more complex encryption passwords.However,once the initial password is forgotten,the encrypted information needs to be retrieved.The conventional brute force cracking methods and password recovery programs can hardly meet the actual deciphering needs.To this end,we develop a distributed parallel password recovery program(MT-Office)for Microsoft Office on the domestic heterogeneous multi-core processor(MT-3000).MT-Office takes full advantage of the multi-core and heterogeneous features of MT-3000,and is optimized and improved in both vectorization and global computing.At the same time,MT-Office provides multiple recovery strategies in password generation to improve the recovery efficiency.Compared with other platforms(e.g.,Intel platforms and FT platforms),MT-3000 heterogeneous platform can achieve 60×–218×speedup ratio.For Office2010,we perform a strong scalability test on the new-generation supercomputer in National Supercomputer Center in Tianjin.MT-Office not only extends to 65,536 acceleration clusters on this system,shows good scalability,but also achieves almost linear speedup ratio.For Office2007,compared with other password recovery programs,MT-Office can achieve 2.5×–131.1×speedup ratio.It can be seen that MT-Office can better exploit the advantages of MT-3000,which not only has good scalability and parallelism,but also has faster deciphering speed and can be applied to practical engineering application.展开更多
Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t...Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.展开更多
Cloud computing has taken over the high-performance distributed computing area,and it currently provides on-demand services and resource polling over the web.As a result of constantly changing user service demand,the ...Cloud computing has taken over the high-performance distributed computing area,and it currently provides on-demand services and resource polling over the web.As a result of constantly changing user service demand,the task scheduling problem has emerged as a critical analytical topic in cloud computing.The primary goal of scheduling tasks is to distribute tasks to available processors to construct the shortest possible schedule without breaching precedence restrictions.Assignments and schedules of tasks substantially influence system operation in a heterogeneous multiprocessor system.The diverse processes inside the heuristic-based task scheduling method will result in varying makespan in the heterogeneous computing system.As a result,an intelligent scheduling algorithm should efficiently determine the priority of every subtask based on the resources necessary to lower the makespan.This research introduced a novel efficient scheduling task method in cloud computing systems based on the cooperation search algorithm to tackle an essential task and schedule a heterogeneous cloud computing problem.The basic idea of thismethod is to use the advantages of meta-heuristic algorithms to get the optimal solution.We assess our algorithm’s performance by running it through three scenarios with varying numbers of tasks.The findings demonstrate that the suggested technique beats existingmethods NewGenetic Algorithm(NGA),Genetic Algorithm(GA),Whale Optimization Algorithm(WOA),Gravitational Search Algorithm(GSA),and Hybrid Heuristic and Genetic(HHG)by 7.9%,2.1%,8.8%,7.7%,3.4%respectively according to makespan.展开更多
Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to acc...Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.展开更多
目前,多核实时系统中同步任务的节能调度研究主要针对的是同构多核处理器平台,而异构多核处理器架构能够更有效地发挥系统性能。将现有的研究直接应用于异构多核系统,在保证可调度性的情况下会导致能耗变高。对此,通过使用动态电压与频...目前,多核实时系统中同步任务的节能调度研究主要针对的是同构多核处理器平台,而异构多核处理器架构能够更有效地发挥系统性能。将现有的研究直接应用于异构多核系统,在保证可调度性的情况下会导致能耗变高。对此,通过使用动态电压与频率调节(Dynamic Voltage Frequency Scaling,DVFS)技术,研究异构多核实时系统中基于任务同步的节能调度问题,提出同步感知的最大能耗节省优先算法(Synchronization Aware-Largest Energy Saved First,SA-LESF)。该算法针对所有任务的速度配置进行迭代优化,直至所有任务均达到其最大限度节能的速度配置。此外,进一步提出基于动态松弛时间回收的同步感知最大能耗节省优先算法(Synchronization Aware-Largest Energy Saved First with Dynamic Reclamation,SA-LESF-DR)。该算法在保证实时任务可调度的同时,实施相应的回收策略,进一步降低系统能耗。实验结果表明,SA-LESF与SA-LESF-DR算法在能耗表现上具有优势,在相同任务集下,相比其他算法可节省高达30%的能耗。展开更多
Heterogeneous processors integrate very distinct compute resources such as CPUs and GPUs into the same chip,thus can exploit the advantages and avoid disadvantages of those compute units.We in this work evaluate and a...Heterogeneous processors integrate very distinct compute resources such as CPUs and GPUs into the same chip,thus can exploit the advantages and avoid disadvantages of those compute units.We in this work evaluate and analyze eight sparse matrix and graph kernels on an AMD CPU-GPU heterogeneous processor by using 956 sparse matrices.Five characteristics,i.e.,load balancing,indirect addressing,memory reallocation,atomic operations,and dynamic characteristics are our major considerations.The experimental results show that although the CPU and GPU parts access the same DRAM,very different performance behaviors are observed.For example,though the GPU part in general outperforms the CPU part,it cannot achieve the best performance in all cases given by the CPU part.Moreover,the bandwidth utilization of atomic operations on heterogeneous processors can be much higher than a high-end discrete GPU.展开更多
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturin...Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches, which do not consider such heterogeneity caused by within-die variations,can lead to an overly pessimistic result in terms of performance. To realize an optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining(HASR) scheme by fully exploiting the heterogeneities of homogeneous multi-core processors in embedded domains.We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme,representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications are designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average value of 22.2%, compared with the baseline schedule based on the worst case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%.展开更多
Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal pro...Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal processing, this paper presents a novel heterogeneous multi-core architecture digital signal processor (DSP), YHFT-QDSP, with one RISC CPU core and 4 VLIW DSP cores. By three kinds of interconnection, YHFT-QDSP provides high efficiency message communication for inner-chip RISC core and DSP cores, inner-chip and inter-chip DSP cores. A parallel programming platform is specifically developed for the heterogeneous nmlti-core architecture of YHFT-QDSP. This parallel programming environment provides a parallel support library and a friendly interface between high level application softwares and multi- core DSP. The 130 nm CMOS custom chip design results benchmarks show that the interconnection structure of in a high speed and moderate power design. The results of typical YHFT-QDSP is much better than other related structures and achieves better speedup when using the interconnection facilities in combing methods. YHFT-QDSP has been signed off and manufactured presently. The future applications of the multi-core chip could be found in 3G wireless base station, high performance radar, industrial applications, and so on.展开更多
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which h...Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing ele- ments (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.展开更多
Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This pa...Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem.展开更多
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer ...This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, DEM algorithm is shown to have high scalabilitv in a multi-thread parallel computation on a CPU.展开更多
为研究异构多核片上系统(multi-processor system on chip,MPSoC)在密集并行计算任务中的潜力,文章设计并实现了一种适用于粗粒度数据特征、面向任务级并行应用的异构多核系统动态调度协处理器,采用了片上缓存、任务输出的多级写回管理...为研究异构多核片上系统(multi-processor system on chip,MPSoC)在密集并行计算任务中的潜力,文章设计并实现了一种适用于粗粒度数据特征、面向任务级并行应用的异构多核系统动态调度协处理器,采用了片上缓存、任务输出的多级写回管理、任务自动映射、通讯任务乱序执行等机制。实验结果表明,该动态调度协处理器不仅能够实现任务级乱序执行等基本设计目标,还具有极低的调度开销,相较于基于动态记分牌算法的调度器,运行多个子孔径距离压缩算法的时间降低达17.13%。研究结果证明文章设计的动态调度协处理器能够有效优化目标场景下的任务调度效果。展开更多
基金supported by the National Key Research and Development Program of China(Grant No.2021YFB0300101)the National Natural Science Foundation of China(Grant No.62032023,61902411,12002382)。
文摘With the improvement of security awareness,in order to guarantee information security,more advanced and secure encryption algorithms are applied to Microsoft Office.People also set more complex encryption passwords.However,once the initial password is forgotten,the encrypted information needs to be retrieved.The conventional brute force cracking methods and password recovery programs can hardly meet the actual deciphering needs.To this end,we develop a distributed parallel password recovery program(MT-Office)for Microsoft Office on the domestic heterogeneous multi-core processor(MT-3000).MT-Office takes full advantage of the multi-core and heterogeneous features of MT-3000,and is optimized and improved in both vectorization and global computing.At the same time,MT-Office provides multiple recovery strategies in password generation to improve the recovery efficiency.Compared with other platforms(e.g.,Intel platforms and FT platforms),MT-3000 heterogeneous platform can achieve 60×–218×speedup ratio.For Office2010,we perform a strong scalability test on the new-generation supercomputer in National Supercomputer Center in Tianjin.MT-Office not only extends to 65,536 acceleration clusters on this system,shows good scalability,but also achieves almost linear speedup ratio.For Office2007,compared with other password recovery programs,MT-Office can achieve 2.5×–131.1×speedup ratio.It can be seen that MT-Office can better exploit the advantages of MT-3000,which not only has good scalability and parallelism,but also has faster deciphering speed and can be applied to practical engineering application.
基金Project(2008AA01A201) supported the National High-tech Research and Development Program of ChinaProjects(60833004, 60633050) supported by the National Natural Science Foundation of China
文摘Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.
文摘Cloud computing has taken over the high-performance distributed computing area,and it currently provides on-demand services and resource polling over the web.As a result of constantly changing user service demand,the task scheduling problem has emerged as a critical analytical topic in cloud computing.The primary goal of scheduling tasks is to distribute tasks to available processors to construct the shortest possible schedule without breaching precedence restrictions.Assignments and schedules of tasks substantially influence system operation in a heterogeneous multiprocessor system.The diverse processes inside the heuristic-based task scheduling method will result in varying makespan in the heterogeneous computing system.As a result,an intelligent scheduling algorithm should efficiently determine the priority of every subtask based on the resources necessary to lower the makespan.This research introduced a novel efficient scheduling task method in cloud computing systems based on the cooperation search algorithm to tackle an essential task and schedule a heterogeneous cloud computing problem.The basic idea of thismethod is to use the advantages of meta-heuristic algorithms to get the optimal solution.We assess our algorithm’s performance by running it through three scenarios with varying numbers of tasks.The findings demonstrate that the suggested technique beats existingmethods NewGenetic Algorithm(NGA),Genetic Algorithm(GA),Whale Optimization Algorithm(WOA),Gravitational Search Algorithm(GSA),and Hybrid Heuristic and Genetic(HHG)by 7.9%,2.1%,8.8%,7.7%,3.4%respectively according to makespan.
文摘Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.
文摘目前,多核实时系统中同步任务的节能调度研究主要针对的是同构多核处理器平台,而异构多核处理器架构能够更有效地发挥系统性能。将现有的研究直接应用于异构多核系统,在保证可调度性的情况下会导致能耗变高。对此,通过使用动态电压与频率调节(Dynamic Voltage Frequency Scaling,DVFS)技术,研究异构多核实时系统中基于任务同步的节能调度问题,提出同步感知的最大能耗节省优先算法(Synchronization Aware-Largest Energy Saved First,SA-LESF)。该算法针对所有任务的速度配置进行迭代优化,直至所有任务均达到其最大限度节能的速度配置。此外,进一步提出基于动态松弛时间回收的同步感知最大能耗节省优先算法(Synchronization Aware-Largest Energy Saved First with Dynamic Reclamation,SA-LESF-DR)。该算法在保证实时任务可调度的同时,实施相应的回收策略,进一步降低系统能耗。实验结果表明,SA-LESF与SA-LESF-DR算法在能耗表现上具有优势,在相同任务集下,相比其他算法可节省高达30%的能耗。
基金supported by the National Natural Science Foundation of China(Grant nos.61732014,61802412,61671151)Beijing Natural Science Foundation(no.4172031)SenseTime Young Scholars Research Fund.
文摘Heterogeneous processors integrate very distinct compute resources such as CPUs and GPUs into the same chip,thus can exploit the advantages and avoid disadvantages of those compute units.We in this work evaluate and analyze eight sparse matrix and graph kernels on an AMD CPU-GPU heterogeneous processor by using 956 sparse matrices.Five characteristics,i.e.,load balancing,indirect addressing,memory reallocation,atomic operations,and dynamic characteristics are our major considerations.The experimental results show that although the CPU and GPU parts access the same DRAM,very different performance behaviors are observed.For example,though the GPU part in general outperforms the CPU part,it cannot achieve the best performance in all cases given by the CPU part.Moreover,the bandwidth utilization of atomic operations on heterogeneous processors can be much higher than a high-end discrete GPU.
基金Project supported by the National Natural Science Foundation of China(Nos.6122500861373074+3 种基金and 61373090)the National Basic Research Program(973)of China(No.2014CB349304)the Specialized Research Fund for the Doctoral Program of Higher Education,the Ministry of Education of China(No.20120002110033)the Tsinghua University Initiative Scientific Research Program
文摘Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches, which do not consider such heterogeneity caused by within-die variations,can lead to an overly pessimistic result in terms of performance. To realize an optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining(HASR) scheme by fully exploiting the heterogeneities of homogeneous multi-core processors in embedded domains.We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme,representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications are designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average value of 22.2%, compared with the baseline schedule based on the worst case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%.
基金supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant No.2009ZX01034-001-001-006the National High Technology Research and Development 863 Program of China under Grant No.2007AA01Z108the Program for Changjiang Scholars and Innovative Research Team in Universities of China under Grant No.IRT0614.
文摘Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal processing, this paper presents a novel heterogeneous multi-core architecture digital signal processor (DSP), YHFT-QDSP, with one RISC CPU core and 4 VLIW DSP cores. By three kinds of interconnection, YHFT-QDSP provides high efficiency message communication for inner-chip RISC core and DSP cores, inner-chip and inter-chip DSP cores. A parallel programming platform is specifically developed for the heterogeneous nmlti-core architecture of YHFT-QDSP. This parallel programming environment provides a parallel support library and a friendly interface between high level application softwares and multi- core DSP. The 130 nm CMOS custom chip design results benchmarks show that the interconnection structure of in a high speed and moderate power design. The results of typical YHFT-QDSP is much better than other related structures and achieves better speedup when using the interconnection facilities in combing methods. YHFT-QDSP has been signed off and manufactured presently. The future applications of the multi-core chip could be found in 3G wireless base station, high performance radar, industrial applications, and so on.
文摘Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing ele- ments (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
基金Supported by the National High Technology Development 863 Program of China under Grant No.2008AA010901the National Natural Science Foundation of China under Grant Nos.60736012 and 60673146the National Basic Research 973 Program of China under Grant No.2005CB321601.
文摘Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem.
文摘This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, DEM algorithm is shown to have high scalabilitv in a multi-thread parallel computation on a CPU.
文摘为研究异构多核片上系统(multi-processor system on chip,MPSoC)在密集并行计算任务中的潜力,文章设计并实现了一种适用于粗粒度数据特征、面向任务级并行应用的异构多核系统动态调度协处理器,采用了片上缓存、任务输出的多级写回管理、任务自动映射、通讯任务乱序执行等机制。实验结果表明,该动态调度协处理器不仅能够实现任务级乱序执行等基本设计目标,还具有极低的调度开销,相较于基于动态记分牌算法的调度器,运行多个子孔径距离压缩算法的时间降低达17.13%。研究结果证明文章设计的动态调度协处理器能够有效优化目标场景下的任务调度效果。