In many clusters connected by high-speed communication networks, broadcast implementations ignore the exact structure of the underlying network and the latency differences between sender-receiver pairs; the broadcast method in MPICH, a widely used MPI implementation, takes this approach. However, cluster network topologies are becoming increasingly complicated, and traditional broadcast algorithms such as MPICH's MPI_Bcast perform poorly on them. This paper analyzes the impact of communication latencies and the underlying topology on the performance of broadcast algorithms for multilevel clusters. A multilevel model is developed for broadcasting in clusters with complicated topologies; it divides the cluster into levels based on the underlying topology. The model is used to develop a new broadcast algorithm, MLM broadcast-2 (MLMB-2), that adapts to a wide range of clusters. Compared with its MPI counterpart MPI_Bcast, MLMB-2 reduces broadcast running time by 60%-90%.
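To illustrate why a topology-aware broadcast can beat a topology-oblivious one, here is a minimal latency-only sketch. It is not MLMB-2, whose details the abstract does not give. The two-level cluster shape, the function names, and the latency values are all assumptions made for illustration.

```python
def binomial_rounds(n: int) -> int:
    """Rounds a binomial-tree broadcast needs to reach n processes: ceil(log2(n))."""
    return (n - 1).bit_length()


def flat_broadcast_time(groups: int, nodes_per_group: int, l_inter: float) -> float:
    """Pessimistic bound for a tree that ignores topology:
    every round may cross a slow inter-group link."""
    return binomial_rounds(groups * nodes_per_group) * l_inter


def two_level_broadcast_time(groups: int, nodes_per_group: int,
                             l_intra: float, l_inter: float) -> float:
    """Broadcast among one leader per group first (slow links),
    then inside every group in parallel (fast links)."""
    return (binomial_rounds(groups) * l_inter
            + binomial_rounds(nodes_per_group) * l_intra)


# Hypothetical cluster: 8 groups of 16 nodes,
# 5 us latency inside a group, 50 us between groups.
flat = flat_broadcast_time(8, 16, l_inter=50.0)
two_level = two_level_broadcast_time(8, 16, l_intra=5.0, l_inter=50.0)
print(flat, two_level)
```

Under these assumed numbers the flat bound is 350 µs and the two-level schedule takes 170 µs, because only three of its seven rounds use the slow inter-group links.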
The performance and energy consumption of high performance computing (HPC) interconnection networks matter greatly to the supercomputer as a whole, and an HPC interconnection network simulation platform is important for research on HPC software and hardware technologies. To evaluate the performance and energy consumption of HPC interconnection networks effectively, this article designs and implements a detailed, clock-driven HPC interconnection network simulation platform called HPC-NetSim. HPC-NetSim uses application-driven workloads and inherits the characteristics of a detailed and flexible cycle-accurate network simulator. It also offers a large set of configurable network parameters for topology and routing, and supports router on/off states. We compare the simulated execution time with the real execution time on a Tianhe-2 subsystem; the mean error is only 2.7%. In addition, we simulate network behavior under different network structures and low-power modes. The results are consistent with the theoretical analyses.
This paper analyzes the physical potential, computing performance benefit, and power consumption of optical interconnects. A physical factor analysis shows that optical interconnects have clear advantages over electrical ones. Because recent developments raise the question of whether optical interconnect technologies, which offer higher bandwidth at higher cost, are worth deploying, a computing performance comparison is also performed. To meet the increasing demand of large-scale parallel or multi-processor computing tasks, the paper proposes an analytic method for evaluating the parallel computing performance of interconnect systems. Both a bandwidth-limited model and a full-bandwidth model are investigated. Speedup and efficiency are selected to represent the parallel performance of an interconnect system. Using the proposed models, we depict the performance gap between optically and electrically interconnected systems. A separate investigation of the power consumption of commercial products shows that deploying parallel interconnections reduces the unit power consumption. From the analysis of computing impact and power dissipation, we conclude that parallel optical interconnects are a valuable combination of high performance and low energy consumption. For data centers under construction, substantial power could be saved if parallel optical interconnect technologies are used.
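Speedup and efficiency, the two metrics named above, have standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the processor count. The sketch below computes both. The `parallel_time` function is a toy model assumed here for illustration (perfectly divisible compute plus one serialized transfer); it is not the paper's bandwidth-limited or full-bandwidth model, and all numbers are hypothetical.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S = T(1) / T(p)."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """E = S / p."""
    return speedup(t_serial, t_parallel) / processors


def parallel_time(work_s: float, processors: int,
                  traffic_gb: float, bandwidth_gb_s: float) -> float:
    """Toy model (assumption): compute divides evenly across processors,
    then traffic_gb is moved over a link of bandwidth_gb_s."""
    return work_s / processors + traffic_gb / bandwidth_gb_s


# Hypothetical workload: 100 s of serial work, 16 processors, 50 GB of traffic.
work_s, processors, traffic_gb = 100.0, 16, 50.0
for bandwidth_gb_s in (10.0, 40.0):  # e.g. a slower and a faster interconnect
    t_p = parallel_time(work_s, processors, traffic_gb, bandwidth_gb_s)
    print(bandwidth_gb_s,
          round(speedup(work_s, t_p), 2),
          round(efficiency(work_s, t_p, processors), 3))
```

In this toy setting, raising the link bandwidth from 10 to 40 GB/s lifts the speedup from about 8.9 to about 13.3 on the same 16 processors, which is the kind of gap such a comparison is meant to expose.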
The teracluster LSSC-II, installed at the State Key Laboratory of Scientific and Engineering Computing, Chinese Academy of Sciences, is one of the most powerful PC clusters in China. It has a peak performance of 2 Tflops. With a Linpack performance of 1.04 Tflops, it ranked 43rd in the 20th TOP500 List (November 2002) and 51st in the 21st TOP500 List (June 2003); with a new Linpack performance of 1.3 Tflops, it ranked 82nd in the 22nd TOP500 List (November 2003). In this paper, we present some design principles of this cluster, as well as its applications in some large-scale numerical simulations.
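The figures above can be read as Linpack efficiency, the ratio of sustained Linpack performance (Rmax) to peak performance (Rpeak). The short sketch below does that arithmetic; it assumes the 2 Tflops peak applies to both Linpack runs, which the abstract does not state explicitly.

```python
def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Fraction of peak performance sustained on the Linpack benchmark."""
    return rmax_tflops / rpeak_tflops


RPEAK_TFLOPS = 2.0  # peak performance quoted in the abstract

# Assumption: the same peak figure applies to both measurements.
for label, rmax in (("November 2002", 1.04), ("November 2003", 1.3)):
    print(f"{label}: {linpack_efficiency(rmax, RPEAK_TFLOPS):.0%}")
```

Under that assumption the cluster sustained 52% of peak in the first run and 65% in the later one.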
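Wide-area, long-distance, high-performance transmission technology has important strategic value as China's "East Data, West Computing" project builds a nationally integrated computing power network. Three trends place new demands on the paradigm of wide-area distributed computing power collaboration: the rise of artificial intelligence (AI) large-model applications with extremely high demands on computing resources; embargoes on high-end, high-performance graphics processing unit (GPU) chips, which limit the computing resources of any single center; and the dispersed distribution of computing power that results from computing clusters being built across China. Wide-area, long-distance, high-performance transmission is the key technology for this new paradigm. The paper discusses five aspects: support for the new wide-area distributed computing collaboration paradigm, technical routes, bearer networks, research difficulties, and cost. Drawing on results from a 2100 km live-network experiment between Shenzhen and Zhongwei, Ningxia, it concludes that optimizing existing remote direct memory access (RDMA) technology for long distances over a wide-area all-optical network is one of the best options in the short term: highly feasible, low in cost, and convenient for research. By optimizing RDMA over Converged Ethernet (RoCE), "wide-area optical-data direct access" can be achieved on a wide-area all-optical network, with performance approaching physical-layer communication limits.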
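One reason RDMA needs long-distance optimization is the bandwidth-delay product: to keep a link full, a sender must have roughly one round trip's worth of data in flight. The sketch below estimates it for the 2100 km distance mentioned above. The fiber propagation speed and the 100 Gbit/s link rate are assumptions for illustration, not values from the paper, and switching and queuing delays are ignored.

```python
FIBER_SPEED_M_S = 2.0e8  # assumption: roughly two thirds of c in silica fiber


def round_trip_time_s(distance_km: float) -> float:
    """Propagation-only round-trip time over a fiber of the given length."""
    return 2 * distance_km * 1000 / FIBER_SPEED_M_S


def bandwidth_delay_product_bytes(distance_km: float, link_gbit_s: float) -> float:
    """Bytes that must be in flight to keep the link fully utilized."""
    return link_gbit_s * 1e9 * round_trip_time_s(distance_km) / 8


DISTANCE_KM = 2100       # Shenzhen to Zhongwei, from the abstract
LINK_GBIT_S = 100        # hypothetical link rate

rtt_ms = round_trip_time_s(DISTANCE_KM) * 1e3
bdp_mb = bandwidth_delay_product_bytes(DISTANCE_KM, LINK_GBIT_S) / 1e6
print(f"RTT ~ {rtt_ms:.1f} ms, BDP ~ {bdp_mb:.1f} MB")
```

With these assumed values the round trip is about 21 ms and the in-flight data about 262.5 MB, far more than the buffering and windowing that data-center RDMA deployments are typically sized for.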
Low temperature complementary metal oxide semiconductor (CMOS), or cryogenic CMOS, is a promising avenue for continuing Moore's law while serving the needs of high performance computing. With temperature as a control "knob" to steepen the subthreshold slope of CMOS devices, the supply voltage can be reduced with no impact on operating speed. With optimal threshold voltage engineering, the device ON current can be further enhanced, translating to higher performance. In this article, experimentally calibrated data are used to tune the threshold voltage and to investigate the power, performance, and area of cryogenic CMOS at the device, circuit, and system levels. We also present results from the measurement and analysis of functional memory chips fabricated in 28 nm bulk CMOS and 22 nm fully depleted silicon on insulator (FDSOI) operating at cryogenic temperature. Finally, the challenges and opportunities in the further development and deployment of such systems are discussed.
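The temperature "knob" can be made concrete with the textbook thermal limit of the subthreshold swing, SS = n · ln(10) · kT/q, where n ≥ 1 is the body factor. The sketch below evaluates this ideal limit; it is a standard device-physics formula rather than the article's calibrated model, and measured devices typically do not reach the ideal value at very low temperatures.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def subthreshold_swing_mv_per_decade(temperature_k: float,
                                     body_factor: float = 1.0) -> float:
    """Ideal thermal limit SS = n * ln(10) * kT/q, in mV per decade of current."""
    return body_factor * math.log(10) * K_B * temperature_k / Q_E * 1e3


# Room temperature versus liquid-nitrogen temperature, ideal n = 1.
for temperature_k in (300.0, 77.0):
    ss = subthreshold_swing_mv_per_decade(temperature_k)
    print(f"{temperature_k:.0f} K: {ss:.1f} mV/decade")
```

The ideal swing drops from about 60 mV/decade at 300 K to about 15 mV/decade at 77 K, so the same on/off current ratio can be reached with a smaller voltage swing, which is what allows the supply voltage to be lowered.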
Funding (MLMB-2 broadcast paper): Supported by the National Natural Science Foundation of China (No. 60103019) and the National High-Tech Research and Development Program of China (No. 2001AA111110).
Funding (optical interconnects paper): Supported in part by the National 863 Program (Nos. 2009AA01Z256 and 2009AA01A345), the National 973 Program (No. 2007CB310705), and the NSFC (No. 60932004), P.R. China.
Funding (LSSC-II paper): Supported by the Special Funds for the Major State Basic Research Projects (Grant No. G19990328) and partly supported by the National Natural Science Foundation of China (Grant No. 40004003).
Funding (cryogenic CMOS paper): Funded by the Defense Advanced Research Projects Agency (DARPA) Low Temperature Logic Technology (LTLT) program.