Abstract: I. Introduction. The 80386 32-bit microprocessor can interface with static RAM (SRAM), dynamic RAM (DRAM), and cached memory systems (i.e., memory systems composed of a small amount of fast SRAM and a large amount of slower DRAM). Because DRAM requires precharge time between accesses and periodic refresh, its data transfer rate is usually lower than that of SRAM. However, DRAM can be used to build large-capacity memory systems at low cost, and it is therefore widely used. II. Multi-bank interleaved memory. DRAM requires a short idle interval between two consecutive access operations; if this idle time is not provided, the data stored in the DRAM will be lost. If a single group of DRAM chips is accessed consecutively.
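The interleaving idea the abstract introduces can be sketched briefly. In this illustrative example (not taken from the article), the low-order address bits select the bank, so consecutive word accesses rotate across banks, and each DRAM bank gets its precharge/recovery time while the other banks are being accessed. The 4-way interleaving factor is an assumption for illustration.

```python
NUM_BANKS = 4  # assumed 4-way interleaving

def bank_and_offset(word_addr):
    """Map a word address to (bank index, offset within bank).

    Low-order bits pick the bank, so address a and address a+1
    land in different banks.
    """
    return word_addr % NUM_BANKS, word_addr // NUM_BANKS

# Sequential word accesses visit banks 0, 1, 2, 3, 0, 1, ... in turn,
# so no single bank sees two back-to-back accesses.
sequence = [bank_and_offset(a)[0] for a in range(8)]
# sequence == [0, 1, 2, 3, 0, 1, 2, 3]
```

With this mapping, a burst of sequential reads never revisits a bank until the other three have been accessed, which hides each bank's precharge time behind the accesses to its neighbors.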
Fund: Supported by the National Natural Science Foundation of China (61272110)
Abstract: Following the problems of performance and energy overhead, system reliability problems caused by soft errors have become a growing concern. Since the register file (RF) is the hottest component in the processor, soft errors occurring in it, if it is not well protected, will greatly harm system reliability. In order to reduce the soft error occurrence rate of the register file, this paper presents a method to reallocate registers based on the fact that different live variables contribute differently to register file vulnerability (RFV). Our experimental results on benchmarks from the MiBench suite indicate that our method can significantly enhance reliability.
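As a rough illustration of vulnerability-aware reallocation (the paper's actual RFV model is not reproduced here), one can score each live variable by its live-interval length, on the assumption that longer residency in a register means more exposure to soft errors, and then give the highest-exposure variables priority for protected placement. The interval-based scoring and the `num_protected` parameter are hypothetical stand-ins for the paper's metric.

```python
def reallocate(live_intervals, num_protected):
    """Pick which live variables to place in protected registers.

    live_intervals: {var: (start, end)} in instruction numbers.
    Returns the set of variables with the longest live intervals,
    i.e., the greatest assumed exposure to soft errors.
    """
    by_exposure = sorted(
        live_intervals,
        key=lambda v: live_intervals[v][1] - live_intervals[v][0],
        reverse=True,
    )
    return set(by_exposure[:num_protected])

# "a" lives for 90 instructions and "c" for 65, so they are chosen
# ahead of "b", which lives for only 10.
protected = reallocate({"a": (0, 90), "b": (10, 20), "c": (5, 70)}, 2)
# protected == {"a", "c"}
```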
Fund: This work was supported by the National Natural Science Foundation of China under Grant No. 61300005.
Abstract: The key to high performance for the GPU architecture lies in its massive threading capability, which drives a large number of cores and enables execution overlapping among threads. However, in reality, the number of threads that can execute simultaneously is often limited by the size of the register file on GPUs. The traditional SRAM-based register file occupies so much chip area that it cannot scale to meet the increasing demands of GPU applications. Racetrack memory (RM) is a promising technology for designing a large-capacity register file on GPUs due to its high data storage density. However, without careful deployment of an RM-based register file, the lengthy shift operations of RM may hurt performance. In this paper, we explore RM for designing a high-performance register file for the GPU architecture. High-storage-density RM helps to improve thread level parallelism (TLP), but if the bits of a register are not aligned to the ports, shift operations are required to move the bits to the access ports before they can be accessed, and thus the read/write operations are delayed. We develop an optimization framework for RM-based register files on GPUs, which employs three different optimization techniques at the application, compilation, and architecture levels, respectively. Specifically, we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Collectively, these optimizations help to determine a TLP that avoids cache and register file resource contention and to reduce the shift operation overhead. Experimental results using a variety of representative workloads demonstrate that our optimization framework achieves up to 29% (21% on average) performance improvement.
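The compilation-level idea can be sketched with a toy greedy mapping (an illustrative assumption, not the paper's actual algorithm): frequently accessed registers are placed at racetrack positions closest to an access port, so that the hottest registers need the fewest shift operations before each access. The port layout and the access-count inputs here are hypothetical.

```python
def map_registers(access_counts, ports, num_positions):
    """Greedily map registers to racetrack positions.

    access_counts: {register_name: access frequency}.
    ports: list of position indices that have an access port.
    The hottest registers get the positions with the smallest
    shift distance to any port.
    """
    def shift_distance(pos):
        # Number of shifts needed to bring this position under a port.
        return min(abs(pos - p) for p in ports)

    positions = sorted(range(num_positions), key=shift_distance)
    hot_first = sorted(access_counts, key=access_counts.get, reverse=True)
    return dict(zip(hot_first, positions))

# With one port at position 0, the most-accessed register r0 is placed
# directly under the port and needs no shifts at all.
mapping = map_registers({"r0": 50, "r1": 5, "r2": 20},
                        ports=[0], num_positions=8)
# mapping == {"r0": 0, "r2": 1, "r1": 2}
```

A real mapping pass would also account for registers accessed together in time, since consecutive accesses to far-apart positions are what actually accumulate shift delay; this sketch only captures the per-register frequency heuristic.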