Journal Articles
4 articles found
1. PARBLO: Page-Allocation-Based DRAM Row Buffer Locality Optimization (cited 2 times)
Authors: 米伟, 冯晓兵, 贾耀仓, 陈莉, 薛京灵. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2009, Issue 6, pp. 1086-1097 (12 pages).
DRAM row buffer conflicts can increase memory access latency significantly. This paper presents a new page-allocation-based optimization that works seamlessly together with some existing hardware and software optimizations to eliminate significantly more row buffer conflicts. Validation in simulation, using a set of selected scientific and engineering benchmarks against a few representative memory controller optimizations, shows that our method can reduce row buffer miss rates by up to 76% (with an average of 37.4%). This reduction in row buffer miss rates translates into performance speedups of up to 15% (with an average of 5%).
Keywords: DRAM row buffer, page allocation, locality optimization
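The abstract does not spell out how pages are placed. As a rough, hypothetical illustration of what a page-allocation-based row-buffer optimization can look like (not the PARBLO algorithm itself), the sketch below groups free physical frames by the DRAM bank their row maps to and rotates one allocation's pages across banks, so that hot pages are less likely to evict each other's open rows. The address-mapping constants and the BankAwareAllocator class are invented for illustration and are not taken from the paper.

```cuda
// Hypothetical sketch (not the PARBLO implementation): a toy physical-frame
// allocator that groups free frames by the DRAM bank their row maps to and
// rotates one allocation's pages across banks, so two hot pages are less
// likely to keep evicting each other's open rows. Mapping constants assumed.
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t kPageSize = 4096;  // bytes per OS page
constexpr uint32_t kRowSize  = 8192;  // bytes covered by one row buffer (assumed)
constexpr uint32_t kNumBanks = 8;     // DRAM banks (assumed)

// Assumed physical address mapping: bank = (frame address / row size) % banks.
static uint32_t BankOfFrame(uint32_t frame_id) {
    uint64_t addr = static_cast<uint64_t>(frame_id) * kPageSize;
    return static_cast<uint32_t>((addr / kRowSize) % kNumBanks);
}

class BankAwareAllocator {
public:
    explicit BankAwareAllocator(uint32_t num_frames) : free_by_bank_(kNumBanks) {
        for (uint32_t f = 0; f < num_frames; ++f)
            free_by_bank_[BankOfFrame(f)].push_back(f);
    }

    // Hand out `n` frames for one virtual region, rotating across banks so
    // that consecutive pages of the region land in different row buffers.
    std::vector<uint32_t> Allocate(uint32_t n) {
        std::vector<uint32_t> frames;
        while (frames.size() < n) {
            bool found = false;
            for (uint32_t tried = 0; tried < kNumBanks; ++tried) {
                uint32_t b = (next_bank_ + tried) % kNumBanks;
                if (!free_by_bank_[b].empty()) {
                    frames.push_back(free_by_bank_[b].back());
                    free_by_bank_[b].pop_back();
                    next_bank_ = (b + 1) % kNumBanks;  // next page goes elsewhere
                    found = true;
                    break;
                }
            }
            if (!found) break;  // out of free frames
        }
        return frames;
    }

private:
    std::vector<std::vector<uint32_t>> free_by_bank_;  // per-bank free lists
    uint32_t next_bank_ = 0;
};

int main() {
    BankAwareAllocator allocator(1024);
    for (uint32_t f : allocator.Allocate(8))
        std::printf("frame %4u -> bank %u\n", f, BankOfFrame(f));
    return 0;
}
```

The same idea could equally be expressed as a page-coloring policy inside an OS buddy allocator; the stand-alone class is only meant to make the bank-rotation policy visible.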
2. PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs (cited 1 time)
Authors: 徐新海, 杨学军, 薛京灵, 林宇斐, 林一松. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2012, Issue 2, pp. 240-255 (16 pages).
GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, by 73.5% on average when errors occur early during execution and by 74.6% on average when they occur late. In addition, PartialRC also reduces the error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens.
Keywords: GPGPU, partial recomputing, fault tolerance, CUDA, checkpointing
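PartialRC's compiler analysis and fault-tolerance framework are not reproducible from the abstract alone; the CUDA sketch below only illustrates the underlying checkpoint-and-partial-recompute control flow under assumed names (ckpt_iter, ckpt_acc, CKPT_INTERVAL): each thread periodically snapshots its loop state to global memory, and after a detected error the kernel is relaunched with a resume flag so threads redo only the work done since their last snapshot rather than the whole region.

```cuda
// Minimal sketch of checkpoint-based partial recomputing (not the PartialRC
// framework itself): each thread snapshots its loop index and accumulator to
// global memory every CKPT_INTERVAL iterations; after a detected error the
// host relaunches the kernel with resume=true, and each thread restarts from
// its snapshot instead of recomputing from iteration 0.
#include <cstdio>
#include <cuda_runtime.h>

#define N             1024
#define ITERS         4096
#define CKPT_INTERVAL 256

__global__ void Compute(const float* in, float* out,
                        int* ckpt_iter, float* ckpt_acc, bool resume) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;

    // Resume from the per-thread checkpoint instead of from scratch.
    int   start = resume ? ckpt_iter[tid] : 0;
    float acc   = resume ? ckpt_acc[tid]  : 0.0f;

    for (int i = start; i < ITERS; ++i) {
        acc += in[tid] * 0.001f;                 // stand-in for real work
        if ((i + 1) % CKPT_INTERVAL == 0) {      // periodic lightweight checkpoint
            ckpt_iter[tid] = i + 1;
            ckpt_acc[tid]  = acc;
        }
    }
    out[tid] = acc;
}

int main() {
    float *in, *out, *ckpt_acc;
    int   *ckpt_iter;
    cudaMallocManaged(&in,  N * sizeof(float));
    cudaMallocManaged(&out, N * sizeof(float));
    cudaMallocManaged(&ckpt_acc,  N * sizeof(float));
    cudaMallocManaged(&ckpt_iter, N * sizeof(int));
    for (int i = 0; i < N; ++i) in[i] = 1.0f;

    // Normal run: checkpoints are written as a side effect.
    Compute<<<(N + 255) / 256, 256>>>(in, out, ckpt_iter, ckpt_acc, false);
    cudaDeviceSynchronize();

    // Pretend an error detector fired: relaunch, recomputing only the work
    // done since each thread's last checkpoint.
    Compute<<<(N + 255) / 256, 256>>>(in, out, ckpt_iter, ckpt_acc, true);
    cudaDeviceSynchronize();

    std::printf("out[0] = %f\n", out[0]);
    return 0;
}
```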
3. A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs (cited 1 time)
Authors: Yang Yang, Hui-Min Cui, Xiao-Bing Feng, Jing-Ling Xue. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2012, Issue 1, pp. 57-74 (18 pages).
In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning the multiple time steps in stencil computations so that circular queues can be implemented effectively by both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93X over methods that use circular queues implemented with shared memory only.
Keywords: stencil computation, circular queue, GPU, occupancy, register
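As a minimal illustration of what a register/shared-memory circular queue means for a stencil (not the paper's auto-tuning framework), the CUDA kernel below streams a 2D 5-point Jacobi stencil along y with a three-row queue: the row above the output point is only ever read at the owning thread's own x, so it stays in a register, while the centre row is read by x-neighbours and therefore lives in shared memory. All sizes, names, and the single-time-step simplification are assumptions.

```cuda
// Illustrative sketch (not the paper's framework): a 2D 5-point Jacobi
// stencil streamed along y with a 3-row circular queue. The row above the
// output point lives in a register (read only at the thread's own x); the
// centre row lives in shared memory (read at x-1, x, x+1); the row below is
// fetched from global memory each step. Assumes NX is a multiple of BX.
#include <cstdio>
#include <cuda_runtime.h>

#define NX 256
#define NY 256
#define BX 128   // threads per block along x

__global__ void Stencil2D(const float* in, float* out) {
    __shared__ float mid[BX + 2];            // centre row, with halo cells
    int x = blockIdx.x * BX + threadIdx.x;   // global x owned by this thread

    float top = in[0 * NX + x];              // row y-1: register-resident
    // Stream down the grid one row at a time, rotating the queue.
    for (int y = 1; y < NY - 1; ++y) {
        // Stage row y in shared memory so x-neighbours are visible.
        mid[threadIdx.x + 1] = in[y * NX + x];
        if (threadIdx.x == 0)
            mid[0] = (x > 0) ? in[y * NX + x - 1] : 0.0f;
        if (threadIdx.x == BX - 1)
            mid[BX + 1] = (x < NX - 1) ? in[y * NX + x + 1] : 0.0f;
        __syncthreads();

        float bottom = in[(y + 1) * NX + x]; // row y+1 straight from global
        out[y * NX + x] = 0.2f * (top + bottom + mid[threadIdx.x] +
                                  mid[threadIdx.x + 1] + mid[threadIdx.x + 2]);
        __syncthreads();                     // finish reads before next overwrite

        top = mid[threadIdx.x + 1];          // rotate: centre row becomes top row
    }
}

int main() {
    float *in, *out;
    cudaMallocManaged(&in,  NX * NY * sizeof(float));
    cudaMallocManaged(&out, NX * NY * sizeof(float));
    for (int i = 0; i < NX * NY; ++i) { in[i] = 1.0f; out[i] = 0.0f; }

    Stencil2D<<<(NX + BX - 1) / BX, BX>>>(in, out);
    cudaDeviceSynchronize();
    std::printf("out[NX + 1] = %f\n", out[NX + 1]);
    return 0;
}
```

The hybrid split shows the design point the paper tunes automatically: data reused only by the owning thread can stay in registers, while data shared across threads must go through indirectly addressable shared memory.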
4. Leakage-Aware Modulo Scheduling for Embedded VLIW Processors
Authors: 关永, 薛京灵. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2011, Issue 3, pp. 405-417 (13 pages).
As semiconductor technologies move down to the nanometer scale, leakage power has become a significant component of the total power consumption. In this paper, we present a leakage-aware modulo scheduling algorithm to achieve leakage energy savings for applications with loops on Very Long Instruction Word (VLIW) architectures. The proposed algorithm is designed to maximize the idleness of function units integrated with dual-threshold domino logic and to reduce the number of transitions between the active and sleep modes. We have implemented our technique in the Trimaran compiler and conducted experiments using a set of embedded benchmarks from DSPstone and MiBench on the cycle-accurate VLIW simulator of Trimaran. The results show that our technique achieves significant leakage energy savings compared with a previously published DAG-based (Directed Acyclic Graph) leakage-aware scheduling algorithm.
Keywords: leakage power, very long instruction word (VLIW), software pipelining, modulo scheduling
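The paper's algorithm lives inside Trimaran's modulo scheduler; the stand-alone sketch below is only a toy reservation-table scheduler that ignores operation latencies and recurrences, but it shows the leakage-oriented placement idea: when several function units are free in a kernel cycle, the operation is packed onto the most-used unit so the remaining units keep long contiguous idle windows and need few active/sleep transitions. All numbers and names are made up for illustration.

```cuda
// Toy sketch of the idea behind leakage-aware modulo scheduling (not the
// paper's Trimaran implementation): loop-body operations are placed into an
// II-cycle reservation table, and ties between free function units are broken
// toward the busiest unit so lightly used units can stay in sleep mode.
#include <cstdio>
#include <vector>

struct Op { int id; int earliest; };            // earliest legal cycle (from deps)

int main() {
    const int II        = 4;                    // initiation interval (assumed)
    const int kNumUnits = 3;                    // identical function units (assumed)

    // A made-up loop body: op id and the earliest cycle its dependences allow.
    std::vector<Op> body = { {0, 0}, {1, 0}, {2, 1}, {3, 1}, {4, 2}, {5, 3} };

    // reservation[u][c] == op id occupying unit u at kernel cycle c, or -1.
    std::vector<std::vector<int>> reservation(kNumUnits, std::vector<int>(II, -1));
    std::vector<int> use_count(kNumUnits, 0);   // how busy each unit already is

    for (const Op& op : body) {
        bool placed = false;
        // Try cycles in modulo order starting from the dependence-legal one.
        for (int d = 0; d < II && !placed; ++d) {
            int cycle = (op.earliest + d) % II;
            // Leakage-aware tie-break: among free units, pick the busiest so the
            // other units keep long idle windows (fewer active/sleep switches).
            int best = -1;
            for (int u = 0; u < kNumUnits; ++u)
                if (reservation[u][cycle] == -1 &&
                    (best == -1 || use_count[u] > use_count[best]))
                    best = u;
            if (best != -1) {
                reservation[best][cycle] = op.id;
                ++use_count[best];
                placed = true;
            }
        }
        if (!placed) std::printf("op %d does not fit; II must grow\n", op.id);
    }

    for (int u = 0; u < kNumUnits; ++u) {
        std::printf("unit %d:", u);
        for (int c = 0; c < II; ++c) std::printf(" %2d", reservation[u][c]);
        std::printf("   (ops scheduled: %d)\n", use_count[u]);
    }
    return 0;
}
```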