Journal Articles
3 articles found
1. TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips
Authors: HUIZHENG WANG, QIZE YANG, TAIQUAN WEI, XINGMAO YU, CHENGRAN LI, JIAHAO FANG, GUANGYANG LU, XU DAI, LIANG LIU, SHENFEI JIANG, YANG HU, SHOUYI YIN, SHAOJUN WEI. Integrated Circuits and Systems, 2024, No. 4, pp. 178-195 (18 pages)
Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on computational power and bandwidth for hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect benefits of WSCs and to alleviate their on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short in WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, fully exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor-partition scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on communication-volume analysis based on mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities increases the complexity of the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared to AccPar, DeepSpeed, and Megatron, TMAC delivers 3.1×, 2.9×, and 1.6× performance gains, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and DeepSpeed, respectively, and is comparable to Megatron's full-recomputation method.
Keywords: large language models; recomputation; tensor partition; training; wafer-scale chips
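The recomputation trade-off that TMAC folds into its design space can be illustrated with a toy cost model. This is a hedged sketch of the general activation-checkpointing idea, not TMAC's actual framework: all function names, the memory model, and the FLOP model are illustrative assumptions. Activations of non-checkpointed layers are discarded in the forward pass and recomputed during the backward pass, trading extra compute for lower peak memory.

```python
# Toy model of activation recomputation (gradient checkpointing).
# NOT TMAC itself: names and cost formulas are illustrative assumptions.

def training_cost(n_layers, act_mem_per_layer, fwd_flops_per_layer,
                  checkpoint_every):
    """Estimate peak activation memory and total forward-pass FLOPs when
    only every `checkpoint_every`-th layer's activations are stored."""
    n_checkpoints = -(-n_layers // checkpoint_every)  # ceil division
    # Peak memory: all stored checkpoints plus one recomputed segment
    # of `checkpoint_every` layers live at the same time.
    peak_mem = (n_checkpoints + checkpoint_every) * act_mem_per_layer
    # Every non-checkpointed layer runs twice: forward + recompute.
    total_flops = fwd_flops_per_layer * (n_layers + (n_layers - n_checkpoints))
    return peak_mem, total_flops

# Storing every layer (checkpoint_every=1) recomputes nothing.
full_mem, full_flops = training_cost(32, 1.0, 1.0, checkpoint_every=1)
# Checkpointing every 8th layer cuts memory but adds recompute FLOPs.
ckpt_mem, ckpt_flops = training_cost(32, 1.0, 1.0, checkpoint_every=8)
```

Under this model, checkpointing every 8th of 32 layers shrinks peak activation memory from 33 to 12 units while raising forward FLOPs from 32 to 60 — the kind of memory/compute trade-off a mapping explorer must weigh per operator.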
2. An Improved Frequent Subgraph Mining Algorithm
Authors: LI Liang, CHEN Li, LI Hua, WANG Shanshan, ZHANG Minchao. Computers and Applied Chemistry (CAS; CSCD; Peking University Core Journal), 2014, No. 2, pp. 161-165 (5 pages)
Exact matching of a target graph against a large collection of graphs is a very time-consuming task, so frequent subgraph mining has attracted wide research attention as a way to improve retrieval efficiency. Frequent subgraph mining can discard graphs that are highly dissimilar to the target graph, shrinking the graph dataset and thereby making target-graph retrieval faster. The FFSM algorithm is a comparatively effective frequent subgraph mining algorithm, but in practice it consumes a large amount of storage space. Building on FFSM, this paper integrates the recomputed-embedding technique into the algorithm after a data-preprocessing step, and uses the improved algorithm to build a classification index. Finally, the new algorithm is applied to data processing in a chemical virtual-synthesis system; experimental results show that it retrieves target compounds significantly faster than the original FFSM algorithm.
Keywords: frequent subgraph mining; recomputed embedding technique; FFSM algorithm; preprocessing
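The filter-then-verify idea behind the abstract can be sketched in a few lines. This is a hedged illustration, not the FFSM algorithm or the paper's index: the feature extractor below (labelled edges standing in for mined frequent subgraphs) and all function names are toy assumptions. Candidates missing any feature of the query are pruned, so expensive exact matching runs only on the survivors.

```python
# Toy filter-then-verify pruning for graph retrieval.
# NOT FFSM: labelled edges stand in for mined frequent subgraphs.

def edge_features(graph):
    """Feature set of a graph given as a list of (label_u, label_v) edges;
    edge endpoints are sorted so (C, O) and (O, C) match."""
    return {tuple(sorted(e)) for e in graph}

def filter_candidates(dataset, query):
    """Keep only graphs whose feature set covers every query feature;
    only these survivors need expensive exact subgraph matching."""
    q = edge_features(query)
    return [g for g in dataset if q <= edge_features(g)]

dataset = [
    [("C", "O"), ("C", "C")],   # contains both query edges -> kept
    [("N", "N")],               # shares nothing with the query -> pruned
    [("C", "C"), ("C", "N")],   # missing the C-O edge -> pruned
]
query = [("C", "C"), ("C", "O")]
survivors = filter_candidates(dataset, query)
```

Because feature-set containment is a necessary condition for subgraph containment, the filter can never discard a true match; it only shrinks the candidate set, which is exactly why it speeds up retrieval.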
3. PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs (cited by 1)
Authors: XU Xinhai, YANG Xuejun, XUE Jingling, LIN Yufei, LIN Yisong. Journal of Computer Science & Technology (SCIE; EI; CSCD), 2012, No. 2, pp. 240-255 (16 pages)
GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault-recovery method for CPUs) shows that PartialRC significantly reduces the fault-recovery overheads incurred by FullRC: by 73.5% on average when errors occur earlier during execution and by 74.6% when errors occur later. In addition, PartialRC also reduces the error-detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overheads when no fault happens.
Keywords: GPGPU; partial recomputing; fault tolerance; CUDA; checkpointing
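The difference between full and partial recomputation on rollback can be shown with a minimal cost model. This is a hedged sketch of the general idea only, not PartialRC's compiler-directed CUDA implementation: the step-count model and all names are illustrative assumptions. Full recomputing re-executes the whole region from the checkpoint, while partial recomputing re-executes only the work invalidated after the last consistent point.

```python
# Toy cost model comparing full vs. partial recomputation on fault recovery.
# NOT the PartialRC implementation: names and the step model are assumptions.

def recovery_work(fault_at, last_consistent, partial):
    """Units of re-execution needed after a fault at step `fault_at`,
    measured from the region's checkpoint at step 0."""
    if partial:
        # Redo only the tail invalidated after the last consistent point.
        return fault_at - last_consistent
    # Full recomputing: roll back to the checkpoint and redo everything.
    return fault_at

full_cost = recovery_work(fault_at=80, last_consistent=60, partial=False)
part_cost = recovery_work(fault_at=80, last_consistent=60, partial=True)
```

In this toy setting, a fault at step 80 with a consistent point at step 60 costs 80 units to recover under full recomputing but only 20 under partial recomputing — the later the consistent point, the larger the saving, mirroring the paper's observation that savings depend on when errors occur.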