Abstract
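Exascale supercomputers face the severe challenge of running more than one billion floating-point fused multiply-add (FMA) units simultaneously, so a small change in the error detection rate of a single FMA unit can cause a large change in system availability. The high clock frequency of exascale supercomputer cores and the need for real-time checking place stricter demands on the timing of the checking logic. At the same time, exascale supercomputers must limit system scale, so more cores are integrated in the same chip area and on-chip resources are tight. FMA checker design therefore needs to control the timing and area overhead of the checking logic while preserving error detection capability. This paper proposes a parallel cyclic 4:2 compression structure. As the modulus of the residue system increases, this structure reduces the timing and area overhead of the residue generation logic while improving the error detection capability of the residue system. This paper also studies FMA mantissa operations in the residue domain and proposes accelerated residue-domain transformations for the negated sign-extension operation, the multiplication mantissa, and the addition mantissa. Experimental results show that, compared with residue generation logic based on a modular adder tree and on CSA (carry-save adder) 3:2 compression, the proposed parallel cyclic 4:2 hybrid compression residue generation logic achieves timing improvements of up to 19.64% and 6.75% and area reductions of up to 71% and 18.18%, respectively. The modulo-63 residue check based on the parallel cyclic 4:2 compression tree outperforms the modulo-15 floating-point FMA check design adopted by IBM in area overhead, error detection rate, and system availability: area overhead and error detection rate improve by 67.61% and 5%, respectively, and system availability improves by up to 49.6%.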
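To make the idea of residue generation with cyclic carry handling concrete, the following is a minimal software sketch, not the circuit from the paper. It assumes a modulus of the form 2^k - 1 (63 for k = 6), splits the operand into k-bit partitions, and reduces them with carry-save compressors whose top carry bit wraps around to bit 0, which is valid because 2^k ≡ 1 modulo 2^k - 1. The names `rotl`, `csa_cyclic`, `compress_4_2`, and `residue_csa` are hypothetical, and the 4:2 compressor is modeled simply as two cascaded 3:2 stages.

```python
def rotl(x: int, s: int, k: int) -> int:
    """Rotate the k-bit word x left by s positions."""
    s %= k
    mask = (1 << k) - 1
    return ((x << s) | (x >> (k - s))) & mask


def csa_cyclic(a: int, b: int, c: int, k: int):
    """3:2 carry-save step modulo 2**k - 1.

    a + b + c == sum_word + 2 * carry_word, and doubling a k-bit word
    modulo 2**k - 1 is a 1-bit left rotation (end-around carry).
    """
    sum_word = a ^ b ^ c
    carry_word = (a & b) | (a & c) | (b & c)
    return sum_word, rotl(carry_word, 1, k)


def compress_4_2(a: int, b: int, c: int, d: int, k: int):
    """4:2 compression modeled as two cascaded cyclic 3:2 steps (an assumption)."""
    s1, c1 = csa_cyclic(a, b, c, k)
    return csa_cyclic(s1, c1, d, k)


def residue_csa(x: int, k: int = 6) -> int:
    """Residue of a non-negative integer x modulo 2**k - 1."""
    mask = (1 << k) - 1
    parts = []
    while x:
        parts.append(x & mask)  # one k-bit partition
        x >>= k
    # Compress the partitions until at most two words remain.
    while len(parts) > 2:
        if len(parts) >= 4:
            group, rest = parts[:4], parts[4:]
            parts = rest + list(compress_4_2(*group, k))
        else:
            group, rest = parts[:3], parts[3:]
            parts = rest + list(csa_cyclic(*group, k))
    # Final modular addition with end-around carry.
    total = sum(parts)
    total = (total & mask) + (total >> k)
    return 0 if total == mask else total  # all-ones also encodes zero
```

Calling `residue_csa(x)` should agree with `x % 63`, and `residue_csa(x, 4)` with `x % 15`; in hardware the loop would correspond to a fixed compressor tree rather than sequential iteration.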
The simultaneous operation of billions of floating-point fused multiply-add (FMA) units poses severe availability challenges for exascale supercomputers. To ensure sustainable and efficient operation, processors must adopt more efficient fault-tolerance mechanisms for FMA units. In an exascale supercomputer, real-time checking on high-frequency processors and limited on-chip resources make the design of the FMA checker challenging: the checker must account for timing and hardware overhead while achieving better error detection coverage. A floating-point FMA unit adopts a fused design and has to handle several special operations of the IEEE 754 standard, such as mantissa alignment shift, normalization, and rounding; as a result, the widely used residue-domain transformation cannot effectively accelerate residue encoding in FMA units. In this paper, we propose a parallel cyclic 4:2 compressor-based residue generation technique, which reduces the number of logic gates on the critical path as the modulus increases. By applying cyclic carry processing to the highest bit of each partition, cyclic 4:2 compressors weaken the logical dependency in carry chains and reduce the overhead caused by carry correction. The cyclic 4:2 compressor can thus reduce the timing cost and hardware overhead of residue generation while improving error detection coverage. We also study mantissa calculation in the residue domain and, based on mathematical transformations, propose residue-domain compression techniques for the negated sign extension of the mantissa, mantissa multiplication, and mantissa addition. These techniques reduce the input data width of the residue generator and limit the alignment range by dividing and transforming the fused mantissa operation. For the negated sign extension of the mantissa in the residue domain, this paper reduces the overhead by transforming the operation into a combination of residue generation and modular subtraction. For mantissa multiplication in the residue domain, this paper uses mathematical transformations to separate the mantissa multiplication result from the fused mantissa operation on the FMA main path; the distributive law of multiplication in the residue domain can then be applied to reduce the overhead of computing the residue of the mantissa product. For mantissa addition in the residue domain, this paper avoids the large shift range caused by alignment by using modular shift and modular subtraction. With these techniques, the area overhead of the shift logic and multiplication logic can be reduced by a factor of 10 on average, and timing is also improved. Experimental results show that both the timing cost and the area cost of residue generation are reduced by the parallel cyclic compression structure. Compared with the modular adder-based residue generator and the carry-save adder-based residue generator, the parallel cyclic 4:2 compressor-based residue generator achieves up to 19.64% and 6.75% timing improvement and 71% and 18.18% area reduction, respectively. The residue system proposed in this paper outperforms the conventional design in terms of area overhead and error detection coverage. Compared with the modulo-15 residue check system for FMA proposed by IBM, the modulo-63 FMA checker based on the parallel cyclic 4:2 compressor reduces area by 67.61%, improves error coverage by 5%, and improves the availability of the exascale supercomputer by up to 49.6%.
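The residue-domain operations mentioned above (modular shift, modular subtraction, and a residue check of the fused result) can be illustrated with a small sketch. This is a generic model of residue checking modulo 63 under simplifying assumptions, not the paper's design: mantissas are treated as plain integers, the addend is aligned by a left shift, and rounding and normalization are ignored. The function names and the `shift` and `subtract` parameters are hypothetical.

```python
K = 6
M = (1 << K) - 1  # modulus 63 = 2**6 - 1


def mod_shift(r: int, shift: int) -> int:
    """r * 2**shift modulo M, computed as a K-bit rotation of the residue."""
    s = shift % K
    return ((r << s) | (r >> (K - s))) & M


def mod_neg(r: int) -> int:
    """-r modulo M, computed as the bitwise complement (M - r)."""
    return ~r & M


def mod_add(x: int, y: int) -> int:
    """x + y modulo M with end-around carry; may return M, which encodes 0."""
    t = x + y
    return (t & M) + (t >> K)


def canon(r: int) -> int:
    """Map the all-ones encoding of zero to 0."""
    return 0 if r == M else r


def fma_residue_ok(a: int, b: int, c: int, shift: int,
                   subtract: bool, result: int) -> bool:
    """Check result == a*b +/- (c << shift) using residues only.

    The product residue uses plain % for brevity; the aligned addend is
    handled by rotating its residue instead of shifting the wide operand.
    """
    prod_res = ((a % M) * (b % M)) % M
    add_res = mod_shift(c % M, shift)
    if subtract:
        add_res = mod_neg(add_res)  # modular subtraction as add-of-complement
    expected = canon(mod_add(prod_res, add_res))
    return expected == result % M
```

Because 2^i is never congruent to 0 modulo 63, flipping any single bit of `result` changes its residue, so a check of this form flags every single-bit error in the result; errors whose magnitude is a multiple of 63 go undetected.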
Authors
高剑刚
刘骁
郑方
唐勇
GAO Jian-Gang; LIU Xiao; ZHENG Fang; TANG Yong (National Research Center of Parallel Computer Engineering and Technology, Beijing 100190)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core Journal (北大核心)
2023, No. 6, pp. 1103-1120 (18 pages)
Chinese Journal of Computers
Keywords
floating-point fused multiply-add
availability
residue check (floating-point check)
modular adder
parallel cyclic compression