Abstract: Deep learning (DL) accelerators are critical for handling the growing computational demands of modern neural networks. Systolic array (SA)-based accelerators consist of a 2D mesh of processing elements (PEs) working cooperatively to accelerate matrix multiplication. The power efficiency of such accelerators is of primary importance, especially in the edge AI regime. This work presents the SAPER-AI accelerator, an SA accelerator whose power intent is specified via a Unified Power Format representation in a simplified manner, with negligible microarchitectural optimization effort. The proposed accelerator switches off rows and columns of PEs in a coarse-grained manner, yielding an SA microarchitecture that complies with the varying computational requirements of modern DL workloads. Our analysis demonstrates power-efficiency improvements of 10% and 25% in the best case for the 32×32 and 64×64 SA designs, respectively. Additionally, the power-delay product (PDP) exhibits a progressive improvement of around 6% for larger SA sizes. Moreover, a performance comparison between the MobileNet and ResNet50 models indicates generally better SA performance for the ResNet50 workload. This is due to the more regular convolutions exhibited by ResNet50, which are better suited to SAs, with the performance gap widening as the SA size increases.
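The coarse-grained gating idea can be illustrated with a small back-of-the-envelope model. The sketch below is purely illustrative and is not the paper's UPF-based implementation: it assumes an S×S array onto which a layer's output tile of M rows by N columns is mapped, and estimates what fraction of PEs must stay powered when only whole rows and whole columns can be switched off. The function name and mapping assumption are hypothetical.

```python
# Illustrative model (not the SAPER-AI implementation): estimate the
# fraction of PEs that remain powered on an S x S systolic array when
# entire rows and columns can be gated off, assuming an output tile of
# m rows by n columns is mapped directly onto the array.

def active_pe_fraction(sa_size: int, m: int, n: int) -> float:
    """Fraction of PEs left powered when gating whole rows/columns."""
    active_rows = min(sa_size, m)   # rows needed by the tile
    active_cols = min(sa_size, n)   # columns needed by the tile
    return (active_rows * active_cols) / (sa_size * sa_size)

# Example: a 64x64 array computing a 48x32 output tile keeps
# 48 rows and 32 columns powered, i.e. 37.5% of the PEs.
frac = active_pe_fraction(64, 48, 32)  # -> 0.375
```

This toy model also hints at why larger arrays benefit more from gating: small or irregular layers (as in MobileNet's depthwise convolutions) leave a larger share of a big array idle, so the opportunity to switch off rows and columns grows with SA size.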