Abstract: On the one hand, accelerating convolutional neural networks (CNNs) on FPGAs demands ever-increasing energy efficiency in the edge computing paradigm. On the other hand, unlike ordinary digital algorithms, CNNs remain highly robust even in the presence of limited timing errors. Taking advantage of this unique feature, we propose using dynamic voltage and frequency scaling (DVFS) to further optimize the energy efficiency of CNNs. First, we develop a DVFS framework on FPGAs. Second, we apply DVFS to SkyNet, a state-of-the-art neural network targeting object detection. Third, we analyze the impact of DVFS on CNNs in terms of performance, power, energy efficiency, and accuracy. Experimental results show a 38% improvement in energy efficiency over the state of the art without any loss in accuracy, and a 47% improvement if a 0.11% relaxation in accuracy is allowed.
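The energy gain from DVFS follows from the first-order CMOS dynamic-power model, P ≈ C·V²·f: lowering voltage and frequency together cuts energy per operation by roughly V². The sketch below illustrates the trend only; the capacitance, voltage, and frequency values are hypothetical and are not taken from the paper.

```python
# Illustrative DVFS trade-off sketch (not the paper's actual model).
# First-order dynamic power: P = C_eff * V^2 * f; with throughput
# proportional to f, energy per operation scales as C_eff * V^2.

def dynamic_power(c_eff: float, v: float, f: float) -> float:
    """First-order dynamic power estimate: P = C_eff * V^2 * f (watts)."""
    return c_eff * v * v * f

def energy_per_op(c_eff: float, v: float, f: float, ops_per_sec: float) -> float:
    """Energy per operation = power / throughput (joules)."""
    return dynamic_power(c_eff, v, f) / ops_per_sec

# Hypothetical baseline operating point: 1.0 V, 200 MHz, one op per cycle.
c_eff = 1e-9
base = energy_per_op(c_eff, 1.0, 200e6, 200e6)

# Scaled point: 0.85 V, 150 MHz. Throughput drops with f, but energy per
# operation drops with V^2, so energy efficiency (ops/J) improves.
scaled = energy_per_op(c_eff, 0.85, 150e6, 150e6)

improvement = (base - scaled) / base  # equals 1 - 0.85^2 here
print(f"energy/op reduced by {improvement:.1%}")
```

With throughput tied to f, the frequency term cancels out of energy per operation, which is why DVFS can improve efficiency even though raw performance falls.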
Abstract: Previous studies show that interconnects occupy a large portion of the timing budget and area in FPGAs. In this work, we propose a time-multiplexing technique for FPGA interconnects. To fully exploit this interconnect architecture, we propose a time-multiplexed routing algorithm that actively identifies qualified nets and schedules them onto multiplexable wires. We validate the algorithm by using the router to implement 20 benchmark circuits on time-multiplexed FPGAs. When a wire can be time-multiplexed six times in a cycle, we achieve a 38% smaller minimum channel width and a 3.8% smaller circuit critical path delay compared with the state-of-the-art architecture and router.
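The core idea of scheduling nets onto multiplexable wires can be sketched as a slot-assignment problem: each wire offers up to K time slots per cycle, so several nets can share one physical track. The greedy toy router below is a hypothetical illustration of that idea only, not the paper's algorithm, and all names and parameters are invented for the example.

```python
# Hypothetical sketch of net-to-slot scheduling on time-multiplexed wires
# (not the paper's routing algorithm). Each wire can carry up to
# `slots_per_wire` nets per cycle, one per time slot.

from collections import defaultdict

def schedule_nets(nets, num_wires, slots_per_wire):
    """Greedily map each net id to a (wire, slot) pair.

    Returns a dict {net: (wire, slot)}, or None if the channel's total
    capacity (num_wires * slots_per_wire) is exceeded.
    """
    assignment = {}
    used = defaultdict(int)  # wire index -> slots already consumed
    wire = 0
    for net in nets:
        # Advance to the first wire with a free time slot.
        while wire < num_wires and used[wire] >= slots_per_wire:
            wire += 1
        if wire >= num_wires:
            return None  # not routable in this channel
        assignment[net] = (wire, used[wire])
        used[wire] += 1
    return assignment

# 12 nets fit on just 2 wires when each wire is multiplexed 6 times per
# cycle, versus 12 wires at one net per wire: the channel shrinks 6x.
plan = schedule_nets(range(12), num_wires=2, slots_per_wire=6)
print(plan[11])
```

This mirrors the abstract's result qualitatively: allowing six time-multiplexed uses of a wire per cycle lets the router pack more nets into fewer tracks, shrinking the minimum channel width.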
Abstract: Many artificial intelligence (AI) processing tasks, especially those related to deep neural networks (DNNs), are both computation and memory intensive. Traditional computing platforms such as CPUs increasingly struggle to handle these massive processing workloads. Reconfigurable computing (RC) offers the ability to perform computations in hardware to increase execution capability while retaining much of the flexibility of a software solution. Microchip design based on reconfigurable computing models and principles has emerged as an effective means of accelerating AI applications to meet not only performance and throughput targets but also power and energy efficiency requirements.
Funding: Supported by the National Natural Science Foundation of China under Grant 62150710549 and Grant U2441247.
Abstract: Ultra-low-voltage SRAM is an indispensable component that is increasingly adopted in energy-efficient computing systems. However, it comes at the cost of increased sensitivity to soft errors. To address this issue, bit-interleaving SRAM is widely used to mitigate soft errors, but it suffers from half-select disturbance. Previous works address this disturbance with a dedicated write port or an enhanced write-assist scheme; however, these approaches may degrade the write margin, induce high cell-level write latency, or incur architecture-level timing overhead. In this paper, we develop a high-speed, bit-interleaving, half-select disturb-free memory with data-aware 10T SRAM. First, we present an isolated and decoupled topology with dedicated write control to improve stability. Second, we present a data-aware write path with enhanced write-ability that effectively reduces the write access time. A 40-nm 4-Kb test chip has been fabricated to validate these optimizations. Measurement results show that our half-select disturb-free test chip achieves a peak operating frequency of 25 MHz and an energy consumption of 0.168 fJ/bit at a supply voltage of 0.35 V. Compared with state-of-the-art designs, it achieves a 2.72× speedup and a 93.8% energy saving.
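The quoted figures can be sanity-checked with simple arithmetic. The sketch below uses only the numbers stated in the abstract (0.168 fJ/bit, 25 MHz, 4-Kb array); the one-bit-access-per-cycle activity assumption is ours, not the paper's, and serves only to bound sustained access power.

```python
# Back-of-envelope check using the figures quoted in the abstract.
# Assumption (not from the paper): one bit access per cycle at the
# 25 MHz peak frequency, giving an upper bound on sustained power.

E_BIT = 0.168e-15      # measured energy per bit access, joules (0.168 fJ)
F_PEAK = 25e6          # peak operating frequency, Hz
ARRAY_BITS = 4 * 1024  # 4-Kb test-chip array

# Energy to touch every bit in the array once:
array_energy = E_BIT * ARRAY_BITS   # joules (~0.69 pJ)

# Hypothetical sustained power at one bit access per cycle:
power = E_BIT * F_PEAK              # watts (~4.2 nW)

print(f"full-array energy: {array_energy * 1e12:.3f} pJ")
print(f"one-bit-per-cycle power: {power * 1e9:.1f} nW")
```

Sub-picojoule full-array access energy and nanowatt-scale access power are consistent with the ultra-low-voltage, energy-efficient operating point the abstract describes.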
Funding: Supported by the National Natural Science Foundation of China under Grant 62220106011.
Abstract: As integrated circuits advance into the post-Moore era, improving computing performance faces several challenges, making it difficult to meet ever-growing computing demands. Cryogenic complementary metal-oxide-semiconductor (CMOS) based computing systems have emerged as a promising solution for overcoming the existing performance bottleneck. By cooling the circuitry to cryogenic temperatures, device leakage and wire resistance can be significantly reduced, leading to further improvements in energy efficiency and performance. Here, we conduct a comprehensive review of cryogenic CMOS based computing systems across multiple optimization layers, including the CMOS process, modeling, electronic design automation (EDA), circuits, and architecture. We also identify potential future work and applications.