The wide acceptance and data deluge in medical image processing require faster and more efficient systems to be built. Due to recent advances in heterogeneous architectures, there has been a resurgence of research aimed at FPGA-based as well as GPGPU-based accelerator design. This paper quantitatively analyzes the workload, computational intensity, and memory performance of a single-particle 3D reconstruction application called EMAN, and parallelizes it on CUDA GPGPU architectures; it decouples the memory operations from the computing flow and orchestrates the thread-data mapping to reduce the overhead of off-chip memory operations. It then exploits the trend towards FPGA-based accelerator design by offloading computing-intensive kernels to dedicated hardware modules. Furthermore, a customized memory subsystem is designed to facilitate the decoupling and optimization of the dominant data access patterns. The proposed accelerator design strategies are evaluated against a parallelized program on a 4-core CPU. The CUDA version on a GTX480 shows a speedup of about 6 times. The stream architecture implemented on a Xilinx Virtex LX330 FPGA achieves a speedup of 2.54 times. Meanwhile, measured in terms of power efficiency, the FPGA-based accelerator outperforms the 4-core CPU and the GTX480 by 7.3 times and 3.4 times, respectively.
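The benefit of decoupling memory operations from the computing flow can be illustrated with a simple pipeline cost model: while one data tile is being processed, the next tile's off-chip load proceeds in parallel. The per-tile cycle counts below are hypothetical illustrative numbers, not figures from the EMAN study.

```python
# Minimal cost model of decoupled (double-buffered) memory access vs.
# strictly sequential load-then-compute. Numbers are hypothetical.

def sequential_time(n_tiles, t_load, t_compute):
    """Every tile is loaded, then processed: no overlap."""
    return n_tiles * (t_load + t_compute)

def decoupled_time(n_tiles, t_load, t_compute):
    """Loads run in a separate stage: tile i+1 is fetched while tile i
    is processed, so the steady-state cost per tile is max(t_load, t_compute)."""
    if n_tiles == 0:
        return 0
    # The first load cannot be hidden, and the last compute has
    # no subsequent load to overlap with.
    return t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute

if __name__ == "__main__":
    n, t_load, t_compute = 100, 4, 5      # hypothetical cycle counts
    seq = sequential_time(n, t_load, t_compute)   # 100 * 9 = 900
    dec = decoupled_time(n, t_load, t_compute)    # 4 + 99 * 5 + 5 = 504
    print(seq, dec, round(seq / dec, 2))
```

With these sample costs the decoupled schedule hides almost all of the load latency behind compute, which is the effect the thread-data orchestration above is aiming for.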
In modern VLSI technology, hundreds of thousands of arithmetic units fit on a 1 cm^2 chip. The challenge is supplying them with instructions and data. Stream architecture addresses this problem well, but the range of applications suited to a typical stream architecture is limited. This paper presents definitions of regular and irregular streams, and then describes MASA (Multiple-morphs Adaptive Stream Architecture), a prototype system that supports different execution models according to an application's stream characteristics. The paper first discusses the MASA architecture and stream model, then explores the features and advantages of MASA by mapping stream applications to the hardware. Finally, MASA is evaluated on ten benchmarks, with encouraging results.
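The regular/irregular distinction above can be sketched as a property of the access pattern: a regular stream's addresses follow a fixed stride and are predictable ahead of time, while an irregular stream's accesses are data-dependent gathers. The classifier below is an illustrative sketch, not part of MASA itself.

```python
# Illustrative classifier for the regular vs. irregular stream
# distinction: regular here means consecutive accesses differ by a
# constant stride, so the address sequence is predictable in advance.

def is_regular_stream(indices):
    """Return True if the index sequence has a constant stride."""
    if len(indices) < 3:
        return True  # too short to exhibit irregularity
    stride = indices[1] - indices[0]
    return all(b - a == stride for a, b in zip(indices, indices[1:]))

if __name__ == "__main__":
    dense = list(range(0, 64, 4))   # fixed stride of 4: regular
    gather = [0, 7, 3, 42, 11]      # data-dependent indices: irregular
    print(is_regular_stream(dense))   # True
    print(is_regular_stream(gather))  # False
```

A multiple-morphs design can use exactly this kind of distinction to pick an execution model: strided streams map onto wide, prefetch-friendly datapaths, while gathers need a more flexible addressing path.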
Object detection algorithms based on convolutional neural networks (CNNs) significantly enhance accuracy by expanding network scale. As network parameters increase, large-scale networks demand substantial memory resources, making hardware deployment challenging. Although most neural network accelerators rely on off-chip storage, frequent access to external memory restricts processing speed, hindering the ability to meet the frame-rate requirements of embedded systems. This creates a trade-off in which the speed and accuracy of embedded object detection accelerators cannot be optimized simultaneously. In this paper, we propose PODALA, an energy-efficient accelerator developed through algorithm-hardware co-design. On the algorithm side, we develop an optimized detection network combining the inverted-residual structure with depthwise separable convolution, effectively reducing network parameters while preserving high detection accuracy. On the hardware side, we develop a custom layer-fusion technique for PODALA to minimize memory access requirements. The overall design employs a streaming hardware architecture that combines a computing array with a refined ping-pong output buffer to execute the different layer-fusion computing modes efficiently. Our approach substantially reduces memory usage through optimizations in both the algorithm and the hardware design. Evaluated on the Xilinx ZCU102 FPGA platform, PODALA achieves 78 frames per second (FPS) and 79.73 GOPS/W energy efficiency, underscoring its superiority over state-of-the-art solutions.
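The parameter reduction from depthwise separable convolution can be made concrete with the standard counting argument: a K x K convolution over C_in input channels producing C_out output channels is replaced by a K x K depthwise pass (one filter per input channel) plus a 1 x 1 pointwise pass that mixes channels. The layer sizes below are hypothetical, chosen only to show the typical order-of-magnitude saving; they are not PODALA's actual layer dimensions.

```python
# Parameter counts (weights only, biases omitted) for a standard
# convolution vs. its depthwise separable replacement.

def standard_conv_params(k, c_in, c_out):
    """K x K convolution with c_in input and c_out output channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one K x K filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise

if __name__ == "__main__":
    k, c_in, c_out = 3, 128, 256        # hypothetical layer shape
    std = standard_conv_params(k, c_in, c_out)        # 294912
    sep = depthwise_separable_params(k, c_in, c_out)  # 1152 + 32768 = 33920
    print(std, sep, round(std / sep, 1))
```

For a 3 x 3 kernel the saving approaches a factor of about K^2 = 9 as the channel counts grow, which is why this structure shrinks model memory footprints enough to ease on-chip deployment.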
Funding: Supported by the National Basic Research Program of China (No. 2012CB316502), the National High Technology Research and Development Program of China (No. 2009AA01A129), and the National Natural Science Foundation of China (No. 60921002).
Funding: Supported by the National Natural Science Foundation of China under Grants 62104025, 62104229, and 62104259.