Journal Articles
572 articles found
A Novel Quantization and Model Compression Approach for Hardware Accelerators in Edge Computing
1
Authors: Fangzhou He, Ke Ding, Dingjiang Yan, Jie Li, Jiajun Wang, Mingzhe Chen. Computers, Materials & Continua, SCIE/EI, 2024, No. 8, pp. 3021-3045 (25 pages)
Abstract: The massive computational complexity and memory requirements of artificial intelligence models impede their deployability on edge computing devices of the Internet of Things (IoT). While Power-of-Two (PoT) quantization has been proposed to improve the efficiency of edge inference for Deep Neural Networks (DNNs), existing PoT schemes require a huge amount of bit-wise manipulation, have large memory overhead, and their efficiency is bounded by the bottleneck of computation latency and memory footprint. To tackle this challenge, we present an efficient inference approach based on PoT quantization and model compression. An integer-only scalar PoT quantization (IOS-PoT) is designed jointly with a distribution-loss regularizer, wherein the regularizer minimizes quantization errors and training disturbances. Additionally, a two-stage model compression scheme is developed to effectively reduce memory requirements and alleviate bandwidth usage in communications of networked heterogeneous learning systems. The product look-up table (P-LUT) inference scheme replaces bit-shifting with only indexing and addition operations, achieving low-latency computation and enabling efficient edge accelerators. Finally, comprehensive experiments on Residual Networks (ResNets) and efficient architectures with the Canadian Institute for Advanced Research (CIFAR), ImageNet, and Real-world Affective Faces Database (RAF-DB) datasets indicate that our approach achieves a 2x-10x reduction in both weight size and computation cost compared to state-of-the-art methods. A P-LUT accelerator prototype is implemented on the Xilinx KV260 Field Programmable Gate Array (FPGA) platform for accelerating convolution operations; performance results show that P-LUT reduces memory footprint by 1.45x and achieves more than 3x power efficiency and 2x resource efficiency compared to the conventional bit-shifting scheme.
Keywords: edge computing; model compression; hardware accelerator; power-of-two quantization
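PoT quantization constrains each weight to a signed power of two, so a multiplication reduces to a bit shift (or, in the P-LUT scheme above, a table index plus addition). A minimal sketch of the idea, assuming simple nearest-exponent rounding; the paper's actual IOS-PoT mapping, regularizer, and the `pot_quantize` name and `n_bits` parameter here are illustrative assumptions, not the published scheme:

```python
import numpy as np

def pot_quantize(w, n_bits=4):
    """Illustrative power-of-two quantizer: snap each weight magnitude to the
    nearest power of two (exponent clipped to an n_bits-level range)."""
    sign = np.sign(w)
    mag = np.where(np.abs(w) == 0, 1e-12, np.abs(w))
    # nearest exponent, clipped so it fits the representable exponent range
    e = np.clip(np.round(np.log2(mag)), -(2 ** (n_bits - 1)), 0)
    return sign * (2.0 ** e)

w = np.array([0.3, -0.12, 0.05, 0.9])
print(pot_quantize(w))  # -> [0.25, -0.125, 0.0625, 1.0]
```

With weights in this form, an inference kernel never multiplies: it shifts (or indexes a precomputed product table) and accumulates, which is what makes the scheme attractive for FPGA accelerators.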
An FPGA-Based Resource-Saving Hardware Accelerator for Deep Neural Network
2
Authors: Han Jia, Xuecheng Zou. International Journal of Intelligence Science, 2021, No. 2, pp. 57-69 (13 pages)
Abstract: With the development of computer vision research, deep neural networks (DNNs) have been widely applied in various applications (autonomous vehicles, weather forecasting, counter-terrorism, surveillance, traffic management, etc.) due to their state-of-the-art performance on image and video processing tasks. However, to achieve such performance, DNN models have become increasingly complicated and deeper, resulting in heavy computational stress. Thus, general central processing unit (CPU) processors are not sufficient to meet real-time application requirements. To deal with this bottleneck, research on hardware acceleration solutions for DNNs has attracted great attention. Specifically, to serve various real-life applications, DNN acceleration solutions mainly address hardware acceleration under intense memory and calculation resource constraints. In this paper, a novel resource-saving architecture based on a Field Programmable Gate Array (FPGA) is proposed. Owing to the newly designed processing element (PE), the proposed architecture achieves good performance with extremely limited calculating resources. The on-chip buffer allocation helps enhance resource savings on memory. Moreover, the accelerator improves its performance by exploiting the sparsity of the input feature map. Compared to other state-of-the-art FPGA-based solutions, our architecture achieves good performance with quite limited resource consumption, thus fully meeting the requirements of real-time applications.
Keywords: deep neural network; resource saving; hardware accelerator; data flow
THUBrachy: Fast Monte Carlo Dose Calculation Tool Accelerated by Heterogeneous Hardware for High-Dose-Rate Brachytherapy (Cited by 1)
3
Authors: An-Kang Hu, Rui Qiu, Huan Liu, Zhen Wu, Chun-Yan Li, Hui Zhang, Jun-Li Li, Rui-Jie Yang. Nuclear Science and Techniques, SCIE/EI/CAS/CSCD, 2021, No. 3, pp. 107-119 (13 pages)
Abstract: The Monte Carlo (MC) simulation is regarded as the gold standard for dose calculation in brachytherapy, but it consumes a large amount of computing resources. The development of heterogeneous computing makes it possible to substantially accelerate calculations with hardware accelerators. Accordingly, this study develops a fast MC tool, called THUBrachy, which can be accelerated by several types of hardware accelerators. THUBrachy can simulate photons with energy less than 3 MeV and considers all photon interactions in this energy range. It was benchmarked against the American Association of Physicists in Medicine Task Group No. 43 Report using a water phantom and validated against Geant4 using a clinical case. A performance test on the clinical case shows that a multicore central processing unit, Intel Xeon Phi, and graphics processing unit (GPU) can all efficiently accelerate the simulation. GPU-accelerated THUBrachy is the fastest version, 200 times faster than the serial version and approximately 500 times faster than Geant4. The proposed tool shows great potential for fast and accurate dose calculations in clinical applications.
Keywords: high-dose-rate brachytherapy; Monte Carlo; heterogeneous computing; hardware accelerators
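The inner loop of any MC photon-transport code of this kind is sampling the free path between interactions from an exponential attenuation law. A minimal sketch of that single step (the attenuation coefficient value is a hypothetical placeholder, and THUBrachy's actual sampling code is of course far more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.2                                  # total attenuation coefficient, 1/cm (assumed value)
u = rng.random(100_000)                   # uniform random numbers in (0, 1)
d = -np.log(u) / mu                       # inverse-CDF sampling of exp(-mu*x) free paths
print(d.mean())                           # ~ 1/mu = 5 cm mean free path
```

Parallelizing exactly this kind of independent per-photon sampling is what lets a GPU deliver the ~200x speedup the abstract reports.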
System-on-a-Chip (SoC) Based Hardware Acceleration for Video Codec
4
Authors: Xinwei Niu, Jeffrey Fan. Optics and Photonics Journal, 2013, No. 2, pp. 112-117 (6 pages)
Abstract: Nowadays, from home monitoring to large-airport security, many digital video surveillance systems are in use. Digital surveillance systems usually require streaming video processing abilities. As an advanced video coding method, H.264 was introduced to reduce large video data dramatically (usually by 70x or more). However, computational overhead occurs when coding and decoding H.264 video. In this paper, a System-on-a-Chip (SoC) based hardware acceleration solution for video codecs is proposed, which can also be used for other software applications. The characteristics of the video codec are analyzed using a profiling tool. The Hadamard function, the bottleneck of H.264, is identified not only by execution time but also by two further attributes: cycles per loop and loop rounds. A co-processor approach is applied to accelerate the Hadamard function by moving it to hardware. Performance improvement, resource cost, and energy consumption are compared and analyzed. Experimental results indicate that a 76.5% energy reduction and an 8.09x speedup can be reached after balancing these three key factors.
Keywords: SoC; software profiling; hardware acceleration; video codec
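The Hadamard function the profiler flags is, at its core, the 4x4 Hadamard transform H.264 applies to DC coefficients: two matrix multiplies with a matrix containing only plus/minus ones, which is why it maps so well to an adder-only hardware co-processor. A reference sketch:

```python
import numpy as np

# 4x4 Hadamard matrix: entries are only +1/-1, so the transform
# needs additions and subtractions but no true multiplications
H = np.array([[1,  1,  1,  1],
              [1,  1, -1, -1],
              [1, -1, -1,  1],
              [1, -1,  1, -1]])

def hadamard4x4(block):
    """2-D Hadamard transform of a 4x4 block: H @ block @ H^T."""
    return H @ block @ H.T

X = np.arange(16).reshape(4, 4)
Y = hadamard4x4(X)
print(Y[0, 0])  # DC term = sum of all 16 input samples = 120
```

Because every product is with +1 or -1, a hardware implementation degenerates into a butterfly of adders, which is the property the co-processor in the paper exploits.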
Automatic Control System of an Ion Electrostatic Accelerator and Anti-Interference Measures (Cited by 1)
5
Authors: 孙振武, 霍裕平, 刘根成, 李玉晓, 李涛. Plasma Science and Technology, SCIE/EI/CAS/CSCD, 2007, No. 1, pp. 101-105 (5 pages)
Abstract: An automatic control system for an electrostatic accelerator has been developed by adopting Programmable Logic Controller (PLC) control, infrared and optical-fibre transmission, and network communication, with the purpose of improving the intelligence level of the accelerator and enhancing the ability to monitor, collect, and record parameters. In view of the control system's structure, several anti-interference measures were adopted after analyzing the interference sources. The hardware measures include controlling the position of the corona needle, using surge arresters, shielding, ground connection, and voltage stabilization. The software measures involve interlocking protection, soft limits, time delays, and diagnostic and protective programs. The electromagnetic compatibility of the control system has thus been effectively improved.
Keywords: electrostatic accelerator; computer control; software; hardware; electromagnetic interference
A Dynamically Reconfigurable Accelerator Design Using a Sparse-Winograd Decomposition Algorithm for CNNs
6
Authors: Yunping Zhao, Jianzhuang Lu, Xiaowen Chen. Computers, Materials & Continua, SCIE/EI, 2021, No. 1, pp. 517-535 (19 pages)
Abstract: Convolutional Neural Networks (CNNs) are widely used in many fields. Because of their high-throughput, computation-intensive characteristics, an increasing number of researchers are focusing on how to improve the computational efficiency, hardware utilization, and flexibility of CNN hardware accelerators. Accordingly, this paper proposes a dynamically reconfigurable accelerator architecture that implements a Sparse-Winograd F(2x2, 3x3)-based high-parallelism hardware architecture. This approach not only eliminates the pre-calculation complexity associated with the Winograd algorithm, thereby reducing the difficulty of hardware implementation, but also greatly improves hardware flexibility; as a result, the accelerator can perform conventional convolution, grouped convolution (GCONV), or depthwise separable convolution (DSC) on the same hardware architecture. Our experimental results show that the accelerator achieves a 3x-4.14x speedup on VGG-16 and MobileNet V1 compared with designs that do not use the acceleration algorithm. Moreover, compared with previous designs using the traditional Winograd algorithm, the accelerator achieves a 1.4x-1.8x speedup. At the same time, multiplier efficiency improves by up to 142%.
Keywords: high-performance computing; accelerator architecture; hardware
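The Winograd idea behind F(2x2, 3x3) is easiest to see in one dimension: F(2, 3) produces two outputs of a 3-tap convolution with 4 multiplications instead of the direct method's 6. A reference sketch of that 1-D building block (the paper's 2-D, sparse, reconfigurable mapping builds on nestings of it):

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap FIR over input d[0..3] with
    filter g[0..2], using 4 multiplies (m1..m4) instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]
print(winograd_f23(d, g))  # [6.0, 9.0], same as direct correlation
```

The filter-side factors (g[0]+g[1]+g[2])/2 etc. can be precomputed once per kernel, which is exactly the "pre-calculation" cost the paper's architecture works to eliminate in hardware.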
Neural Networks on an FPGA and Hardware-Friendly Activation Functions
7
Authors: Jiong Si, Sarah L. Harris, Evangelos Yfantis. Journal of Computer and Communications, 2020, No. 12, pp. 251-277 (27 pages)
Abstract: This paper describes our implementation of several neural networks built on a field programmable gate array (FPGA) and used to recognize a handwritten-digit dataset, the Modified National Institute of Standards and Technology (MNIST) database. We also propose a novel hardware-friendly activation function, the dynamic Rectified Linear Unit (ReLU), or D-ReLU, which achieves higher performance than traditional activation functions at no cost to accuracy. We built a 2-layer online-training multilayer perceptron (MLP) neural network on an FPGA with varying data widths. Reducing the data width from 8 to 4 bits reduces prediction accuracy by only 11%, while FPGA area decreases by 41%. Compared to networks that use sigmoid functions, our proposed D-ReLU function uses 24%-41% less area with no loss in prediction accuracy. Further reducing the data width of the 3-layer networks from 8 to 4 bits decreases prediction accuracy by only 3%-5%, with area reduced by 9%-28%. Moreover, the FPGA solutions have 29x faster execution time, despite running at a 60x lower clock rate. Thus, FPGA implementations of neural networks offer a high-performance, low-power alternative to traditional software methods, and our novel D-ReLU activation function offers additional improvements in performance and power savings.
Keywords: deep learning; D-ReLU; dynamic ReLU; FPGA; hardware acceleration; activation function
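The abstract does not define how D-ReLU varies, so the following is purely a hedged illustration of why ReLU-style functions suit FPGAs: if a "dynamic" ReLU is modeled as a ReLU whose positive-side slope is a run-time-selected power of two, the hardware reduces to a comparator plus a barrel shifter (the slope rule and the `dynamic_relu`/`shift` names are assumptions, not the paper's definition):

```python
def dynamic_relu(x, shift=1):
    """Hypothetical dynamic ReLU on integer data: 0 for negatives,
    x >> shift otherwise. A power-of-two slope keeps the hardware to
    a sign check plus a shifter -- no multiplier, unlike sigmoid."""
    return 0 if x < 0 else x >> shift

print([dynamic_relu(x, shift=1) for x in [-4, -1, 0, 3, 8]])  # [0, 0, 0, 1, 4]
```

A sigmoid, by contrast, needs an exponential (in practice a LUT or piecewise-linear approximation), which is where the 24%-41% area savings reported above come from.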
FPGA Accelerators for Computing Interatomic Potential-Based Molecular Dynamics Simulation for Gold Nanoparticles:Exploring Different Communication Protocols
8
Authors: Ankitkumar Patel, Srivathsan Vasudevan, Satya Bulusu. Computers, Materials & Continua, SCIE/EI, 2024, No. 9, pp. 3803-3818 (16 pages)
Abstract: Molecular Dynamics (MD) simulation for computing the Interatomic Potential (IAP) is a very important High-Performance Computing (HPC) application. MD simulation of particles of experimental relevance takes huge computation time, despite using an expensive high-end server. Heterogeneous computing, a combination of a Field Programmable Gate Array (FPGA) and a computer, is proposed as a solution for computing MD simulations efficiently. In such heterogeneous computation, communication between the FPGA and the computer is necessary. One such MD simulation, explained in this paper, is the Artificial Neural Network (ANN)-based IAP computation of gold (Au_(147) and Au_(309)) nanoparticles. The MD simulation calculates the forces between atoms and the total energy of the chemical system. This work proposes a novel design and implementation of an ANN-IAP-based MD simulation for Au_(147) and Au_(309) using communication protocols such as the Universal Asynchronous Receiver-Transmitter (UART) and Ethernet for communication between the FPGA and the host computer. To improve the latency of MD simulation through heterogeneous computing, both protocols were explored in an MD simulation of 50,000 cycles. In this study, computation times of 17.54 and 18.70 h were achieved with UART and Ethernet, respectively, compared to the conventional server time of 29 h for Au_(147) nanoparticles. The results pave the way for the development of lab-on-a-chip applications.
Keywords: Ethernet; hardware accelerator; heterogeneous computing; interatomic potential (IAP); MD simulation; peripheral component interconnect express (PCIe); UART
Hardware Design of Moving Object Detection on Reconfigurable System
9
Authors: Hung-Yu Chen, Yuan-Kai Wang. Journal of Computer and Communications, 2016, No. 10, pp. 30-43 (14 pages)
Abstract: Moving object detection, including background subtraction and morphological processing, is a critical research topic for video surveillance because of its high computational load and power consumption. This paper proposes a hardware design to accelerate the computation of background subtraction with low power consumption. A real-time background subtraction method is designed with a frame-buffer scheme and function partitioning to improve throughput, and implemented in Verilog HDL on an FPGA. The design parallelizes the computations of background update and subtraction with a seven-stage pipeline. A stripe-based morphological processing scheme that accounts for the completion of detected objects is devised. Simulation results for VGA-resolution videos on a low-end FPGA device show a throughput of 368 fps for the real-time background subtraction module alone, and 51 fps for the whole system including off-chip memory access. Real-time efficiency with low power consumption and low resource utilization is thus demonstrated.
Keywords: background subtraction; moving object detection; Field Programmable Gate Array (FPGA); hardware acceleration
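A common baseline for the background update/subtraction pair the pipeline parallelizes is a running-average model: threshold the difference against the background to get the foreground mask, then blend the new frame into the background. A minimal sketch, assuming that baseline (the paper's exact update rule is not given in the abstract, and `alpha`/`thresh` values are illustrative):

```python
import numpy as np

def bg_subtract_step(bg, frame, alpha=0.05, thresh=25):
    """One frame of running-average background subtraction:
    foreground = |frame - bg| > thresh, then bg <- (1-a)*bg + a*frame."""
    mask = np.abs(frame.astype(int) - bg.astype(int)) > thresh  # foreground mask
    bg = (1 - alpha) * bg + alpha * frame                        # background update
    return bg, mask

bg = np.full((2, 2), 100.0)
frame = np.array([[100, 101], [180, 99]], dtype=np.uint8)
bg, mask = bg_subtract_step(bg, frame)
print(mask)  # only the pixel that jumped 100 -> 180 is flagged foreground
```

The subtraction and the update touch each pixel independently, which is what makes the seven-stage pixel pipeline in the paper possible.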
Research on High-Precision Stochastic Computing VLSI Structures for Deep Neural Network Accelerators
10
Authors: WU Jingguo, ZHU Jingwei, XIONG Xiankui, YAO Haidong, WANG Chengchen, CHEN Yun. ZTE Communications, 2024, No. 4, pp. 9-17 (9 pages)
Abstract: Deep neural networks (DNNs) are widely used in image recognition, image classification, and other fields. However, as model size increases, DNN hardware accelerators face the challenge of higher area overhead and energy consumption. In recent years, stochastic computing (SC) has been considered a way to realize deep neural networks while reducing hardware consumption. A probabilistic compensation algorithm is proposed to solve the accuracy problem of stochastic calculation, and a fully parallel neural network accelerator based on a deterministic method is designed. Software simulation shows that the accuracy of the probabilistic compensation algorithm on the CIFAR-10 dataset is 95.32%, which is 14.98% higher than that of the traditional SC algorithm. The accuracy of the deterministic algorithm on the CIFAR-10 dataset is 95.06%, which is 14.72% higher than that of the traditional SC algorithm. Very Large Scale Integration (VLSI) hardware tests show that the normalized energy efficiency of the fully parallel neural network accelerator based on the deterministic method improves by 31% compared with a circuit based on binary computing.
Keywords: stochastic computing; hardware accelerator; deep neural network
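The hardware appeal of stochastic computing is that arithmetic collapses to single gates: with values encoded as the 1-density of a random bitstream (unipolar coding), a single AND gate multiplies. A minimal sketch of that encoding and multiply (the paper's compensation and deterministic schemes, which fight exactly the sampling error visible here, are not reproduced):

```python
import random

def to_stream(p, n, rng):
    """Unipolar SC encoding: value p in [0,1] as an n-bit Bernoulli stream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(42)
n = 100_000
a = to_stream(0.5, n, rng)
b = to_stream(0.6, n, rng)
prod = [x & y for x, y in zip(a, b)]  # one AND gate per bit multiplies the streams
print(sum(prod) / n)                  # ~ 0.5 * 0.6 = 0.3, up to sampling noise
```

The residual error shrinks only as 1/sqrt(n), which is the accuracy problem that motivates the probabilistic compensation and deterministic methods in the paper.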
A Hardware Accelerator for Homomorphic Computation Targeting the RNS-CKKS Scheme
11
Authors: 陈星辰, 郭家怡, 陈弟虎, 粟涛. 计算机应用研究, PKU Core, 2026, No. 1, pp. 208-215 (8 pages)
Abstract: To address the large data storage and transmission overhead and the low computational efficiency of fully homomorphic encryption applications, this paper takes the mainstream RNS-CKKS scheme as its subject and proposes a hardware accelerator for homomorphic computation. The accelerator features a divide-and-conquer permutation network designed for the automorphism operator, achieving efficient, conflict-free data scheduling and routing. By adopting an on-chip/off-chip cooperative storage strategy and restructuring the computation flow, it effectively reduces the on-chip buffer requirement of the key-switching operation and hides off-chip latency. To further improve computational and resource efficiency, a unified compute array is constructed and the on-chip buffer structure is optimized. Experimental results on an FPGA show that the design achieves an 8.68x-56.2x speedup over the OpenFHE software library; compared with similar hardware acceleration schemes, it achieves a 1.18x-1.53x speedup on ciphertext-ciphertext homomorphic multiplication and a 1.10x-4.98x improvement in area efficiency, while also offering advantages in configurability. This work facilitates efficient hardware deployment of homomorphic encryption schemes.
Keywords: fully homomorphic encryption; RNS-CKKS scheme; hardware accelerator; configurable architecture; FPGA
FPGA-Based Acceleration of the Davidon-Fletcher-Powell Quasi-Newton Optimization Method (Cited by 2)
12
Authors: Liu Qiang, Sang Ruoyu, Zhang Qijun. Transactions of Tianjin University, EI/CAS, 2016, No. 5, pp. 381-387 (7 pages)
Abstract: Quasi-Newton methods are the most widely used methods for finding local maxima and minima of functions in engineering practice. However, they involve a large number of matrix and vector operations, which are computationally intensive and require long processing times. Recently, with increasing density and more arithmetic cores, the field programmable gate array (FPGA) has become an attractive alternative for accelerating scientific computation. This paper accelerates the Davidon-Fletcher-Powell quasi-Newton (DFP-QN) method with a customized, pipelined hardware implementation on FPGAs. Experimental results demonstrate that, compared with a software implementation, a speed-up of up to 17 times can be achieved by the proposed hardware implementation.
Keywords: quasi-Newton method; hardware acceleration; field programmable gate array
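The matrix-heavy kernel an FPGA pipeline accelerates here is the DFP inverse-Hessian update, H+ = H + s s^T / (s^T y) - (H y)(H y)^T / (y^T H y), where s is the step and y the gradient change. A reference sketch of just that update (the paper's pipelining and line search are not reproduced); it verifiably enforces the secant condition H+ y = s:

```python
import numpy as np

def dfp_update(H, s, y):
    """DFP inverse-Hessian update:
    H+ = H + s s^T / (s^T y) - (H y)(H y)^T / (y^T H y)."""
    Hy = H @ y
    return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)

H = np.eye(3)                      # initial inverse-Hessian estimate
s = np.array([1.0, 0.5, -0.2])     # step: x_{k+1} - x_k
y = np.array([2.0, 1.0, 0.5])      # gradient change: g_{k+1} - g_k
H_new = dfp_update(H, s, y)
print(np.allclose(H_new @ y, s))   # True: the update satisfies the secant condition
```

The two rank-one terms are outer products and dot products, the dense, regular arithmetic that pipelines well in hardware.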
Real-time pre-processing system with hardware accelerator for mobile core networks 被引量:1
13
Authors: Mian CHENG, Jin-shu SU, Jing XU. Frontiers of Information Technology & Electronic Engineering, SCIE/EI/CSCD, 2017, No. 11, pp. 1720-1731 (12 pages)
Abstract: With the rapidly increasing number of mobile devices being used as essential terminals or platforms for communication, security threats now target the whole telecommunication infrastructure and are becoming increasingly serious. Network probing tools, deployed as bypass devices at a mobile core network gateway, can collect and analyze all the traffic for security detection. However, due to ever-increasing link speeds, it is of vital importance to offload the processing pressure of the detection system. In this paper, we design and evaluate a real-time pre-processing system that comprises a hardware accelerator and a multi-core processor. The implemented prototype can quickly restore each encapsulated packet and effectively distribute traffic to multiple back-end detection systems. We demonstrate the prototype in a well-deployed network environment with large volumes of real data. Experimental results show that our system achieves at least 18 Gb/s with no packet loss across all kinds of communication protocols.
Keywords: mobile network; real-time processing; hardware acceleration
Design of an FPGA-Based MobileNetV1 Object Detection Accelerator (Cited by 3)
14
Authors: 严飞, 郑绪文, 孟川, 李楚, 刘银萍. 现代电子技术, PKU Core, 2025, No. 1, pp. 151-156 (6 pages)
Abstract: Convolutional neural networks are commonly used for object detection, but their huge parameter counts and computation loads lead to slow detection, high power consumption, and difficulty deploying them on hardware platforms. This paper therefore proposes an approach that accelerates MobileNetV1 object detection on a fused CPU-FPGA architecture. First, the width and resolution hyper-parameters are tuned and the network parameters are converted to fixed point, reducing the model's parameter count and computation. Second, the convolution and batch-normalization layers are fused, reducing network complexity and increasing computation speed. Then, an eight-channel inter-kernel parallel convolution engine is designed, in which each channel implements the convolution with line-buffer multiplication and an adder tree. Finally, exploiting FPGA parallel computing and pipelining, the eight-channel convolution engine is reused appropriately to perform the three different types of convolution, reducing hardware resource usage and power consumption. Experimental results show that the design accelerates MobileNetV1 object detection in hardware, reaching a frame rate of 56.7 f/s at a power consumption of only 0.603 W.
Keywords: convolutional neural network; object detection; FPGA; MobileNetV1; parallel computing; hardware acceleration
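The conv-BN fusion step mentioned above is a standard offline transform: per output channel, with scale s = gamma / sqrt(var + eps), the fused weights are w' = w * s and the fused bias is b' = b * s + (beta - mean * s), so batch normalization disappears at inference time. A sketch with hypothetical one-channel values:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN (gamma, beta, running mean/var) into conv weights and bias,
    per output channel, so no separate BN layer runs at inference."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None], b * scale + (beta - mean * scale)

# one output channel with 3 weights (hypothetical values)
w, b = np.array([[1.0, -2.0, 0.5]]), np.array([0.1])
gamma, beta = np.array([2.0]), np.array([0.3])
mean, var = np.array([0.4]), np.array([1.0])

w_f, b_f = fuse_conv_bn(w, b, gamma, beta, mean, var)
x = np.array([0.2, 0.7, -1.0])
y_fused = w_f[0] @ x + b_f[0]
y_ref = ((w[0] @ x + b[0]) - mean[0]) / np.sqrt(var[0] + 1e-5) * gamma[0] + beta[0]
print(np.isclose(y_fused, y_ref))  # True: fused conv == conv followed by BN
```

On an FPGA this removes a whole per-pixel multiply-add stage, which is why accelerator papers routinely fuse before quantizing.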
Research and Implementation of Efficient Boolean Operations on Complex 3D Solids (Cited by 1)
15
Authors: 张永亮, 王家润, 吴乾坤. 计算机应用与软件, PKU Core, 2025, No. 1, pp. 249-257, 327 (10 pages)
Abstract: Efficient, robust Boolean operations on complex 3D solids are a key difficulty in geographic information systems. To address it, a software-hardware cooperative acceleration framework is proposed. At the software level, several algorithmic acceleration and optimization techniques are adopted: dimension-reduced collision detection; rules for retaining or discarding 3D polygons; a construction method for newly generated 3D polygons; efficient intersection of 3D line segments with 3D polygons; and efficient containment tests of 3D points or 3D polygons against 3D solids. An efficient computation framework is built on these techniques. At the hardware level, the many-core computing power of the GPU is used for acceleration, yielding the software-hardware cooperative framework. Experiments show the framework is efficient and robust: compared with existing methods, the software-level framework improves efficiency by about 3x, and the software-hardware cooperative framework improves it by roughly a further 3x.
Keywords: acceleration and optimization techniques; dimension-reduced collision detection; efficiency; many-core computing; software-hardware cooperative acceleration
Real-Time Object Tracking with an Improved Camshift Algorithm (Cited by 1)
16
Authors: 严飞, 徐龙, 陈佳宇, 姜栋, 刘佳. 计算机工程与设计, PKU Core, 2025, No. 1, pp. 314-320, F0003 (8 pages)
Abstract: To solve the problems of the Camshift tracking algorithm falling into local maxima when the target is occluded, losing fast-moving targets, and suffering reduced accuracy under illumination changes, an improved Camshift tracking algorithm is proposed. Adaptive weights and H-channel features are used to build the target template, the Kalman filter is fused in, and a Bhattacharyya-distance occlusion test is introduced. When the target is not occluded, the Kalman prediction adjusts the tracking search region; when it is occluded, tracking falls back on the Kalman prediction. Experimental results show that the improved algorithm, deployed on an FPGA hardware platform, accurately tracks fast-moving and occluded targets, with a theoretical tracking frame rate of 98.17 frames/s at 1920x1080 resolution and an average tracking overlap of 84.68% on 1080p@60 Hz and various other video resolutions.
Keywords: object tracking; real time; image processing; hardware acceleration; Kalman filter; histogram; FPGA
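The occlusion test above compares the template histogram with the candidate-window histogram via the Bhattacharyya distance, d = sqrt(1 - sum_i sqrt(p_i * q_i)) for normalized histograms: near 0 means a good match, and a large value flags occlusion. A sketch with toy 3-bin histograms (the paper's bin count and switching threshold are not given, so the 0.4 threshold below is an assumption):

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two normalized histograms:
    sqrt(1 - sum_i sqrt(p_i * q_i)); 0 = identical, 1 = disjoint support."""
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

template  = np.array([0.5, 0.3, 0.2])    # model H-channel histogram
candidate = np.array([0.5, 0.3, 0.2])    # current window: same target
occluder  = np.array([0.0, 0.1, 0.9])    # current window: something else

print(bhattacharyya_distance(template, candidate))  # ~0: match, keep Camshift
d = bhattacharyya_distance(template, occluder)
print(d > 0.4)  # large distance -> declare occlusion, fall back on Kalman
```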
An FPGA-Based Heterogeneous Acceleration System for SM4
17
Authors: 张全新, 李可, 邵雨洁, 谭毓安. 信息网络安全, PKU Core, 2025, No. 7, pp. 1021-1031 (11 pages)
Abstract: The Chinese national SM4 algorithm is an encryption algorithm widely used in the WAPI wireless network standard. Current research on SM4 encryption and decryption mainly focuses on optimizing hardware implementation structures to improve throughput and security, while the development of big data and 5G communication places ever higher demands on the bandwidth and real-time performance of data encryption and decryption. Against this background, this paper proposes an FPGA-based heterogeneous acceleration system for SM4 that implements the algorithm in hardware and optimizes its encryption/decryption performance; adopts a streaming high-speed data-transfer architecture that supports multiple SM4 cores working in parallel to fully utilize system bandwidth; and designs configurable interfaces connecting the SM4 cores to the transfer architecture for sufficient flexibility. The system was implemented on a Xilinx XCVU9P FPGA and supports changing the SM4 load and mode at any time. Tests show a maximum SM4 operating frequency of 462 MHz, a system throughput of up to 92 Gbit/s, and a latency of only 266 microseconds. Compared with existing work, the system achieves a higher SM4 operating frequency and system throughput, meeting high-bandwidth, low-latency SM4 acceleration needs.
Keywords: SM4 algorithm; FPGA; hardware acceleration; transfer architecture
Hybrid Optimization Design of a High-Performance Embedded YOLOv3-tiny Hardware Accelerator
18
Authors: 谭会生, 肖鑫凯, 卿翔. 半导体技术, CAS, PKU Core, 2025, No. 1, pp. 55-63 (9 pages)
Abstract: To address the algorithm complexity, execution speed, and hardware resource constraints of deploying neural networks on embedded devices, a high-performance YOLOv3-tiny hardware accelerator was designed on the Zynq heterogeneous platform. For algorithm optimization, the convolution and batch-normalization layers are fused and an 8-bit quantization algorithm is used, simplifying the algorithm flow. For the accelerator architecture, a dynamically configurable inter-layer pipeline and an efficient data-transfer scheme shorten inference time and reduce memory resource consumption. For forward inference, an 8-channel parallel pipelined convolution module based on loop unrolling is designed for the convolution computation; a step-wise computation strategy efficiently processes continuous data streams for pooling; and a 2x upsampling method based on data replication is proposed. Experimental results show a forward inference time of 232 ms, a power consumption of only 2.29 W, a system operating frequency of 200 MHz, and an effective compute rate of 23.97 GOPS.
Keywords: YOLOv3-tiny network; heterogeneous platform; hardware accelerator; dynamically configurable architecture; hardware hybrid optimization; data-replication upsampling
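The data-replication 2x upsampling mentioned above is nearest-neighbor upsampling: each pixel is copied into a 2x2 block, so the hardware only duplicates data and never interpolates. A reference sketch:

```python
import numpy as np

def upsample2x(x):
    """2x nearest-neighbor upsampling by data replication: each element
    becomes a 2x2 block (duplicate rows, then duplicate columns)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

x = np.array([[1, 2],
              [3, 4]])
print(upsample2x(x))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```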
Real-Time FPGA-Based Detection of Power Device Packaging Defects
19
Authors: 谭会生, 吴文志, 张杰. 半导体技术, PKU Core, 2025, No. 10, pp. 1048-1056 (9 pages)
Abstract: To address the poor real-time performance and high computational resource consumption of machine-vision-based packaging defect detection for power devices, a real-time packaging defect detector was designed on a field programmable gate array (FPGA). First, a lightweight Mini-DSCNet convolutional network based on depthwise separable convolution (DSConv) is proposed, replacing standard convolution with depthwise and pointwise convolution. Simulation results show the model's floating-point operations (FLOPs) and parameter count (Params) are about 4.375% and 0.021% of MobileNetV1's, respectively, with an accuracy of about 91.80%. Second, a fixed-point quantization algorithm converts the floating-point weights into signed fixed-point numbers; tests show an average error of about 0.483%. Finally, a multi-channel parallel pipeline architecture is used to reduce resource consumption and increase processing speed. Experimental results show that at a 100 MHz clock frequency, the detector's inference is about 17.10x faster than a CPU and 2.47x faster than a GPU, significantly improving the real-time performance of power device packaging defect detection.
Keywords: power device; packaging defect detection; Mini-DSCNet convolutional network; field programmable gate array (FPGA); hardware acceleration
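The fixed-point quantization step converts each float weight to a signed fixed-point code and measures the resulting average error. A sketch assuming an 8-bit Q0.7 format and random weights in [-1, 1) (the paper's word length, fraction bits, and weight range are not stated, so these are illustrative choices):

```python
import numpy as np

def quantize_fixed(w, frac_bits=7, word_bits=8):
    """Quantize floats to signed fixed point (word_bits total, frac_bits
    fractional) and back; also return the mean relative error."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    q = np.clip(np.round(w * scale), lo, hi)   # integer codes on the FPGA
    wq = q / scale                             # dequantized values
    err = np.mean(np.abs(wq - w)) / np.mean(np.abs(w))
    return wq, err

rng = np.random.default_rng(1)
w = rng.uniform(-1, 1, 1000)   # hypothetical trained weights
wq, err = quantize_fixed(w)
print(err)  # small average relative error, well under 1% for 8-bit Q0.7
```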
A Lightweight Neural Network Model for Nuclide Identification from Nuclear Pulse Peak Sequences and Its FPGA Acceleration
20
Authors: 李超, 石睿, 曾树鑫, 徐鑫华, 魏雨鸿, 庹先国. 强激光与粒子束, PKU Core, 2025, No. 5, pp. 139-149 (11 pages)
Abstract: Radioactive nuclides are widely used in nuclear medicine, nuclear security, and nondestructive testing, and their accurate identification is the basis of qualitative radionuclide detection. In portable nuclide identifiers, traditional spectrum-analysis methods suffer from high latency and low identification rates. This paper proposes a lightweight neural network model for nuclide identification based on nuclear pulse peak sequences, together with an FPGA hardware acceleration method. By introducing depthwise separable convolutions and inverted residual blocks, and replacing the traditional fully connected layer with global average pooling, a lightweight, efficient network model is built. For the training dataset, a NaI(Tl) detector model was built with the Monte Carlo toolkit Geant4 to obtain simulated spectra, from which a nuclear pulse signal simulator generated pulse sequences, producing data for 16 nuclides. Finally, the trained model was deployed on a PYNQ-Z2 heterogeneous chip with quantization, fusion, and parallel-computation optimizations for acceleration. Experimental results show an identification accuracy of 98.3%, 13.2% higher than a traditional convolutional neural network model, with a parameter count of only 2,128. After FPGA optimization and acceleration, a single identification takes 0.273 ms at a power consumption of 1.94 W.
Keywords: nuclide identification; nuclear signal; neural network; FPGA; hardware acceleration
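Replacing the fully connected layer with global average pooling, as this model does, collapses each channel's feature map to its mean, so the classifier head needs no large weight matrix, which is a major reason the parameter count stays at 2,128. A reference sketch:

```python
import numpy as np

def global_avg_pool(fmap):
    """Global average pooling: reduce a (C, H, W) feature map to a length-C
    vector of per-channel means, replacing a flattened FC-layer input."""
    return fmap.mean(axis=(1, 2))

fmap = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)  # (C=2, H=2, W=3)
print(global_avg_pool(fmap))  # [2.5, 8.5]: one scalar per channel
```

In hardware this is just one accumulator and one divide per channel, instead of the multiply-accumulate array a fully connected layer would need.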