Journal Articles
581 articles found.
1. A Novel Quantization and Model Compression Approach for Hardware Accelerators in Edge Computing
Authors: Fangzhou He, Ke Ding, Dingjiang Yan, Jie Li, Jiajun Wang, Mingzhe Chen. Computers, Materials & Continua (SCIE, EI), 2024, No. 8, pp. 3021-3045.
Massive computational complexity and memory requirements of artificial intelligence models impede their deployability on edge computing devices of the Internet of Things (IoT). While Power-of-Two (PoT) quantization has been proposed to improve the efficiency of edge inference for Deep Neural Networks (DNNs), existing PoT schemes require a huge amount of bit-wise manipulation, have large memory overhead, and their efficiency is bounded by the bottleneck of computation latency and memory footprint. To tackle this challenge, we present an efficient inference approach based on PoT quantization and model compression. An integer-only scalar PoT quantization (IOS-PoT) is designed jointly with a distribution-loss regularizer, wherein the regularizer minimizes quantization errors and training disturbances. Additionally, two-stage model compression is developed to effectively reduce memory requirements and alleviate bandwidth usage in communications of networked heterogeneous learning systems. The product look-up table (P-LUT) inference scheme replaces bit-shifting with only indexing and addition operations, achieving low-latency computation and enabling efficient edge accelerators. Finally, comprehensive experiments on Residual Networks (ResNets) and efficient architectures with the Canadian Institute for Advanced Research (CIFAR), ImageNet, and Real-world Affective Faces Database (RAF-DB) datasets indicate that our approach achieves a 2×-10× reduction in both weight size and computation cost compared with state-of-the-art methods. A P-LUT accelerator prototype implemented on the Xilinx KV260 Field Programmable Gate Array (FPGA) platform for accelerating convolution operations reduces memory footprint by 1.45× and achieves more than 3× power efficiency and 2× resource efficiency compared with the conventional bit-shifting scheme.
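To make the PoT idea concrete, here is a minimal Python sketch of power-of-two weight quantization plus a per-exponent product table standing in for a P-LUT; the rounding rule, exponent range, and table layout are illustrative assumptions, not the paper's IOS-PoT design.

```python
import math

def pot_quantize(w, e_min=-6, e_max=0):
    """Round |w| to the nearest power of two in [2**e_min, 2**e_max].

    Returns (sign, exponent); sign 0 encodes a zero weight.
    """
    if w == 0.0:
        return 0, 0
    e = max(e_min, min(e_max, round(math.log2(abs(w)))))
    return (1 if w > 0 else -1), e

def plut_dot(acts, q_weights, e_min=-6, e_max=0):
    """Dot product where each multiply is a table lookup, not a bit shift."""
    total = 0.0
    for a, (sign, e) in zip(acts, q_weights):
        if sign == 0:
            continue
        # Table of a * 2**k for every representable exponent; hardware
        # would index one precomputed P-LUT entry instead of shifting.
        table = {k: a * (2.0 ** k) for k in range(e_min, e_max + 1)}
        total += sign * table[e]
    return total

q = [pot_quantize(w) for w in [0.24, -0.51, 0.0, 0.96]]
print(q)                                  # [(1, -2), (-1, -1), (0, 0), (1, 0)]
print(plut_dot([1.0, 2.0, 3.0, 4.0], q))  # 0.25 - 1.0 + 4.0 = 3.25
```

Because every weight is a signed power of two, the whole multiply reduces to a small indexed table of pre-scaled activations, which is the latency advantage the abstract claims over bit-shifting.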
Keywords: edge computing, model compression, hardware accelerator, power-of-two quantization
2. An FPGA-Based Resource-Saving Hardware Accelerator for Deep Neural Network
Authors: Han Jia, Xuecheng Zou. International Journal of Intelligence Science, 2021, No. 2, pp. 57-69.
With the development of computer vision research, deep neural networks (DNNs) have been widely applied, thanks to their state-of-the-art performance on image and video processing tasks, in applications such as autonomous vehicles, weather forecasting, counter-terrorism, surveillance, and traffic management. However, to achieve such performance, DNN models have become increasingly complicated and deeper, resulting in heavy computational stress, so general-purpose central processing unit (CPU) processors can no longer meet real-time application requirements. To deal with this bottleneck, research on hardware acceleration for DNNs has attracted great attention; to serve varied real-life applications, DNN acceleration solutions mainly address hardware acceleration under intense memory and calculation demands. In this paper, a novel resource-saving architecture based on a Field Programmable Gate Array (FPGA) is proposed. Thanks to a newly designed processing element (PE), the proposed architecture achieves good performance with extremely limited calculation resources. On-chip buffer allocation helps enhance resource-saving performance on memory, and the accelerator further improves its performance by exploiting the sparsity of the input feature map. Compared to other state-of-the-art FPGA-based solutions, our architecture achieves good performance with quite limited resource consumption, thus fully meeting the requirements of real-time applications.
Keywords: deep neural network, resource-saving, hardware accelerator, data flow
3. THUBrachy: fast Monte Carlo dose calculation tool accelerated by heterogeneous hardware for high-dose-rate brachytherapy (cited by 1)
Authors: An-Kang Hu, Rui Qiu, Huan Liu, Zhen Wu, Chun-Yan Li, Hui Zhang, Jun-Li Li, Rui-Jie Yang. Nuclear Science and Techniques (SCIE, EI, CAS, CSCD), 2021, No. 3, pp. 107-119.
The Monte Carlo (MC) simulation is regarded as the gold standard for dose calculation in brachytherapy, but it consumes a large amount of computing resources. The development of heterogeneous computing makes it possible to substantially accelerate calculations with hardware accelerators. Accordingly, this study develops a fast MC tool, called THUBrachy, which can be accelerated by several types of hardware accelerators. THUBrachy can simulate photons with energy less than 3 MeV and considers all photon interactions in this energy range. It was benchmarked against the American Association of Physicists in Medicine Task Group No. 43 Report using a water phantom and validated against Geant4 using a clinical case. A performance test on the clinical case shows that a multicore central processing unit, Intel Xeon Phi, and graphics processing unit (GPU) can all efficiently accelerate the simulation. GPU-accelerated THUBrachy is the fastest version, 200 times faster than the serial version and approximately 500 times faster than Geant4. The proposed tool shows great potential for fast and accurate dose calculations in clinical applications.
Keywords: high-dose-rate brachytherapy, Monte Carlo, heterogeneous computing, hardware accelerators
4. System-on-a-Chip (SoC) Based Hardware Acceleration for Video Codec
Authors: Xinwei Niu, Jeffrey Fan. Optics and Photonics Journal, 2013, No. 2, pp. 112-117.
Nowadays, from home monitoring to large airport security, digital video surveillance systems are widely used, and they usually require streaming video processing capability. As an advanced video coding method, H.264 reduces video data dramatically (usually by 70× or more), but coding and decoding H.264 video incurs computational overhead. In this paper, a System-on-a-Chip (SoC) based hardware acceleration solution for video codecs is proposed, which can also be applied to other software applications. The characteristics of the video codec are analyzed with a profiling tool. The Hadamard function, the bottleneck of H.264, is identified not only by execution time but also by two further attributes, cycles per loop and loop rounds. A co-processor approach is applied to accelerate the Hadamard function by moving it to hardware. Performance improvement, resource cost, and energy consumption are compared and analyzed. Experimental results indicate that a 76.5% energy reduction and an 8.09× speedup can be reached after balancing these three key factors.
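The Hadamard bottleneck the profiler isolates is the 4×4 transform inside H.264's sum-of-absolute-transformed-differences (SATD) cost metric. A plain-Python sketch of that computation (standard butterflies, not the paper's co-processor) shows the arithmetic that gets moved to hardware:

```python
def hadamard4(v):
    """1-D 4-point Hadamard transform via butterflies (no normalization)."""
    a, b, c, d = v
    s0, s1, s2, s3 = a + b, a - b, c + d, c - d
    return [s0 + s2, s1 + s3, s0 - s2, s1 - s3]

def satd4x4(block):
    """Sum of absolute transformed differences of a 4x4 residual block."""
    rows = [hadamard4(r) for r in block]              # transform each row
    cols = [hadamard4([rows[i][j] for i in range(4)]) # then each column
            for j in range(4)]
    return sum(abs(x) for col in cols for x in col)

residual = [[2, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
print(satd4x4(residual))  # 32
```

The transform is pure adds and subtracts, which is exactly why it maps well onto a small co-processor.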
Keywords: SoC, software profiling, hardware acceleration, video codec
5. Automatic Control System of Ion Electrostatic Accelerator and Anti-Interference Measures (cited by 1)
Authors: 孙振武, 霍裕平, 刘根成, 李玉晓, 李涛. Plasma Science and Technology (SCIE, EI, CAS, CSCD), 2007, No. 1, pp. 101-105.
An automatic control system for the electrostatic accelerator has been developed by adopting Programmable Logic Controller (PLC) technology, infrared and optical-fibre transmission, and network communication, with the purpose of improving the intelligence level of the accelerator and enhancing the ability to monitor, collect, and record parameters. In view of the control system's structure, anti-interference measures have been adopted after analyzing the interference sources. The hardware measures include controlling the position of the corona needle, using surge arresters, shielding, ground connection, and voltage stabilization. The software measures involve interlock protection, soft-spacing, time delays, and diagnostic and protective programs. The electromagnetic compatibility of the control system has thus been effectively improved.
Keywords: electrostatic accelerator, computer control, software, hardware, electromagnetic interference
6. A Dynamically Reconfigurable Accelerator Design Using a Sparse-Winograd Decomposition Algorithm for CNNs
Authors: Yunping Zhao, Jianzhuang Lu, Xiaowen Chen. Computers, Materials & Continua (SCIE, EI), 2021, No. 1, pp. 517-535.
Convolutional Neural Networks (CNNs) are widely used in many fields. Because of their high-throughput, compute-intensive characteristics, an increasing number of researchers are focusing on how to improve the computational efficiency, hardware utilization, and flexibility of CNN hardware accelerators. Accordingly, this paper proposes a dynamically reconfigurable accelerator architecture that implements a Sparse-Winograd F(2×2, 3×3)-based high-parallelism hardware architecture. This approach not only eliminates the pre-calculation complexity associated with the Winograd algorithm, reducing the difficulty of hardware implementation, but also greatly improves the flexibility of the hardware; as a result, the accelerator can perform conventional convolution, grouped convolution (GCONV), or depthwise separable convolution (DSC) on the same hardware architecture. Our experimental results show that the accelerator achieves a 3×-4.14× speedup on VGG-16 and MobileNet V1 compared with designs that do not use the acceleration algorithm. Moreover, compared with previous designs using the traditional Winograd algorithm, it achieves a 1.4×-1.8× speedup, while multiplier efficiency improves by up to 142%.
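The F(2×2, 3×3) kernel is the 2-D nesting of the classic 1-D Winograd identity F(2, 3), which produces two convolution outputs from four multiplies instead of six. A minimal sketch of the 1-D case, using the textbook transform coefficients (not the paper's sparse variant):

```python
def conv_direct(d, g):
    """Direct 1-D convolution: two outputs, six multiplies."""
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs from four multiplies."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

print(conv_direct([1, 2, 3, 4], [1, 0, -1]))   # [-2, -2]
print(winograd_f23([1, 2, 3, 4], [1, 0, -1]))  # [-2.0, -2.0]
```

The filter-side factors (g[0]+g[1]+g[2])/2 and (g[0]-g[1]+g[2])/2 depend only on the weights, which is the pre-calculation step the paper's design seeks to simplify.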
Keywords: high-performance computing, accelerator architecture, hardware
7. Neural Networks on an FPGA and Hardware-Friendly Activation Functions
Authors: Jiong Si, Sarah L. Harris, Evangelos Yfantis. Journal of Computer and Communications, 2020, No. 12, pp. 251-277.
This paper describes our implementation of several neural networks built on a field programmable gate array (FPGA) and used to recognize a handwritten digit dataset, the Modified National Institute of Standards and Technology (MNIST) database. We also propose a novel hardware-friendly activation function, the dynamic Rectified Linear Unit (D-ReLU), that achieves higher performance than traditional activation functions at no cost to accuracy. We built a 2-layer online-training multilayer perceptron (MLP) neural network on an FPGA with varying data widths. Reducing the data width from 8 to 4 bits only reduces prediction accuracy by 11%, while the FPGA area decreases by 41%. Compared to networks that use sigmoid functions, our proposed D-ReLU function uses 24%-41% less area with no loss of prediction accuracy. Further reducing the data width of the 3-layer networks from 8 to 4 bits decreases prediction accuracy by only 3%-5%, with area reduced by 9%-28%. Moreover, the FPGA solutions execute 29 times faster, despite running at a 60× lower clock rate. Thus, FPGA implementations of neural networks offer a high-performance, low-power alternative to traditional software methods, and our novel D-ReLU activation function offers additional performance and power savings.
Keywords: deep learning, D-ReLU, dynamic ReLU, FPGA, hardware acceleration, activation function
8. FPGA Accelerators for Computing Interatomic Potential-Based Molecular Dynamics Simulation for Gold Nanoparticles: Exploring Different Communication Protocols
Authors: Ankitkumar Patel, Srivathsan Vasudevan, Satya Bulusu. Computers, Materials & Continua (SCIE, EI), 2024, No. 9, pp. 3803-3818.
Molecular Dynamics (MD) simulation for computing the Interatomic Potential (IAP) is a very important High-Performance Computing (HPC) application. MD simulation on particles of experimental relevance takes huge computation time, despite using expensive high-end servers. Heterogeneous computing, a combination of a Field Programmable Gate Array (FPGA) and a computer, is proposed as a solution to compute MD simulations efficiently; such heterogeneous computation requires communication between the FPGA and the computer. One such MD simulation, explained in this paper, is the Artificial Neural Network (ANN)-based IAP computation of gold (Au147 and Au309) nanoparticles. The MD simulation calculates the forces between atoms and the total energy of the chemical system. This work proposes a novel design and implementation of an ANN-IAP-based MD simulation for Au147 and Au309 using Universal Asynchronous Receiver-Transmitter (UART) and Ethernet protocols for communication between the FPGA and the host computer. To improve the latency of MD simulation through heterogeneous computing, both protocols were explored over an MD simulation of 50,000 cycles. In this study, computation times of 17.54 and 18.70 h were achieved with UART and Ethernet, respectively, compared to a conventional server time of 29 h for Au147 nanoparticles. The results pave the way for the development of lab-on-a-chip applications.
Keywords: Ethernet, hardware accelerator, heterogeneous computing, interatomic potential (IAP), MD simulation, peripheral component interconnect express (PCIe), UART
9. Hardware Design of Moving Object Detection on Reconfigurable System
Authors: Hung-Yu Chen, Yuan-Kai Wang. Journal of Computer and Communications, 2016, No. 10, pp. 30-43.
Moving object detection, including background subtraction and morphological processing, is a critical research topic for video surveillance because of its high computational load and power consumption. This paper proposes a hardware design to accelerate background subtraction with low power consumption. A real-time background subtraction method is designed with a frame-buffer scheme and function partitioning to improve throughput, and implemented in Verilog HDL on an FPGA. The design parallelizes background update and subtraction with a seven-stage pipeline, and a stripe-based scheme for morphological processing and for completing detected objects is devised. Simulation results for VGA-resolution videos on a low-end FPGA device show 368 fps throughput for the real-time background subtraction module alone, and 51 fps for the whole system including off-chip memory access. Real-time efficiency with low power consumption and low resource utilization is thus demonstrated.
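The parallel update/subtract step can be pictured per pixel as below. A running-average background model with selective update is a common choice and an assumption here; the abstract does not state the paper's exact model, and the thresholds are placeholders.

```python
def bg_subtract(bg, frame, alpha=0.05, thresh=25):
    """One pipeline step: classify each pixel against the background model,
    then selectively update the model only where the pixel looked static."""
    mask, new_bg = [], []
    for b, p in zip(bg, frame):
        is_fg = abs(p - b) > thresh
        mask.append(1 if is_fg else 0)
        # Foreground pixels leave the model untouched; background pixels
        # blend slowly toward the new observation.
        new_bg.append(b if is_fg else (1 - alpha) * b + alpha * p)
    return mask, new_bg

mask, bg = bg_subtract([10.0, 10.0], [12.0, 200.0])
print(mask)  # [0, 1] -> only the second pixel is foreground
```

Because classification and update read the same pixel pair and have no cross-pixel dependency, the two operations pipeline and parallelize naturally, which is what the seven-stage hardware pipeline exploits.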
Keywords: background subtraction, moving object detection, field programmable gate array (FPGA), hardware acceleration
10. Research on High-Precision Stochastic Computing VLSI Structures for Deep Neural Network Accelerators
Authors: WU Jingguo, ZHU Jingwei, XIONG Xiankui, YAO Haidong, WANG Chengchen, CHEN Yun. ZTE Communications, 2024, No. 4, pp. 9-17.
Deep neural networks (DNNs) are widely used in image recognition, image classification, and other fields. However, as model size increases, DNN hardware accelerators face higher area overhead and energy consumption. In recent years, stochastic computing (SC) has been considered a way to realize deep neural networks while reducing hardware consumption. A probabilistic compensation algorithm is proposed to solve the accuracy problem of stochastic calculation, and a fully parallel neural network accelerator based on a deterministic method is designed. Software simulation shows that the probabilistic compensation algorithm reaches 95.32% accuracy on the CIFAR-10 dataset, 14.98% higher than the traditional SC algorithm, and the deterministic algorithm reaches 95.06%, 14.72% higher than the traditional SC algorithm. Very Large Scale Integration (VLSI) hardware tests show that the normalized energy efficiency of the fully parallel accelerator based on the deterministic method improves by 31% compared with a binary-computing circuit.
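In unipolar stochastic computing, a value in [0, 1] travels as a random bitstream whose mean encodes the value, and a single AND gate multiplies two independent streams. This toy sketch illustrates the baseline SC behavior, and its inherent random-fluctuation error, that the paper's compensation and deterministic methods target; it is not the proposed method itself.

```python
import random

def to_stream(p, n, rng):
    """Encode a probability p in [0, 1] as an n-bit Bernoulli bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(p1, p2, n=4096, seed=1):
    """Multiply two unipolar SC values with one AND gate per bit position."""
    rng = random.Random(seed)
    s1, s2 = to_stream(p1, n, rng), to_stream(p2, n, rng)
    anded = [a & b for a, b in zip(s1, s2)]
    return sum(anded) / n  # the stream mean estimates p1 * p2, with noise

print(round(sc_multiply(0.5, 0.5), 2))  # close to 0.25
```

The estimate's standard deviation shrinks only as 1/sqrt(n), so short streams are cheap but inaccurate; that accuracy/length trade-off motivates compensation and deterministic bitstream schemes.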
Keywords: stochastic computing, hardware accelerator, deep neural network
11. Design of an Integrated Intelligent Navigation Platform with Multi-Source Sensor Fusion and Accelerated ORB Feature Extraction
Authors: 郭迟, 蔡子腾. 《武汉大学学报(理学版)》 (北大核心), 2026, No. 1, pp. 113-124.
Navigation systems rely on sensors to perceive the surrounding environment. Single-sensor navigation systems can no longer meet the demands of complex scenarios, so navigation is moving toward multi-source sensing. In multi-source sensor data fusion, image processing consumes the most time and resources and has the greatest impact on system performance. To address these problems, a hardware control terminal for an intelligent navigation platform is designed: time synchronization based on the Global Navigation Satellite System (GNSS) pulse-per-second (PPS) signal enables multi-source sensor data fusion, and an ORB (Oriented FAST and Rotated BRIEF) feature extraction accelerator for the Simultaneous Localization and Mapping (SLAM) front end speeds up image processing and improves the real-time performance of the SLAM system. Experimental results show that the hardware platform not only supports data acquisition and fusion for GNSS, the Inertial Measurement Unit (IMU), vision, and LiDAR, but also accelerates image ORB feature point extraction. On the ORB feature extraction task, the accelerator reaches 2.7× and 1.8× the frame rate of CPU and GPU implementations, respectively, while consuming only 5.1% and 2.9% of their power.
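ORB keypoint detection is built on the FAST segment test, the kind of per-pixel ring comparison that such accelerators pipeline in hardware. A pure-Python sketch of FAST-9 (the ring offsets are the standard radius-3 circle; the threshold is an illustrative choice, not the platform's configuration):

```python
# 16-pixel Bresenham circle of radius 3 around the candidate pixel.
OFFSETS = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2),
           (1, 3), (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1),
           (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, t=20, arc=9):
    """FAST-9 segment test: corner if >= 9 contiguous ring pixels are all
    brighter than p + t or all darker than p - t."""
    p = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in OFFSETS]
    for sign in (1, -1):
        flags = [1 if sign * (q - p) > t else 0 for q in ring]
        run = best = 0
        for f in flags * 2:  # doubling the list handles circular wrap-around
            run = run + 1 if f else 0
            best = max(best, run)
        if best >= arc:
            return True
    return False

img = [[0] * 7 for _ in range(7)]
img[3][3] = 255              # isolated bright pixel: every ring pixel darker
print(is_fast_corner(img, 3, 3))  # True
```

Each candidate needs only 16 fixed-offset reads and comparisons, so a hardware pipeline can test one pixel per clock, which is where the reported frame-rate gain over CPU and GPU implementations comes from.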
Keywords: intelligent navigation, multi-source sensors, time synchronization, ORB feature extraction, hardware accelerator
12. High-Throughput Video Analytics Optimization for Heterogeneous Edge Computing
Authors: 马丽娜, 严龙, 曹华伟, 梁彦, 叶笑春, 范东睿. 《高技术通讯》 (北大核心), 2026, No. 1, pp. 53-66.
In the big-data era, video accounts for more than 82% of data traffic and is big data in the truest sense. Extracting valuable information from video quickly and effectively to support video-driven information services is therefore highly valuable. To improve concurrent processing capacity and reduce bandwidth cost, video analytics systems are typically deployed in edge computing centers close to the data source and rely on integrated heterogeneous hardware, but existing work fails to fully exploit heterogeneous accelerator chips. This paper addresses these problems with a high-throughput video analytics method for heterogeneous hardware accelerators. Using a decoding optimization strategy and a multi-issue asynchronous execution strategy, the method fully utilizes heterogeneous chip resources, improving single-chip decoding speed by 1.49× and inference speed by 1.44×. The proposed optimizations also ensure good linear scalability: on a resource-constrained heterogeneous edge platform with 12 decoding chips and 18 inference chips, they achieve 17.71× decoding acceleration, 25.52× inference acceleration, and 33.22× end-to-end video content analysis acceleration.
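Multi-issue asynchronous execution amounts to overlapping the decode and inference stages rather than running them in lockstep. A minimal software analogue with a bounded queue is sketched below; the stage functions, queue depth, and single producer/consumer pair are placeholders, not the paper's scheduler, which dispatches across many chips.

```python
import queue
import threading

def run_pipeline(frames, decode, infer, depth=4):
    """Overlap a decode stage and an inference stage with a bounded queue;
    in the paper's setting the two stages run on separate accelerator chips."""
    q = queue.Queue(maxsize=depth)
    results = []

    def producer():
        for f in frames:
            q.put(decode(f))     # decode runs ahead, up to `depth` frames
        q.put(None)              # sentinel: no more frames

    def consumer():
        while (item := q.get()) is not None:
            results.append(infer(item))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_pipeline(range(5), lambda f: f * 2, lambda d: d + 1))  # [1, 3, 5, 7, 9]
```

With both stages busy at once, steady-state throughput is set by the slower stage instead of the sum of the two, which is the basic gain behind the reported per-chip speedups.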
Keywords: high-throughput computing, video processing, edge computing, heterogeneous hardware, decoding acceleration, inference acceleration
13. A General-Purpose Design and Implementation of FPGA-Based CNN Acceleration
Authors: 李卓, 卢辉斌, 高乐, 郭肖楠. 《计算机工程与设计》 (北大核心), 2026, No. 1, pp. 180-186.
To address the limitation that conventional FPGA implementations of convolutional neural network accelerators are constrained by the number of on-chip computing units, a general-purpose design that uses computing units efficiently is proposed. The two most important parts of a CNN, the convolutional layer and the fully connected layer, are optimized: specific preprocessing of input data allows multiple computing units to compute simultaneously, and the convolutional-layer and fully-connected-layer modules are designed for reuse, so computing units are used efficiently. The convolutional-layer module is evaluated with the Roofline model. Comparison with other designs on concrete examples shows that the proposed design achieves high processing speed and strong generality.
Keywords: neural network, hardware acceleration, field programmable gate array, hardware description language, Roofline model, general-purpose design, image recognition
14. A Homomorphic Computation Hardware Accelerator for the RNS-CKKS Scheme
Authors: 陈星辰, 郭家怡, 陈弟虎, 粟涛. 《计算机应用研究》 (北大核心), 2026, No. 1, pp. 208-215.
To address the large data storage and transmission overhead and low computational efficiency of fully homomorphic encryption applications, a homomorphic computation hardware accelerator is proposed for the mainstream RNS-CKKS scheme. The accelerator uses a divide-and-conquer permutation network designed for the automorphism operator, achieving efficient, conflict-free data scheduling and routing. An on-/off-chip cooperative storage strategy and a restructured computation flow effectively reduce the on-chip buffering required for deploying the key-switching operation and hide off-chip latency. To further improve compute and resource efficiency, a unified computing array is constructed and the on-chip buffer structure is optimized. FPGA experiments show an 8.68×-56.2× speedup over the OpenFHE software library; compared with similar hardware accelerators, the design achieves a 1.18×-1.53× speedup on ciphertext-ciphertext homomorphic multiplication and a 1.10×-4.98× improvement in area efficiency, while retaining advantages in configurability. This work supports efficient hardware deployment of homomorphic encryption schemes.
Keywords: fully homomorphic encryption, RNS-CKKS algorithm, hardware accelerator, configurable architecture, field programmable gate array
15. An FPGA-Based Integrated Image Processing System for Humanoid Robots
Authors: 谢天舒, 刘远光, 徐尚睿, 李泽林, 黄永嘉, 张弘 (advisor), 娄永乐 (advisor). 《集成电路与嵌入式系统》, 2026, No. 2, pp. 71-80.
To address the high latency of ARM-based architectures and the limited functionality of single-purpose FPGA solutions, an image processing system based on a cooperative FPGA-PC architecture is designed. The system integrates brightness, contrast, and color temperature adjustment, green-screen matting, skin-color ROI and traffic-light ROI extraction, and invalid-region removal; the host builds a web interface with the Python Flask framework for parameter configuration and result display, and extends the system with gesture recognition. Data exchange over a USB-UART link sustains a stable core-module processing rate of 560 Mb/s, greatly improving image processing efficiency and meeting real-time requirements. The system provides high-quality image input for humanoid robot vision front ends, adapts to low-light and occluded scenes, and has significant application value.
Keywords: FPGA, hardware-software co-design, image processing, gesture recognition, hardware acceleration
16. An Image Recognition System Based on a Domestic FPGA and an ASIC-Like Architecture
Authors: 陈冠夫, 兰小磊, 陈镇城, 张艺豪, 陈林亮, 李赛. 《集成电路与嵌入式系统》, 2026, No. 2, pp. 43-52.
To deploy real-time image recognition on end devices, an embedded system based on a domestic FPGA and a custom ASIC-like architecture is designed and implemented. On the software side, a lightweight neural network, NexusEdgeNet, is proposed; with only 0.184 MB of parameters, it reaches 94.22% recognition accuracy on 39 classes of farmland disease images. On the hardware side, an ASIC-like accelerator described entirely in Verilog HDL is designed; it uses distributed storage, requires no external memory, and supports operators including convolution of arbitrary shape, pooling, and fully connected layers. With near-memory parallel computing, pipelining, sliding convolution windows, and double-buffered storage, the accelerator achieves an inference rate of 399 frames/s on a 中科亿海微 EP6HL130 FPGA, greatly reducing logic resource usage with computing resource utilization up to 85%. The system integrates image acquisition, processing, and display, supports real-time video-stream processing and recognition, and combines high accuracy with excellent real-time performance and resource efficiency, providing a valuable low-cost practice for domestic FPGAs in edge computing.
Keywords: FPGA, neural network processor, hardware acceleration, edge computing, image recognition
17. An FPGA Real-Time Dehazing System Based on an Optimized Dark Channel Prior
Authors: 刘梦雪, 刘成. 《计算机测量与控制》, 2026, No. 1, pp. 157-165.
In the dark channel prior algorithm, bright sky regions violate the prior and bias the estimated transmission; transmission refinement and atmospheric light estimation are complex, and the overall computation is too slow for modern real-time dehazing. An effective solution to these problems is to lighten the algorithm so that it suits real-time hardware. The filter window used to obtain the dark channel is tuned to fit hardware resource limits; the atmospheric light is converged quickly by scanning the input image directly for the maximum pixel value; mean filtering replaces the heavyweight transmission refinement; and the sky is segmented with a brightness threshold, with the lower bound on transmission adapted to the sky proportion, so sky regions are dehazed effectively. The optimized algorithm is hardware-accelerated and implemented on an FPGA, exploiting its parallelism, and a complete real-time dehazing system from MIPI hazy-image capture to HDMI display is deployed on a Xilinx platform. Experiments show that the optimized algorithm outperforms the traditional dark channel prior on both subjective and objective metrics, processes one 1080p high-frame-rate frame in only 33.0165 ms, and passes both dehazing quality and real-time verification.
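The two lightweight steps named in the abstract, a windowed dark channel and taking the brightest input pixel as the atmospheric light, can be sketched as follows; the window size, image layout, and pixel values here are illustrative assumptions.

```python
def dark_channel(img, win=1):
    """Per-pixel min over (R, G, B), then a min filter over a
    (2*win+1) x (2*win+1) neighbourhood, clamped at image borders."""
    h, w = len(img), len(img[0])
    mins = [[min(px) for px in row] for row in img]
    return [[min(mins[j][i]
                 for j in range(max(0, y - win), min(h, y + win + 1))
                 for i in range(max(0, x - win), min(w, x + win + 1)))
             for x in range(w)]
            for y in range(h)]

def atmospheric_light(img):
    """Fast variant from the abstract: brightest channel value in the image."""
    return max(max(px) for row in img for px in row)

hazy = [[(100, 150, 200)] * 3 for _ in range(3)]
hazy[1][1] = (5, 250, 250)
print(dark_channel(hazy))       # the dark centre pixel dominates every window
print(atmospheric_light(hazy))  # 250
```

Both passes are windowed min/max reductions over a pixel stream, which is why they map cleanly onto FPGA line buffers once the window size is fixed.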
Keywords: dark channel prior, real-time dehazing, image processing, hardware acceleration, FPGA
18. Design of a Real-Time, Low-Power Airborne Acoustic Target Localization System Based on the SSC-SRP-PHAT Algorithm
Authors: 杨智勇, 刘佳欣, 肖仲喆, 黄敏. 《声学技术》 (北大核心), 2026, No. 1, pp. 155-163.
Sound source localization systems have long suffered from low accuracy and poor real-time performance. To address both problems, this paper proposes a hardware-software co-design approach that brings sound source localization algorithms to low-power embedded scenarios. The steered response power phase transform (SRP-PHAT) localization algorithm offers high accuracy at high complexity; it is optimized by contracting the search space, yielding a search space contraction steered response power phase transform (SSC-SRP-PHAT) localization method. A hardware-software co-design framework is built: a hardware accelerator IP core is designed and packaged with Vivado HLS to fully exploit the high parallelism of the field programmable gate array (FPGA), while driver software implements data preprocessing, data control, IP core driving, and computation timing. Finally, a co-designed sound source localization system is assembled on a Zynq UltraScale+ MPSoC XCZU7EV hardware platform. The system achieves a real-time localization resolution of 5° at an overall power of 4.55 W, meeting the goal of real-time, low-power localization.
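SRP-PHAT steers and sums PHAT-weighted cross-correlations between microphone pairs over candidate directions. The GCC-PHAT correlation at its core can be sketched with a naive DFT; the circular-delay setup, signal length, and peak-picking here are a toy illustration, not the paper's contracted search.

```python
import cmath
import random

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def gcc_phat_delay(x1, x2):
    """Estimate the (circular) delay of x2 relative to x1 via GCC-PHAT."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    # Cross-power spectrum, whitened so only phase (delay) information remains.
    G = [X1[k].conjugate() * X2[k] for k in range(n)]
    Gp = [g / abs(g) if abs(g) > 1e-12 else 0j for g in G]
    # The inverse transform of the whitened spectrum peaks at the delay.
    r = [sum(Gp[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real
         for t in range(n)]
    return max(range(n), key=lambda t: r[t])

rng = random.Random(0)
s = [rng.random() for _ in range(64)]
delayed = [s[(t - 3) % 64] for t in range(64)]
print(gcc_phat_delay(s, delayed))  # 3
```

A full SRP-PHAT search evaluates such correlations at the expected delays of many candidate directions and picks the direction with the highest summed power; contracting that search space is exactly where SSC-SRP-PHAT saves computation.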
Keywords: sound source localization, steered response power phase transform, hardware-software co-design, hardware acceleration
19. Design of an Efficient Convolutional Neural Network Accelerator Based on ZYNQ
Authors: 龚贵川, 谢良波, 黄倩, 周牧. 《电讯技术》 (北大核心), 2026, No. 2, pp. 259-266.
To address the large storage requirements, high computational complexity, and tight power budgets of deploying convolutional neural networks (CNNs) on edge devices, an efficient CNN accelerator based on ZYNQ is proposed. First, a line buffer with dynamically configurable depth makes efficient use of on-chip storage resources. Second, to improve computational efficiency, a shared Digital Signal Processor (DSP) scheme based on integer quantization is designed, allowing a single DSP to perform two signed INT8 multiplications. Third, a data reordering scheme effectively improves data transfer efficiency, reduces bandwidth accesses, and lowers the complexity of storage addressing. A VGG16 model was deployed on a ZU5EV to test the accelerator; the results show that it achieves a throughput of 133.35 GOPS, a compute density of 0.45 GOPS/DSP, and an energy efficiency of 39.57 GOPS/W.
Keywords: edge devices, convolutional neural network (CNN), hardware accelerator, data reordering
20. A Survey of Neural Network Acceleration Strategies for Diffusion Models
Authors: 邹子涵, 闫鑫明, 郑鹏, 张顺, 蔡浩, 刘波. 《电子与封装》, 2026, No. 1, pp. 68-77.
With the development of neural networks, diffusion models have achieved remarkable results in image generation through their distinctive diffusion mechanism. To reach such task performance, however, they introduce massive computation and complex network structures, which limits their wide application, especially on resource-constrained edge devices. Efficient model acceleration algorithms and hardware-software co-design frameworks for accelerators have become effective solutions. Drawing on a range of diffusion model acceleration and efficient deployment strategies, this survey covers the current state of the art in diffusion model acceleration, from efficient algorithm design for general-purpose computing platforms to hardware-software co-design.
Keywords: diffusion model, model acceleration, edge deployment, hardware-software co-design, efficient inference