Journal Articles
13 articles found
1. A survey of neural network accelerator with software development environments
Authors: Jin Song, Xuemeng Wang, Zhipeng Zhao, Wei Li, Tian Zhi. Journal of Semiconductors, EI CAS CSCD, 2020, Issue 2, pp. 20-28.
In recent years, deep learning algorithms have been widely deployed from cloud servers to terminal devices, and researchers have proposed various neural network accelerators and software development environments. In this article, we review representative neural network accelerators. Since the accelerator and its software form a whole, the corresponding software stack must take the hardware architecture of the specific accelerator into account to deliver end-to-end performance. We then summarize the programming environments of neural network accelerators and the optimizations applied in their software stacks. Finally, we comment on future trends in neural network accelerators and programming environments.
Keywords: neural network accelerator; compiling optimization; programming environments
2. Optimizing deep learning inference on mobile devices with neural network accelerators
Authors: Zeng Xi, Xu Yunlong, Zhi Tian. High Technology Letters, EI CAS, 2019, Issue 4, pp. 417-425.
Deep learning is now widely used in intelligent apps on mobile devices. In pursuit of ultra-low power and latency, integrating neural network accelerators (NNAs) into mobile phones has become a trend. However, conventional deep learning programming frameworks are not well developed to support such devices, leading to low computing efficiency and high memory occupation. To address this problem, a two-stage pipeline is proposed for optimizing deep learning model inference on mobile devices with NNAs in terms of both speed and memory footprint. The first stage reduces computation workload via graph optimization, including splitting and merging nodes. The second stage goes further by optimizing at the compilation level, including kernel fusion and in-advance compilation. The proposed optimizations are evaluated on a commercial mobile phone with an NNA. The experimental results show that the proposed approaches achieve 2.8× to 26× speedup and reduce the memory footprint by up to 75%.
Keywords: machine learning inference; neural network accelerator (NNA); low latency; kernel fusion; in-advance compilation
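The graph-level stage of such a pipeline typically rewrites the computation graph before any code is generated. The sketch below illustrates one such rewrite, fusing chains of adjacent elementwise operators into a single kernel so intermediate tensors never round-trip through device memory; the toy IR, node names, and op set are illustrative assumptions, not the paper's actual framework.

```python
# A minimal sketch of elementwise kernel fusion over a toy graph IR
# (illustrative only; the paper's real IR and op set are not shown here).
from dataclasses import dataclass, field

ELEMENTWISE = {"relu", "add_bias", "scale"}  # assumed fusible op set

@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)

def fuse_elementwise(nodes):
    """Greedily merge chains of elementwise nodes into fused kernels."""
    fused, chain = [], []
    def flush():
        if len(chain) > 1:
            fused.append(Node("+".join(n.name for n in chain),
                              "fused_" + "_".join(n.op for n in chain),
                              chain[0].inputs))
        else:
            fused.extend(chain)
        chain.clear()
    for n in nodes:
        if n.op in ELEMENTWISE:
            chain.append(n)          # keep extending the fusible chain
        else:
            flush()                  # a non-fusible op breaks the chain
            fused.append(n)
    flush()
    return fused

graph = [Node("conv1", "conv2d"), Node("b1", "add_bias"),
         Node("r1", "relu"), Node("conv2", "conv2d")]
for n in fuse_elementwise(graph):
    print(n.op, n.name)
# conv2d conv1 / fused_add_bias_relu b1+r1 / conv2d conv2
```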
3. Design space exploration of neural network accelerator based on transfer learning
Authors: Wu Yuzhang, Zhi Tian, Song Xinkai, Li Xi. High Technology Letters, EI CAS, 2023, Issue 4, pp. 416-426.
With the increasing demand for computational power in artificial intelligence (AI) algorithms, dedicated accelerators have become a necessity. However, the complexity of hardware architectures, the vast design search space, and the variety of accelerator tasks pose significant challenges. Traditional search methods can become prohibitively slow as the search space continues to expand. A design space exploration (DSE) method based on transfer learning is proposed, which reduces the time spent on repeated training and uses multi-task models for different tasks on the same processor. The proposed method accurately predicts the latency and energy consumption associated with neural network accelerator design parameters, enabling faster identification of optimal outcomes than traditional methods. Compared with other DSE methods using a multilayer perceptron (MLP), it also requires less training time. Comparative experiments demonstrate that the proposed method improves the efficiency of DSE without compromising the accuracy of the results.
Keywords: design space exploration (DSE); transfer learning; neural network accelerator; multi-task learning
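The core mechanism here is a learned cost model reused across tasks. The sketch below shows the general pattern under stated assumptions: an MLP maps design parameters (PE count, buffer size, tiling factor are invented stand-ins) to latency and energy, is trained on abundant source-task data, then fine-tuned with few target-task samples; the synthetic cost functions are placeholders, not the paper's measurements or its exact model.

```python
# A minimal sketch of transfer-learning-based DSE; the cost functions and
# design parameters below are invented for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def measure(params, workload_scale):
    """Stand-in for a slow simulator: returns [latency, energy] per design."""
    pes, buf, tile = params.T
    latency = workload_scale / (pes * tile) + 5.0 / buf
    energy = 0.02 * pes * buf + 0.1 * tile
    return np.stack([latency, energy], axis=1)

X_src = rng.uniform(1, 16, size=(500, 3))          # abundant source-task data
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                     warm_start=True, random_state=0)
model.fit(X_src, measure(X_src, workload_scale=100.0))

X_tgt = rng.uniform(1, 16, size=(30, 3))           # few target-task samples
model.max_iter = 300                               # brief fine-tuning pass
model.fit(X_tgt, measure(X_tgt, workload_scale=140.0))  # continues from old weights

# Use the cheap predictor to rank a large candidate pool on the new task.
cands = rng.uniform(1, 16, size=(10000, 3))
pred = model.predict(cands)
best = cands[np.argmin(pred[:, 0] + 0.5 * pred[:, 1])]  # weighted objective
print("predicted best design (PEs, buffer, tile):", best.round(2))
```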
4. GShuttle: Optimizing Memory Access Efficiency for Graph Convolutional Neural Network Accelerators (Cited: 1)
Authors: Li Jiajun, Wang Ke, Zheng Hao, Ahmed Louri. Journal of Computer Science & Technology, SCIE EI CSCD, 2023, Issue 1, pp. 115-127.
Graph convolutional neural networks (GCNs) have emerged as an effective approach to extending deep learning to graph data analytics, but they are computationally challenging given the irregular graphs and the large number of nodes in a graph. GCNs involve chained sparse-dense matrix multiplications with six loops, which results in a large design space for GCN accelerators. Prior work on GCN acceleration either employs limited loop optimization techniques or determines the design variables by random sampling, which can hardly exploit data reuse efficiently, thus degrading system efficiency. To overcome this limitation, this paper proposes GShuttle, a GCN acceleration scheme that maximizes memory access efficiency to achieve high performance and energy efficiency. GShuttle systematically explores loop optimization techniques for GCN acceleration and quantitatively analyzes the design objectives (e.g., required DRAM accesses and SRAM accesses) by analytical calculation based on multiple design variables. GShuttle further employs two approaches, pruned search space sweeping and greedy search, to find the optimal design variables under given design constraints. We demonstrate the efficacy of GShuttle by evaluation on five widely used graph datasets. The experimental simulations show that GShuttle reduces the number of DRAM accesses by a factor of 1.5 and saves energy by a factor of 1.7 compared with state-of-the-art approaches.
Keywords: graph convolutional neural network; memory access; neural network accelerator
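The style of search the abstract describes, analytical cost models swept over pruned candidate sets, can be shown in miniature. The sketch below tiles one dense matmul inside a GCN layer, estimates DRAM traffic with a generic closed-form model, and prunes points that overflow the on-chip buffer; the traffic equations are a textbook tiled-matmul approximation, not GShuttle's actual model.

```python
# A minimal sketch of pruned analytical design-space sweeping
# (generic tiled-matmul cost model, not GShuttle's exact equations).
import itertools

M, N, K = 4096, 128, 256        # dims of a dense matmul inside a GCN layer
SRAM_BYTES = 256 * 1024
WORD = 4

def dram_accesses(tm, tn, tk):
    """Estimated DRAM words for C[M,N] += A[M,K] @ B[K,N] with tiles tm,tn,tk."""
    a = (M * K) * (N // tn)             # A reloaded once per N-tile strip
    b = (K * N) * (M // tm)             # B reloaded once per M-tile strip
    c = 2 * M * N                       # C read + written once
    return a + b + c

def sram_words(tm, tn, tk):
    return tm * tk + tk * tn + tm * tn  # A, B, C tiles resident on chip

best = None
for tm, tn, tk in itertools.product([32, 64, 128, 256], repeat=3):
    if tm > M or tn > N or tk > K:
        continue                        # prune: tile larger than problem
    if sram_words(tm, tn, tk) * WORD > SRAM_BYTES:
        continue                        # prune: exceeds on-chip buffer
    cost = dram_accesses(tm, tn, tk)
    if best is None or cost < best[0]:
        best = (cost, (tm, tn, tk))
print("min DRAM words %d with tiles %s" % best)
```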
5. Tetris: A Heuristic Static Memory Management Framework for Uniform Memory Multicore Neural Network Accelerators
Authors: Xiao-Bing Chen, Hao Qi, Shao-Hui Peng, Yi-Min Zhuang, Tian Zhi, Yun-Ji Chen (Distinguished Member, CCF). Journal of Computer Science & Technology, SCIE EI CSCD, 2022, Issue 6, pp. 1255-1270.
Uniform memory multicore neural network accelerators (UNNAs) furnish huge computing power to emerging neural network applications. Meanwhile, with neural network architectures going deeper and wider, limited memory capacity has become a constraint on deploying models on UNNA platforms. Therefore, efficiently managing memory space and reducing workload footprints are urgent concerns. In this paper, we propose Tetris: a heuristic static memory management framework for UNNA platforms. Tetris reconstructs execution flows and synchronization relationships among cores to analyze each tensor's liveness interval. The memory management problem is then converted into a sequence permutation problem. Tetris uses a genetic algorithm to explore the permutation space, optimize the memory management strategy, and reduce memory footprints. We evaluate several typical neural networks, and the experimental results demonstrate that Tetris outperforms state-of-the-art memory allocation methods, achieving average memory reduction ratios of 91.9% and 87.9% for a quad-core and a 16-core Cambricon-X platform, respectively.
Keywords: multicore neural network accelerators; liveness analysis; static memory management; memory reuse; genetic algorithm
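The pipeline the abstract outlines, liveness intervals, a permutation encoding, and a genetic search, can be sketched end to end on toy data. Below, each tensor carries a (size, live_start, live_end) triple, a permutation is decoded by greedy first-fit offset assignment over conflicting lifetimes, and a small GA evolves permutations toward a lower peak footprint; the intervals and GA settings are invented, and this is a schematic of the idea rather than Tetris's implementation.

```python
# A minimal sketch of permutation-based static memory planning with a GA
# (toy liveness intervals; not Tetris's actual heuristics).
import random

# (size_bytes, live_start, live_end) for each tensor
TENSORS = [(64, 0, 3), (32, 1, 5), (96, 2, 4), (16, 3, 8), (48, 5, 9), (64, 6, 9)]

def overlap(a, b):
    return not (a[2] < b[1] or b[2] < a[1])   # lifetimes intersect

def footprint(order):
    """Decode a permutation: first-fit each tensor above conflicting ones."""
    placed = []                                  # (offset, tensor_index)
    for i in order:
        size = TENSORS[i][0]
        conflicts = sorted((off, TENSORS[j][0]) for off, j in placed
                           if overlap(TENSORS[i], TENSORS[j]))
        off = 0
        for c_off, c_size in conflicts:          # slide past overlapping blocks
            if off + size <= c_off:
                break
            off = max(off, c_off + c_size)
        placed.append((off, i))
    return max(off + TENSORS[i][0] for off, i in placed)

def ga(pop_size=40, gens=60):
    pop = [random.sample(range(len(TENSORS)), len(TENSORS)) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=footprint)                  # rank by peak footprint
        survivors = pop[:pop_size // 2]
        children = []
        for p in survivors:                      # mutate: swap two positions
            q = p[:]
            i, j = random.sample(range(len(q)), 2)
            q[i], q[j] = q[j], q[i]
            children.append(q)
        pop = survivors + children
    return min(pop, key=footprint)

random.seed(0)
best = ga()
print("best order", best, "peak bytes", footprint(best))
```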
6. Design and Tool Flow of a Reconfigurable Asynchronous Neural Network Accelerator (Cited: 3)
Authors: Jilin Zhang, Hui Wu, Weijia Chen, Shaojun Wei, Hong Chen. Tsinghua Science and Technology, SCIE EI CAS CSCD, 2021, Issue 5, pp. 565-573.
Convolutional neural networks (CNNs) are widely used in computer vision, natural language processing, and other fields, and generally require low power and high efficiency in real applications. Thus, energy efficiency has become a critical indicator of CNN accelerators. Considering that asynchronous circuits have the advantages of low power consumption, high speed, and freedom from clock distribution problems, we design and implement an energy-efficient asynchronous CNN accelerator in a 65 nm Complementary Metal Oxide Semiconductor (CMOS) process. Given the absence of a commercial design tool flow for asynchronous circuits, we develop a novel design flow that implements Click-based asynchronous bundled-data circuits efficiently, down to mask layout, with conventional Electronic Design Automation (EDA) tools. We also introduce an adaptive delay matching method and perform accurate static timing analysis to ensure correct timing. An accelerator for a handwriting recognition network (the LeNet-5 model) is implemented. Silicon test results show that the asynchronous accelerator consumes 30% less power in its computing array than its synchronous counterpart and achieves an energy efficiency of 1.538 TOPS/W, 12% higher than that of the synchronous chip.
Keywords: convolutional neural network (CNN) accelerator; asynchronous circuit; energy efficiency; adaptive delay matching; asynchronous design flow
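The timing constraint behind bundled-data delay matching is simple to state: in each stage, the request delay line must exceed the worst datapath delay plus a safety margin at every process corner, or data arrives after the capture event. The check below illustrates that constraint only; the stage names, delay values, margin, and corners are all invented, and real flows take these numbers from static timing analysis rather than a hand-written table.

```python
# A minimal sketch of the bundled-data delay-matching check
# (all numbers invented for illustration; real values come from STA).
CORNERS = ["ss_0.9V_125C", "tt_1.0V_25C", "ff_1.1V_-40C"]

# per-stage worst datapath delay (ns) at each corner
DATAPATH = {
    "conv_mac": {"ss_0.9V_125C": 2.10, "tt_1.0V_25C": 1.45, "ff_1.1V_-40C": 1.05},
    "pool":     {"ss_0.9V_125C": 0.90, "tt_1.0V_25C": 0.62, "ff_1.1V_-40C": 0.45},
}
# matched delay-line value chosen for each stage (ns), per corner
DELAY_LINE = {
    "conv_mac": {"ss_0.9V_125C": 2.40, "tt_1.0V_25C": 1.70, "ff_1.1V_-40C": 1.25},
    "pool":     {"ss_0.9V_125C": 1.05, "tt_1.0V_25C": 0.75, "ff_1.1V_-40C": 0.50},
}
MARGIN = 0.10  # ns of required slack

for stage in DATAPATH:
    for corner in CORNERS:
        slack = DELAY_LINE[stage][corner] - DATAPATH[stage][corner] - MARGIN
        status = "OK" if slack >= 0 else "VIOLATION"
        print(f"{stage:9s} {corner:14s} slack {slack:+.2f} ns  {status}")
```

In this toy table the pool stage fails at the fast corner, which is exactly the situation that motivates matching delays adaptively per corner rather than with one fixed line.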
7. Design and implementation of dual-mode configurable memory architecture for CNN accelerator
Authors: Shan Rui, Li Xiaoshuo, Gao Xu, Huo Ziqing. High Technology Letters, EI CAS, 2024, Issue 2, pp. 211-220.
With the rapid development of deep learning algorithms, computational complexity and functional diversity are increasing rapidly. However, the gap between high computational density and insufficient memory bandwidth under the traditional von Neumann architecture keeps worsening. Analysis of the algorithmic characteristics of convolutional neural networks (CNNs) shows that the access characteristics of convolution (CONV) and fully connected (FC) operations are very different. Based on this observation, a dual-mode reconfigurable distributed memory architecture for a CNN accelerator is designed. It can be configured in Bank mode or first-in first-out (FIFO) mode to accommodate the access needs of different operations. At the same time, a programmable memory control unit is designed, which effectively controls the dual-mode configurable distributed memory architecture through customized special access instructions and reduces data access delay. The proposed architecture is verified and tested by parallel implementation of several CNN algorithms. The experimental results show that the peak bandwidth reaches 13.44 GB/s at an operating frequency of 120 MHz, which is 1.40, 1.12, 2.80, and 4.70 times the peak bandwidth of existing works.
Keywords: distributed memory structure; neural network accelerator; reconfigurable array processor; configurable memory structure
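The dual-mode idea can be modeled behaviorally: the same physical banks serve either as randomly addressable storage (suited to CONV's reuse-heavy, strided accesses) or as a FIFO (suited to FC's one-pass weight streaming). The class below is an invented software model of that reconfiguration, not the paper's hardware interface or instruction set.

```python
# A minimal behavioral sketch of a dual-mode (Bank / FIFO) buffer
# (invented interface; illustrates the concept only).
class DualModeMemory:
    def __init__(self, n_banks=4, bank_words=1024):
        self.banks = [[0] * bank_words for _ in range(n_banks)]
        self.mode = "bank"
        self.head = self.tail = 0                # FIFO pointers

    def configure(self, mode):
        assert mode in ("bank", "fifo")
        self.mode = mode
        self.head = self.tail = 0                # reset streaming state

    def _slot(self, addr):
        # low-order interleaving: consecutive words hit different banks
        return self.banks[addr % len(self.banks)], addr // len(self.banks)

    # Bank mode: full random access across interleaved banks.
    def read(self, addr):
        assert self.mode == "bank"
        bank, row = self._slot(addr)
        return bank[row]

    def write(self, addr, value):
        assert self.mode == "bank"
        bank, row = self._slot(addr)
        bank[row] = value

    # FIFO mode: strictly in-order push/pop over the same storage.
    def push(self, value):
        assert self.mode == "fifo"
        bank, row = self._slot(self.tail)
        bank[row] = value
        self.tail += 1

    def pop(self):
        assert self.mode == "fifo"
        bank, row = self._slot(self.head)
        self.head += 1
        return bank[row]

mem = DualModeMemory()
mem.configure("bank")                            # CONV phase: random access
mem.write(42, 7); print(mem.read(42))            # -> 7
mem.configure("fifo")                            # FC phase: stream weights
for w in (1, 2, 3): mem.push(w)
print([mem.pop() for _ in range(3)])             # -> [1, 2, 3]
```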
8. A Method for Reducing Noise Radiated from Structures with Vibration Absorbers by Using an Accelerated Neural Network (Cited: 2)
Authors: Li Lianjin, Ge Weimin. Transactions of Tianjin University, EI CAS, 2004, Issue 1, pp. 9-15.
A method for reducing noise radiated from structures by vibration absorbers is presented. Since the usual design method for the absorbers is not valid for noise reduction, the peaks of noise power in the frequency domain are applied as cost functions. Hence, the equations for obtaining optimal parameters of the absorbers become nonlinear expressions. To obtain the parameters, an accelerated neural network procedure is presented. Numerical calculations are carried out for a plate-type cantilever beam with a large width, and experimental tests are also performed on the same beam. The results clarify that the present method is valid for reducing noise radiated from structures. The usual design method for absorbers relies on modal analysis, so the number of absorbers must equal the number of considered modes. Because the present method handles the nonlinear problem directly, there is no restriction on the number of absorbers or modes.
Keywords: structure; vibration and noise control; vibration absorber; neural network; accelerated neural network
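The nonlinear objective the abstract describes, minimizing the peak of a frequency-domain response over absorber parameters, can be posed on a toy system. The sketch below attaches one absorber to a single-mode main structure and minimizes the peak receptance magnitude; the structural parameters are invented, and the paper's accelerated neural network solver is replaced here by a generic numerical optimizer purely to show the shape of the problem.

```python
# A minimal sketch of peak-response absorber tuning on a 2-DOF toy model
# (invented parameters; a generic optimizer stands in for the paper's method).
import numpy as np
from scipy.optimize import minimize

M, K, C = 1.0, 1.0e4, 2.0          # main-structure modal mass/stiffness/damping

def peak_response(params, w=np.linspace(50, 150, 2000)):
    ma, ka, ca = params             # absorber mass, stiffness, damping
    za = -ma * w**2 + 1j * ca * w + ka      # absorber impedance term
    zc = 1j * ca * w + ka                   # coupling (spring + damper)
    zm = -M * w**2 + 1j * C * w + K + zc    # main mass with absorber attached
    h = za / (zm * za - zc**2)              # main-mass receptance
    return np.max(np.abs(h))                # the peak is the cost function

x0 = np.array([0.05, 500.0, 1.0])
res = minimize(peak_response, x0, method="Nelder-Mead",
               bounds=[(0.01, 0.2), (100, 5000), (0.1, 20)])
print("absorber (mass, stiffness, damping):", res.x.round(3))
print("peak |FRF| reduced from %.2e to %.2e"
      % (peak_response(x0), peak_response(res.x)))
```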
9. Design and implementation of reconfigurable CNN accelerator architecture based on elastic storage
Authors: Shan Rui, Xu Jianing, Huo Ziqing. The Journal of China Universities of Posts and Telecommunications, 2025, Issue 1, pp. 74-87.
With the rapid iteration of neural network algorithms, higher requirements are placed on the computational performance and memory access bandwidth of neural network accelerators. Simply increasing bandwidth cannot improve energy efficiency, so improving the data reuse rate has become a hot research topic. From the perspective of supporting data reuse, a reconfigurable convolutional neural network (CNN) accelerator based on elastic storage (RCAES) is designed in this paper. Supporting elastic memory access and flexible data flow reduces data movement between the processor and memory, eases bandwidth pressure, and enhances CNN acceleration performance. The experimental results show that for 1×1 and 3×3 convolutions, execution speed increased by 25.00% and 61.61%, respectively, and the speed of 3×3 max pooling increased by 76.04%.
Keywords: reconfigurable array processor; distributed storage; neural network accelerator; data reuse
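Why data reuse, rather than raw bandwidth, lifts effective throughput is easy to quantify: with line buffers, each input pixel of a 3×3 convolution crosses the external bus once instead of up to nine times. The counts below are analytical, for a toy feature map, and are not the paper's measured numbers.

```python
# A minimal analytical sketch of line-buffer data reuse for 3x3 convolution
# (toy feature-map size; not the paper's measurements).
H, W, K = 224, 224, 3
out_h, out_w = H - K + 1, W - K + 1

# Naive: every output re-reads its full KxK window from external memory.
naive_reads = out_h * out_w * K * K

# Line-buffered: each input pixel crosses the external bus exactly once;
# K-row buffers and shift registers supply the other (K*K - 1) uses on chip.
reuse_reads = H * W

print(f"naive reads:    {naive_reads:,}")
print(f"buffered reads: {reuse_reads:,}")
print(f"traffic reduction: {naive_reads / reuse_reads:.1f}x")
```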
10. Recent advances in efficient computation of deep convolutional neural networks (Cited: 37)
Authors: Jian Cheng, Pei-song Wang, Gang Li, Qing-hao Hu, Han-qing Lu. Frontiers of Information Technology & Electronic Engineering, SCIE EI CSCD, 2018, Issue 1, pp. 64-77.
Deep neural networks have evolved remarkably over the past few years and are currently the fundamental tools of many intelligent systems. At the same time, the computational complexity and resource consumption of these networks continue to increase. This poses a significant challenge to the deployment of such networks, especially in real-time applications or on resource-limited devices. Thus, network acceleration has become a hot topic within the deep learning community. As for hardware implementation of deep neural networks, a batch of accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) have been proposed in recent years. In this paper, we provide a comprehensive survey of recent advances in network acceleration, compression, and accelerator design from both algorithm and hardware points of view. Specifically, we provide a thorough analysis of each of the following topics: network pruning, low-rank approximation, network quantization, teacher-student networks, compact network design, and hardware accelerators. Finally, we introduce and discuss a few possible future directions.
Keywords: deep neural networks; acceleration; compression; hardware accelerator
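Two of the compression techniques this survey covers, magnitude pruning and uniform quantization, fit in a few lines of NumPy. The sketch below zeros the smallest 70% of weights by magnitude, then stores the survivors as int8 with a single scale; the layer shape, pruning ratio, and bit width are arbitrary choices for illustration.

```python
# A minimal sketch of magnitude pruning plus symmetric int8 quantization
# (arbitrary shapes and ratios; illustrates the two techniques only).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(256, 256)).astype(np.float32)  # a dense layer

# --- magnitude pruning: drop the 70% of weights with smallest |value| ---
thresh = np.quantile(np.abs(w), 0.70)
w_pruned = np.where(np.abs(w) >= thresh, w, 0.0)
print(f"sparsity: {np.mean(w_pruned == 0):.0%}")

# --- symmetric int8 quantization of the surviving weights ---
scale = np.max(np.abs(w_pruned)) / 127.0
w_q = np.clip(np.round(w_pruned / scale), -127, 127).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale               # dequantize to check

err = np.abs(w_deq - w_pruned).max()
print(f"max quantization error: {err:.5f} (scale {scale:.5f})")
# int8 storage cuts weight memory 4x and enables integer MACs; the sparsity
# can additionally let supporting hardware skip zero operands.
```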
11. DyPipe: A Holistic Approach to Accelerating Dynamic Neural Networks with Dynamic Pipelining
Authors: Zhuang Yimin, Hu Xing, Chen Xiaobing, Zhi Tian. Journal of Computer Science & Technology, SCIE EI CSCD, 2023, Issue 4, pp. 899-910.
Dynamic neural network (NN) techniques are increasingly important because they facilitate deep learning techniques with more complex network architectures. However, existing studies, which predominantly optimize static computational graphs by static scheduling methods, usually focus on optimizing static neural networks in deep neural network (DNN) accelerators. We analyze the execution process of dynamic neural networks and observe that dynamic features introduce challenges for efficient scheduling and pipelining in existing DNN accelerators. We propose DyPipe, a holistic approach to optimizing dynamic neural network inference in enhanced DNN accelerators. DyPipe achieves significant performance improvements for dynamic neural networks while introducing negligible overhead for static neural networks. Our evaluation demonstrates that DyPipe achieves 1.7× speedup on dynamic neural networks and maintains more than 96% performance for static neural networks.
Keywords: dynamic neural network (NN); deep neural network (DNN) accelerator; dynamic pipelining
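The scheduling problem the abstract points at can be illustrated with a small simulation: a statically scheduled pipeline must budget every stage slot for its worst case, while a dynamically scheduled one lets each sample advance as soon as its input-dependent stage work finishes. The workload model below is invented, and this is a generic illustration of the gap, not DyPipe's algorithm.

```python
# A minimal simulation contrasting worst-case static pipelining with
# data-dependent dynamic pipelining (invented workload; not DyPipe itself).
import random

random.seed(1)
N_SAMPLES, N_STAGES = 200, 4
# input-dependent stage times, e.g., variable sequence lengths / early exits
times = [[random.uniform(0.2, 1.0) for _ in range(N_STAGES)]
         for _ in range(N_SAMPLES)]

# Static schedule: every stage slot is the worst case over all samples.
worst = [max(t[s] for t in times) for s in range(N_STAGES)]
static_total = max(worst) * (N_SAMPLES + N_STAGES - 1)  # synchronous pipeline

# Dynamic schedule: stage s of sample i starts when both stage s-1 of sample i
# and stage s of sample i-1 are done (classic pipeline recurrence).
finish = [[0.0] * N_STAGES for _ in range(N_SAMPLES)]
for i in range(N_SAMPLES):
    for s in range(N_STAGES):
        ready = max(finish[i][s - 1] if s else 0.0,
                    finish[i - 1][s] if i else 0.0)
        finish[i][s] = ready + times[i][s]
dynamic_total = finish[-1][-1]

print(f"static  (worst-case slots): {static_total:7.1f}")
print(f"dynamic (data-dependent):   {dynamic_total:7.1f}")
print(f"speedup: {static_total / dynamic_total:.2f}x")
```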
12. Design of high parallel CNN accelerator based on FPGA for AIoT
Authors: Lin Zhijian, Gao Xuewei, Chen Xiaopei, Zhu Zhipeng, Du Xiaoyong, Chen Pingping. The Journal of China Universities of Posts and Telecommunications, EI CSCD, 2022, Issue 5, pp. 1-9, 61.
To tackle the challenge of applying convolutional neural networks (CNNs) on field-programmable gate arrays (FPGAs) due to their computational complexity, a high-performance CNN hardware accelerator based on the Verilog hardware description language was designed, which utilizes a pipeline architecture with three parallel dimensions: input channels, output channels, and convolution kernels. First, two multiply-and-accumulate (MAC) operations were packed into one digital signal processing (DSP) block of the FPGA to double the computation rate of the CNN accelerator. Second, strategies of feature map block partitioning and special memory arrangement were proposed to optimize the total amount of off-chip memory access and reduce the pressure on FPGA bandwidth. Finally, an efficient computational array combining a multiply-add tree and the Winograd fast convolution algorithm was designed to balance hardware resource consumption and computational performance. The highly parallel CNN accelerator was deployed on the ZU3EG of Alinx, using the YOLOv3-tiny algorithm as the test object. The average computing performance of the CNN accelerator is 127.5 giga operations per second (GOPS). The experimental results show that the hardware architecture effectively improves the computational power of CNNs and provides better performance than other existing schemes in terms of power consumption and the efficiency of DSPs and block random access memories (BRAMs).
Keywords: artificial intelligence of things (AIoT); convolutional neural network (CNN) accelerator; Winograd convolution; field-programmable gate array (FPGA)
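The Winograd fast convolution the paper employs trades multiplies for adds: for F(2×2, 3×3), a 4×4 input tile and a 3×3 kernel yield a 2×2 output with 16 elementwise multiplies instead of 36 for direct convolution. The transform matrices below are the standard published ones; tile extraction, channel accumulation, and the paper's DSP packing are omitted, so this is a numerical sketch of the algorithm rather than the accelerator's datapath.

```python
# A minimal sketch of Winograd F(2x2, 3x3) fast convolution
# (standard transform matrices; tiling and hardware mapping omitted).
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=np.float32)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 kernel -> 2x2 output tile."""
    U = G @ g @ G.T                  # transform kernel to the 4x4 domain
    V = B_T @ d @ B_T.T              # transform input tile to the 4x4 domain
    return A_T @ (U * V) @ A_T.T     # 16 elementwise multiplies, then inverse

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4)).astype(np.float32)
g = rng.standard_normal((3, 3)).astype(np.float32)

# reference: direct 'valid' correlation of the 4x4 tile with the 3x3 kernel
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)], dtype=np.float32)
print(np.allclose(winograd_2x2_3x3(d, g), ref, atol=1e-5))  # True
```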
13. DRNet: Towards fast, accurate and practical dish recognition (Cited: 1)
Authors: Cheng SiYuan, Chu BinFei, Zhong BiNeng, Zhang ZiKai, Liu Xin, Tang ZhenJun, Li XianXian. Science China (Technological Sciences), SCIE EI CAS CSCD, 2021, Issue 12, pp. 2651-2661.
Existing algorithms for dish recognition mainly focus on accuracy with predefined classes, thus limiting their application scope. In this paper, we propose a practical two-stage dish recognition framework (DRNet) that yields a tradeoff between speed and accuracy while adapting to variation in the number of classes. In the first stage, we build an arbitrary-oriented dish detector (AODD) to localize dish position, which effectively alleviates the impact of background noise and pose variations. In the second stage, we propose a dish reidentifier (DReID) to recognize registered dishes and handle uncertain categories. To further improve the accuracy of DRNet, we design an attribute recognition (AR) module to predict the attributes of dishes; the attributes serve as auxiliary information to enhance the discriminative ability of DRNet. Moreover, pruning and quantization are applied to the model so that it can be deployed in embedded environments. Finally, to facilitate the study of dish recognition, a well-annotated dataset is established. Our AODD, DReID, AR, and DRNet run at about 14, 25, 16, and 5 fps, respectively, on the RKNN 3399 pro hardware.
Keywords: neural network acceleration; neural network quantization; object detection; reidentification; dish recognition