Abstract: Reconstructing lost acceleration responses is crucial for assessing structural condition in structural health monitoring (SHM). However, traditional methods struggle to reconstruct acceleration responses with complex features, which results in low reconstruction accuracy. This paper addresses this challenge by leveraging the feature extraction and learning capabilities of fully convolutional networks (FCN) to achieve precise reconstruction of acceleration responses. In the designed network architecture, skip connections preserve low-level details, which facilitates information flow and improves training efficiency and accuracy. Dropout is employed to reduce computational load and enhance feature extraction. The proposed FCN model automatically extracts high-level features from the input data and establishes a nonlinear mapping between the input and output responses. The reconstruction accuracy of the FCN was evaluated using acceleration data from an experimental arch rib and compared with several traditional methods. The approach was also applied to reconstruct actual acceleration responses measured by an SHM system on a long-span bridge. Through parametric analysis, the effects of the available response positions, the number of available channels, and multi-channel response reconstruction were explored. The results indicate that the method reconstructs responses with high precision in both the time and frequency domains and outperforms other networks, confirming its effectiveness under various sensor data loss scenarios.
Funding: Supported by the Science and Technology Innovation Key R&D Program of Chongqing (CSTB2025TIAD-STX0032), the National Key Research and Development Program of China (2024YFF0908200), the Chongqing Technology Innovation and Application Development Special Key Project (CSTB2024TIAD-KPX0018), and the Southwest University Graduate Student Research Innovation (SWUB24051).
Abstract: Dear Editor, This letter proposes a tensor low-rank orthogonal compression (TLOC) model for convolutional neural networks (CNNs), which enables an efficient and highly accurate low-rank representation. Model compression is crucial for deploying deep neural network (DNN) models on resource-constrained embedded devices.
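To make the payoff of a low-rank representation concrete, the following minimal sketch counts parameters for a plain matrix factorization of a convolution kernel. It is not the TLOC model (which is tensor-based and orthogonality-constrained); the function names and the example layer sizes are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: parameter savings from a rank-r matrix factorization of
# a conv kernel. This is NOT the TLOC model; it only illustrates why low-rank
# representations shrink CNN layers.


def dense_conv_params(c_out: int, c_in: int, k: int) -> int:
    """Weights in a dense k x k convolution kernel (bias ignored)."""
    return c_out * c_in * k * k


def low_rank_conv_params(c_out: int, c_in: int, k: int, rank: int) -> int:
    """Weights after reshaping the kernel to a c_out x (c_in*k*k) matrix W and
    replacing it with U @ V, where U is c_out x rank and V is rank x (c_in*k*k)."""
    return rank * (c_out + c_in * k * k)


def compression_ratio(c_out: int, c_in: int, k: int, rank: int) -> float:
    return dense_conv_params(c_out, c_in, k) / low_rank_conv_params(c_out, c_in, k, rank)


# Illustrative layer: 3x3 kernel, 256 -> 256 channels, rank 32.
dense = dense_conv_params(256, 256, 3)
factored = low_rank_conv_params(256, 256, 3, 32)
ratio = compression_ratio(256, 256, 3, 32)
print(dense, factored, round(ratio, 2))  # 589824 81920 7.2
```

The saving only materializes when the rank is small relative to both matrix dimensions; the accuracy cost of truncating the rank is what methods such as TLOC aim to control.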
Funding: Financially supported by the National Natural Science Foundation of China (No. 22473032).
Abstract: Accelerated aging tests are widely used to rapidly evaluate the durability of materials, with thermal-oxidative aging being the most common approach. To quantitatively predict the effects of multiple coupled factors, this study takes glass-fiber-reinforced polyamide 66 (PA66-GF) as a model system and proposes a high-precision paradigm for coupled thermal-oxidative aging. By integrating Arrhenius-type reaction kinetics with oxygen diffusion, a predictive formula was developed that holistically captures the nonlinear synergistic effects of multiple factors, thereby overcoming the limitations of traditional single-variable models. A systematic evaluation of the stepwise improved formulas through nonlinear fitting showed that the coefficient of determination (R²) increased from 0.223 to 0.803, elucidating why conventional approaches fail at quantitative prediction. These formulas were then embedded as physical constraints into a physics-informed neural network (PINN), which further improved predictive performance, with the proposed formula achieving a peak R² of 0.946. The results highlight that robust data fitting alone is insufficient; the decisive factor for the success of a PINN is whether the embedded formula faithfully reflects the underlying physical mechanisms. When applied to glass-fiber-reinforced polyamide 6 (PA6-GF), the formula-constrained PINN maintained high accuracy (R² = 0.916), demonstrating strong cross-system generalizability. In summary, this work establishes a robust hybrid physics-machine learning framework that combines high accuracy with transferability for predicting the thermal-oxidative aging behavior of composite material systems.
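The abstract does not give the coupled formula itself, so the sketch below shows only two standard ingredients it builds on: the Arrhenius rate law with the resulting temperature acceleration factor, and the R² metric used to compare the fitted formulas. The activation energy and temperatures are assumed, illustrative values, not data from the study.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol*K)


def arrhenius_rate(pre_exponential: float, activation_energy: float, temp_k: float) -> float:
    """Arrhenius rate constant k = A * exp(-Ea / (R * T))."""
    return pre_exponential * math.exp(-activation_energy / (R_GAS * temp_k))


def acceleration_factor(activation_energy: float, t_use_k: float, t_test_k: float) -> float:
    """Ratio k(T_test) / k(T_use): how much faster aging proceeds at the test temperature."""
    return math.exp(activation_energy / R_GAS * (1.0 / t_use_k - 1.0 / t_test_k))


def r_squared(observed, predicted) -> float:
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative only: Ea = 100 kJ/mol, 23 degC service vs 120 degC oven (assumed values).
af = acceleration_factor(100e3, t_use_k=296.15, t_test_k=393.15)
```

A single-variable Arrhenius model like this captures temperature alone; the study's point is that adding oxygen diffusion and factor coupling is what lifts R² substantially.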
Funding: Supported by the National Science Foundation (NSF) (1822085, 1725456, 1816833, 1500848, 1719160, and 1725447), the NSF Computing and Communication Foundations (1740352), the Nanoelectronics COmputing REsearch Program in the Semiconductor Research Corporation (NC-2766-A), and the Center for Research in Intelligent Storage and Processing-in-Memory, one of six centers in the Joint University Microelectronics Program, an SRC program sponsored by the Defense Advanced Research Projects Agency.
Abstract: Recently, owing to the availability of big data and the rapid growth of computing power, artificial intelligence (AI) has regained tremendous attention and investment. Machine learning (ML) approaches have been successfully applied to many problems in academia and industry. Although the explosion of big data applications is driving the development of ML, it also imposes severe data-processing speed and scalability challenges on conventional computer systems. Computing platforms specifically designed for AI applications have been considered, ranging from a complement to von Neumann platforms to a “must-have” stand-alone technical solution. These platforms, which belong to a larger category named “domain-specific computing,” focus on customization for AI. In this article, we summarize recent advances in accelerator designs for deep neural networks (DNNs), that is, DNN accelerators. We discuss various architectures that support DNN execution in terms of computing units, dataflow optimization, targeted network topologies, architectures based on emerging technologies, and accelerators for emerging applications. We also share our vision of future trends in AI chip design.
Funding: Supported by the National Natural Science Foundation of China (Nos. 11875146 and 11505074) and the National Key Research and Development Program of China (No. 2016YFE0100900).
Abstract: Extracting amplitude and timing information from shaped pulses is an important step in nuclear physics experiments. For this purpose, a neural network can serve as an alternative in offline data processing. To process the data in real time and reduce the offline data storage required per trigger event, we designed a customized neural network accelerator on a field-programmable gate array (FPGA) platform that implements specific layers of a convolutional neural network, and deployed it in the front-end electronics of the detector. With fully reconfigurable hardware, a tested neural network structure was used for accurate timing of the shaped pulses common in front-end electronics. The design can process up to four channels of pulse signals simultaneously. The peak performance of each channel is 1.665 giga-operations per second (GOPS) at a working frequency of 25 MHz.
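The quoted figures can be sanity-checked with a few lines of arithmetic. The throughput, clock, and channel count come from the abstract; counting one multiply-accumulate (MAC) as two operations and assuming all four channels peak simultaneously are assumptions made here for illustration.

```python
# Back-of-the-envelope check of the quoted throughput figures.
PEAK_OPS_PER_CHANNEL = 1.665e9  # operations per second per channel (from the abstract)
CLOCK_HZ = 25e6                 # working frequency, 25 MHz (from the abstract)
CHANNELS = 4                    # maximum simultaneous channels (from the abstract)

ops_per_cycle = PEAK_OPS_PER_CHANNEL / CLOCK_HZ
# Assumption: one multiply-accumulate (MAC) is counted as two operations.
macs_per_cycle = ops_per_cycle / 2
# Assumption: all four channels can reach their peak at the same time.
aggregate_peak_ops = PEAK_OPS_PER_CHANNEL * CHANNELS

print(round(ops_per_cycle, 1), round(macs_per_cycle, 1), aggregate_peak_ops / 1e9)  # 66.6 33.3 6.66
```

Under these assumptions, each channel performs roughly 33 MACs per clock cycle, which indicates how much spatial parallelism the design needs to reach its peak at a modest 25 MHz clock.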
Funding: Partially supported by the National Key Research and Development Program of China (under Grants 2017YFB1003101, 2018AAA0103300, 2017YFA0700900, 2017YFA0700902, 2017YFA0700901), the National Natural Science Foundation of China (under Grants 61732007, 61432016, 61532016, 61672491, 61602441, 61602446, 61732002, 61702478, and 61732020), the Beijing Natural Science Foundation (JQ18013), the National Science and Technology Major Project (2018ZX01031102), the Transformation and Transfer of Scientific and Technological Achievements of Chinese Academy of Sciences (KFJ-HGZX-013), the Key Research Projects in Frontier Science of Chinese Academy of Sciences (QYZDBSSW-JSC001), the Strategic Priority Research Program of Chinese Academy of Science (XDB32050200, XDC01020000), the Standardization Research Project of Chinese Academy of Sciences (BZ201800001), the Beijing Academy of Artificial Intelligence (BAAI), and the Beijing Nova Program of Science and Technology (Z191100001119093).
Abstract: In recent years, deep learning algorithms have been widely deployed, from cloud servers to terminal devices, and researchers have proposed various neural network accelerators and software development environments. In this article, we review representative neural network accelerators. Because the accelerator and its software form a single system, the corresponding software stack must take the hardware architecture of the specific accelerator into account to improve end-to-end performance. We therefore summarize the programming environments of neural network accelerators and the optimizations in their software stacks. Finally, we comment on future trends in neural network accelerators and programming environments.
Funding: Supported by the National Key Research and Development Program of China (Nos. 2018AAA0103300, 2017YFA0700900, 2017YFA0700902, 2017YFA0700901, 2019AAA0103802, 2020AAA0103802).
Abstract: As data and model sizes grow, deep neural networks (DNNs) show outstanding performance in many artificial intelligence (AI) applications. However, large model sizes make it challenging to run DNNs with high performance and low power on processors such as central processing units (CPUs), graphics processing units (GPUs), and tensor processing units (TPUs). To meet this challenge, this paper proposes LOGNN, an 8-bit data representation, and LACC, a hardware-software co-designed DNN accelerator. The LOGNN representation replaces multiplications with additions and shifts when running DNNs. Using domain-specific arithmetic units, the LACC accelerator achieves higher efficiency than state-of-the-art DNN accelerators, improving performance per watt by 1.5× on average.
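The abstract does not describe the LOGNN bit layout, so the following is only a minimal sketch of the general idea it relies on: if a weight is stored as a sign and a power-of-two exponent, multiplying an integer activation by it reduces to a shift, and a dot product needs only shifts and additions. The exponent range and all names are assumptions.

```python
import math

# Hypothetical sketch of base-2 logarithmic weight quantization. The real
# LOGNN 8-bit layout is not described in the abstract; here a weight is kept
# only as (sign, integer exponent) so that w ~= sign * 2**exponent.


def log2_quantize(w: float, min_exp: int = -7, max_exp: int = 0):
    """Round |w| to the nearest power of two in the log domain."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exponent = round(math.log2(abs(w)))
    exponent = max(min_exp, min(max_exp, exponent))  # clamp to the assumed code range
    return sign, exponent


def shift_multiply(x: int, sign: int, exponent: int) -> int:
    """Compute x * sign * 2**exponent with shifts only.
    Assumes x is a non-negative integer activation (e.g. after ReLU)."""
    if sign == 0:
        return 0
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return shifted if sign > 0 else -shifted


def log_dot(activations, weights) -> int:
    """Dot product using only shifts and additions."""
    acc = 0
    for x, w in zip(activations, weights):
        sign, exponent = log2_quantize(w)
        acc += shift_multiply(x, sign, exponent)
    return acc


print(log_dot([64, 32], [0.5, 0.25]))  # 40
```

In hardware the weight quantization would happen offline, so only the shift-and-add datapath remains at inference time; right shifts truncate, which is one source of the accuracy loss such formats must manage.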
Funding: Supported by the National Science Foundation of China (Grant No. 62202118), the Top-Technology Talent Project from Guizhou Education Department (Qianjiao Ji [2022] 073), the Natural Science Foundation of Hebei Province (Grant Nos. F2022203045 and F2022203026), and the Central Government Guided Local Science and Technology Development Fund Project (Grant No. 226Z0701G).
Abstract: Unmanned aerial vehicle (UAV)-enabled edge computing is emerging as a potential enabler for the Artificial Intelligence of Things (AIoT) in forthcoming sixth-generation (6G) communication networks. With flexible UAVs, massive amounts of sensing data can be gathered and processed promptly regardless of geographical location. Deep neural networks (DNNs) are becoming a driving force for extracting valuable information from sensing data. However, because of the limited battery capacity of UAVs, the lightweight servers they carry cannot meet the extremely high requirements of inference tasks. In this work, we investigate a DNN model placement problem for AIoT applications, in which trained DNN models are selected and placed on UAVs to execute inference tasks locally. Since future DNN model request profiles and system operating states cannot be obtained in advance in UAV-enabled edge computing, we leverage Lyapunov optimization to address the placement problem. Based on the observed system state, an advanced online placement (AOP) algorithm is developed to solve the transformed problem in each time slot; it reduces DNN model transmission delay and disk I/O energy cost simultaneously while keeping the input data queues stable. Finally, extensive simulations demonstrate the effectiveness of the AOP algorithm. Compared with benchmark algorithms, AOP reduces the model placement cost by 18.14% and the input data queue backlog by 29.89% on average.
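Lyapunov-based online algorithms typically work slot by slot: observe the current queue backlog, choose the action that minimizes a drift-plus-penalty bound, then update the queue. The sketch below shows that generic pattern only; it is not the AOP algorithm, and the action set, costs, and service amounts are hypothetical.

```python
# Generic drift-plus-penalty sketch (not the AOP algorithm itself).
# Each slot: observe backlog Q, pick the action minimizing V * cost - Q * served,
# then apply the queue update.


def queue_update(backlog: float, arrivals: float, served: float) -> float:
    """Q(t+1) = max(Q(t) - b(t), 0) + a(t)."""
    return max(backlog - served, 0.0) + arrivals


def choose_action(backlog: float, actions, v: float):
    """actions: list of (name, placement_cost, data_served) tuples.
    V trades off the cost (penalty) against queue stability (drift)."""
    return min(actions, key=lambda a: v * a[1] - backlog * a[2])


# Hypothetical per-slot options: which DNN model (if any) to place on the UAV.
ACTIONS = [("idle", 0.0, 0.0), ("small_model", 1.0, 2.0), ("large_model", 3.0, 5.0)]

backlog = 0.0
history = []
for arrivals in [4.0, 4.0, 4.0]:
    name, cost, served = choose_action(backlog, ACTIONS, v=1.0)
    backlog = queue_update(backlog, arrivals, served)
    history.append((name, backlog))
print(history)
```

With an empty queue the cheapest action (idle) wins; once data accumulates, the backlog term outweighs the placement cost and the algorithm serves more data, which is how queue stability is maintained without knowing future requests. Larger V favors lower cost at the price of longer queues.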
Abstract: With the development of computer vision research and owing to their state-of-the-art performance on image and video processing tasks, deep neural networks (DNNs) have been widely applied in various applications (autonomous vehicles, weather forecasting, counter-terrorism, surveillance, traffic management, etc.). However, to achieve such performance, DNN models have become increasingly complex and deep, which results in heavy computational load. As a result, general-purpose central processing units (CPUs) cannot meet real-time application requirements. To address this bottleneck, hardware acceleration of DNNs has attracted great attention. Specifically, to serve various real-life applications, DNN acceleration solutions must cope with intensive memory and computation demands. In this paper, a novel resource-saving architecture based on a field-programmable gate array (FPGA) is proposed. Thanks to a newly designed processing element (PE), the architecture achieves good performance with extremely limited computing resources. On-chip buffer allocation further reduces memory resource usage. Moreover, the accelerator improves its performance by exploiting the sparsity of the input feature maps. Compared with other state-of-the-art FPGA-based solutions, our architecture achieves good performance with quite limited resource consumption and thus fully meets the requirements of real-time applications.
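Exploiting input sparsity generally means skipping multiply-accumulate (MAC) operations whose activation is zero, which is common after ReLU. The following minimal sketch shows that principle in software; the feature-map and kernel values are made up, and the paper's PE may implement zero skipping differently.

```python
def zero_skipping_dot(activations, weights):
    """Accumulate x * w only for non-zero activations.
    Returns (result, number of MACs actually performed)."""
    acc = 0
    macs = 0
    for x, w in zip(activations, weights):
        if x == 0:  # a zero activation contributes nothing, so skip the MAC
            continue
        acc += x * w
        macs += 1
    return acc, macs


feature_map = [0, 3, 0, 0, 2, 0]  # illustrative post-ReLU activations
kernel = [5, 1, 7, 2, 4, 9]       # illustrative weights
result, macs = zero_skipping_dot(feature_map, kernel)
dense_result = sum(x * w for x, w in zip(feature_map, kernel))
print(result, macs, dense_result)  # 11 2 11
```

The result is identical to the dense computation, but only 2 of the 6 MACs are executed; in hardware the gain depends on how cheaply zeros can be detected and how well the remaining work is balanced across PEs.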
Funding: Supported by the National Key R&D Program of China (No. 2018AAA0103300), the National Natural Science Foundation of China (Nos. 61925208, U20A20227, U22A2028), the Chinese Academy of Sciences Project for Young Scientists in Basic Research (No. YSBR-029), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences.
Abstract: With the increasing demand for computational power in artificial intelligence (AI) algorithms, dedicated accelerators have become a necessity. However, complex hardware architectures, vast design search spaces, and the complex tasks of accelerators pose significant challenges. Traditional search methods can become prohibitively slow as the search space expands. A design space exploration (DSE) method based on transfer learning is proposed, which reduces the time spent on repeated training and uses multi-task models for different tasks on the same processor. The proposed method accurately predicts the latency and energy consumption associated with neural network accelerator design parameters, identifying optimal outcomes faster than traditional methods. Compared with other DSE methods that use multilayer perceptrons (MLP), it also requires less training time. Comparative experiments demonstrate that the proposed method improves the efficiency of DSE without compromising the accuracy of the results.
Funding: Supported by the National Key Research and Development Program of China (Nos. 2017YFB1003101, 2018AAA0103300, 2017YFA0700900), the National Natural Science Foundation of China (Nos. 61702478, 61732007, 61906179), the Beijing Natural Science Foundation (No. JQ18013), the National Science and Technology Major Project (No. 2018ZX01031102), and the Beijing Academy of Artificial Intelligence.
Abstract: Deep learning is now widely used in intelligent apps on mobile devices. In pursuit of ultra-low power and latency, integrating neural network accelerators (NNAs) into mobile phones has become a trend. However, conventional deep learning programming frameworks do not support such devices well, which leads to low computing efficiency and high memory occupation. To address this problem, a two-stage pipeline is proposed that optimizes deep learning model inference on mobile devices with NNAs in terms of both speed and memory footprint. The first stage reduces the computation workload via graph optimization, including splitting and merging nodes. The second stage optimizes at the compilation level, including kernel fusion and in-advance (ahead-of-time) compilation. The proposed optimizations are evaluated on a commercial mobile phone with an NNA. Experimental results show that the proposed approaches achieve speedups of 2.8× to 26× and reduce the memory footprint by up to 75%.
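Node merging and kernel fusion both reduce the number of separately launched kernels and the intermediate tensors written to memory. As a minimal sketch of the idea, assuming a linear chain of operators and hypothetical op names (the paper's actual graph rules are not given), runs of consecutive element-wise ops can be collapsed into single fused nodes:

```python
# Hypothetical op names; a linear operator chain stands in for a real graph.
ELEMENTWISE = {"add", "mul", "relu", "sigmoid"}


def fuse_elementwise(ops):
    """Merge runs of consecutive element-wise ops into one fused node (a tuple),
    so each run can be emitted as a single kernel."""
    fused = []
    for op in ops:
        if op in ELEMENTWISE:
            if fused and isinstance(fused[-1], tuple):
                fused[-1] = fused[-1] + (op,)  # extend the current fused run
            else:
                fused.append((op,))  # start a new fused run
        else:
            fused.append(op)  # compute-heavy op stays a separate kernel
    return fused


chain = ["conv", "add", "relu", "pool", "mul", "sigmoid"]
print(fuse_elementwise(chain))
# ['conv', ('add', 'relu'), 'pool', ('mul', 'sigmoid')]
```

Six kernels become four here; a production compiler would additionally check data dependencies, tensor shapes, and whether the target NNA supports each fused pattern.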
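Abstract: Traditional slicing and mapping methods for deep neural network (DNN) models suffer from resource-scheduling and data-transfer bottlenecks. To address these bottlenecks, an efficient dynamic slicing and intelligent mapping optimization algorithm for DNNs on network-on-chip (NoC) accelerators is proposed. The algorithm flexibly partitions the computational tasks of a DNN model through dynamic slicing. It then applies an intelligent mapping strategy to optimize task allocation and data-flow management in the NoC architecture. Experimental results show that, compared with traditional methods, the algorithm delivers significant improvements in computational throughput, NoC transmission latency, the number of external memory accesses, and computational energy efficiency, with particularly strong performance on complex models.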
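One way to picture dynamic slicing and mapping is the following sketch. It assumes a 2-D mesh NoC, a single memory controller at corner (0, 0), XY routing, and slicing along output channels sized to a per-core buffer. The greedy least-loaded placement is a simple stand-in, not the paper's intelligent mapping strategy. `Layer`, `slice_layer`, and `map_tiles` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    out_channels: int
    weight_bytes_per_channel: int  # e.g. 3*3*C_in bytes for int8 weights


def slice_layer(layer, buffer_bytes):
    """Split output channels into tiles whose weights fit one core's local buffer."""
    ch_per_tile = max(1, buffer_bytes // layer.weight_bytes_per_channel)
    return [
        (layer.name, start, min(start + ch_per_tile, layer.out_channels))
        for start in range(0, layer.out_channels, ch_per_tile)
    ]


def hops(core, mem):
    """Hop count under XY routing on a 2-D mesh (Manhattan distance)."""
    return abs(core[0] - mem[0]) + abs(core[1] - mem[1])


def map_tiles(tiles, mesh_w, mesh_h, mem=(0, 0)):
    """Greedy mapping: least-loaded core first, ties broken by distance to memory."""
    load = {(x, y): 0 for x in range(mesh_w) for y in range(mesh_h)}
    placement = {}
    for tile in tiles:
        core = min(load, key=lambda c: (load[c], hops(c, mem)))
        placement[tile] = core
        load[core] += tile[2] - tile[1]  # work measured in output channels
    return placement, load


conv = Layer("conv3", out_channels=64, weight_bytes_per_channel=3 * 3 * 128)
tiles = slice_layer(conv, buffer_bytes=8 * 1024)
placement, load = map_tiles(tiles, mesh_w=2, mesh_h=2)
print(len(tiles), load)
```

Tile size is recomputed from each layer's weight footprint, so the same routine slices wide and narrow layers differently. A real mapper would also weigh inter-tile traffic and link contention.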
Abstract: Deep neural networks (DNNs) are widely used in image recognition, image classification, and other fields. However, as model size increases, DNN hardware accelerators face higher area overhead and energy consumption. In recent years, stochastic computing (SC) has been considered a way to implement deep neural networks at reduced hardware cost. A probabilistic compensation algorithm is proposed to solve the accuracy problem of stochastic computing, and a fully parallel neural network accelerator based on a deterministic method is designed. Software simulation results show that the probabilistic compensation algorithm achieves an accuracy of 95.32% on the CIFAR-10 dataset, 14.98 percentage points higher than the traditional SC algorithm. The deterministic algorithm achieves an accuracy of 95.06% on CIFAR-10, 14.72 percentage points higher than the traditional SC algorithm. Very Large Scale Integration (VLSI) hardware test results show that the normalized energy efficiency of the fully parallel accelerator based on the deterministic method is 31% higher than that of a circuit based on binary computing.
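The accuracy problem that the compensation algorithm targets comes from the randomness of the bitstreams. The following minimal pure-Python sketch contrasts unipolar SC multiplication (an AND gate on two independent random bitstreams) with one known deterministic scheme, clock division of unary streams. Clock division yields the exact product at the cost of longer streams. Whether the paper's deterministic method uses clock division is an assumption, and the function names are hypothetical.

```python
import random


def stochastic_stream(p, n, rng):
    """Unipolar SC encoding: each of n bits is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def sc_multiply(a, b, n, seed=0):
    """AND of two independent unipolar streams estimates a * b."""
    rng = random.Random(seed)
    sa = stochastic_stream(a, n, rng)
    sb = stochastic_stream(b, n, rng)
    return sum(x & y for x, y in zip(sa, sb)) / n


def unary_stream(k, n):
    """Deterministic unary encoding of k/n: k ones followed by n - k zeros."""
    return [1] * k + [0] * (n - k)


def deterministic_multiply(ka, kb, n):
    """Clock division: stream A repeats every n cycles while each bit of B is held n cycles."""
    sa = unary_stream(ka, n) * n
    sb = [bit for bit in unary_stream(kb, n) for _ in range(n)]
    return sum(x & y for x, y in zip(sa, sb)) / (n * n)


print(sc_multiply(3 / 8, 5 / 8, 4096))   # close to, but not exactly, 0.234375
print(deterministic_multiply(3, 5, 8))   # 0.234375 == (3/8) * (5/8)
```

With 8-level unary encoding, the deterministic product takes 64 cycles and is exact. The random-stream estimate, by contrast, only converges as the stream length grows.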
Funding: Supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03039493), and in part by the NRF grant funded by the Korean government (MSIT) (NRF-2022R1A2C1004401).
Abstract: Grains are the most important food consumed globally, yet pest infestations can severely reduce their yield. To address this issue, researchers strive to improve the yield-to-seed ratio through effective pest detection methods. Traditional approaches often rely on preprocessed datasets, but there is a growing need for solutions that use real-time images of pests in their natural habitat. Our study introduces a novel two-step approach to this challenge. First, raw images with complex backgrounds are captured. Second, features are extracted using both hand-crafted algorithms (Haralick, local binary patterns (LBP), and color histogram) and modified deep learning architectures. We propose two models for this purpose: PestNet-EF and PestNet-LF. PestNet-EF uses early fusion to integrate hand-crafted and deep learning features. It then applies adaptive feature selection methods such as correlation-based feature selection (CFS) and recursive feature elimination (RFE). PestNet-LF uses late fusion and adds three layers (fully connected, softmax, and classification) to enhance performance. The models were evaluated on 15 pest classes, five each for rice, corn, and wheat, and were tested on the IP102 dataset. Simulation results show that PestNet-EF achieved an accuracy of 96%. PestNet-LF achieved 94% with majority voting and 92% with averaging. The proposed approach was also compared with existing methods based on hand-crafted features and transfer learning, demonstrating its effectiveness for real-time pest detection and improved agricultural yield.
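Early and late fusion can be sketched as follows, assuming features are already extracted as NumPy arrays and each classifier outputs class probabilities. Per-block z-score normalization before concatenation and the tie-breaking rule in the vote are illustrative choices, not details from the study. CFS/RFE feature selection and the extra PestNet-LF layers are omitted, and the function names are hypothetical.

```python
import numpy as np


def early_fusion(handcrafted, deep):
    """Early fusion: standardize each feature block, then concatenate per sample."""
    def zscore(f):
        return (f - f.mean(axis=0)) / (f.std(axis=0) + 1e-8)
    return np.hstack([zscore(handcrafted), zscore(deep)])


def late_fusion_average(probs):
    """probs: (n_models, n_samples, n_classes). Average probabilities, then argmax."""
    return probs.mean(axis=0).argmax(axis=1)


def late_fusion_majority(probs):
    """Each model votes for its argmax class; ties go to the lowest class index."""
    votes = probs.argmax(axis=2)  # (n_models, n_samples)
    n_classes = probs.shape[2]
    counts = np.array([np.bincount(v, minlength=n_classes) for v in votes.T])
    return counts.argmax(axis=1)


rng = np.random.default_rng(0)
fused = early_fusion(rng.normal(size=(4, 5)), rng.normal(size=(4, 8)))
print(fused.shape)  # (4, 13)

# Three hypothetical classifiers, one sample, three pest classes.
probs = np.array([[[0.40, 0.35, 0.25]],
                  [[0.40, 0.35, 0.25]],
                  [[0.00, 0.95, 0.05]]])
print(late_fusion_majority(probs), late_fusion_average(probs))  # [0] [1]
```

The toy example shows why the two late-fusion rules can disagree. Majority voting ignores confidence, so two lukewarm votes outvote one confident prediction that averaging would favor.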
Funding: National Natural Science Foundation of China (Grant Nos. 52408314, 52278292); Chongqing Outstanding Youth Science Foundation (Grant No. CSTB2023NSCQ-JQX0029); Science and Technology Project of Sichuan Provincial Transportation Department (Grant No. 2023-ZL-03); Science and Technology Project of Guizhou Provincial Transportation Department (Grant No. 2024-122-018).
Abstract: Reconstructing lost acceleration responses is crucial for assessing structural conditions in structural health monitoring (SHM). However, traditional methods struggle to reconstruct acceleration responses with complex features, which lowers reconstruction accuracy. This paper addresses this challenge by leveraging the feature extraction and learning capabilities of fully convolutional networks (FCNs) to reconstruct acceleration responses precisely. In the designed network architecture, skip connections preserve low-level details, facilitating the flow of information and improving training efficiency and accuracy. Dropout is employed to reduce computational load and enhance feature extraction. The proposed FCN model automatically extracts high-level features from the input data and establishes a nonlinear mapping between the input and output responses. The accuracy of the FCN for structural response reconstruction was evaluated using acceleration data from an experimental arch rib and compared with several traditional methods. The approach was then applied to reconstruct actual acceleration responses measured by an SHM system on a long-span bridge. A parameter analysis explored how the available response positions, the number of available channels, and multi-channel reconstruction affect feasibility and accuracy. The results indicate that the method achieves high-precision response reconstruction in both the time and frequency domains and outperforms other networks, confirming its effectiveness under various sensor data loss scenarios.
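The following forward-pass sketch shows a 1-D fully convolutional network with one skip connection and inverted dropout. It assumes NumPy, "same" padding, and random untrained weights. Layer sizes, kernel width, and names are hypothetical, and the paper's exact architecture is not given here. The network maps available sensor channels to missing ones over the same time axis.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1d(x, w, b):
    """'Same'-padded 1-D convolution. x: (C_in, T), w: (C_out, C_in, K), K odd."""
    k = w.shape[2]
    t = x.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    windows = np.stack([xp[:, i:i + t] for i in range(k)], axis=-1)  # (C_in, T, K)
    return np.einsum("oik,itk->ot", w, windows) + b[:, None]


def relu(x):
    return np.maximum(0.0, x)


def dropout(x, rate, training):
    """Inverted dropout: zero units at random during training, identity at inference."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def init(c_out, c_in, k):
    return rng.normal(0.0, (c_in * k) ** -0.5, (c_out, c_in, k)), np.zeros(c_out)


def fcn_forward(x, params, rate=0.2, training=False):
    """Available channels (C_avail, T) -> reconstructed missing channels (C_miss, T)."""
    h1 = relu(conv1d(x, *params["enc"]))
    h2 = relu(conv1d(dropout(h1, rate, training), *params["mid"]))
    h = h1 + h2  # skip connection: low-level features bypass the middle layer
    return conv1d(h, *params["out"])


n_avail, n_missing, hidden = 3, 1, 16
params = {
    "enc": init(hidden, n_avail, 7),
    "mid": init(hidden, hidden, 7),
    "out": init(n_missing, hidden, 7),
}
x = rng.normal(size=(n_avail, 256))  # stand-in for measured acceleration records
y_hat = fcn_forward(x, params)
print(y_hat.shape)  # (1, 256)
```

Every layer is a convolution, so the same weights accept records of any length, which suits streaming SHM data. Training would typically fit `params` to pairs of available and missing channels recorded while all sensors were working.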