Journal articles: 15 results found
Design and implementation of control system for superconducting RSFQ circuit
Authors: 张阔中, HUANG Junying, ZHANG Hui, TANG Guangming, ZHANG Zhimin, YE Xiaochun. High Technology Letters (EI, CAS), 2023, Issue 4, pp. 335-347 (13 pages).
The superconducting rapid single flux quantum (RSFQ) integrated circuit is a promising solution for overcoming speed and power bottlenecks in high-performance computing systems in the post-Moore era. This paper presents an architecture designed to improve the speed and power limitations of high-performance computing systems using superconducting technology. Since superconducting microprocessors, which operate at cryogenic temperatures, require support from semiconductor circuits, the proposed design utilizes the von Neumann architecture with a superconducting RSFQ microprocessor, cryogenic semiconductor memory, a room-temperature field programmable gate array (FPGA) controller, and a host computer for input/output. Additionally, the paper introduces two key circuit designs: a start/stop controllable superconducting clock generator and an asynchronous communication interface between the RSFQ and semiconductor chips used to implement the control system. Experimental results demonstrate that the proposed design is feasible and effective, providing valuable insights for future superconducting computer systems.
Keywords: single flux quantum; superconducting rapid single flux quantum (RSFQ) circuit; superconducting control system; clock generator; asynchronous communication interface circuit
Optimized algorithm for image semantic segmentation compression algorithm in video surveillance scenarios
Authors: ZHANG Yangmei, ZHANG Xishan, ZHANG Shuo, LI Jintao. High Technology Letters, 2025, Issue 2, pp. 194-203 (10 pages).
In recent years, video coding has been widely applied in the field of video image processing to remove redundant information and improve data transmission efficiency. However, during the video coding process, irrelevant objects such as background elements are often encoded due to environmental disturbances, resulting in the wastage of computational resources. Existing research on video coding efficiency optimization primarily focuses on optimizing encoding units during intra-frame or inter-frame prediction after the generation of coding units, neglecting the optimization of video images before coding unit generation. To address this challenge, this work proposes an image semantic segmentation compression algorithm based on macroblock encoding (ISSC-ME), which consists of three modules. (1) The semantic label generation module generates labels for objects of interest using a grid-based approach to reduce redundant coding of consecutive frames. (2) The image segmentation network module generates a semantic segmentation image using U-Net. (3) The macroblock coding module is a block segmentation-based video encoding and decoding algorithm used to compress images and improve video transmission efficiency. Experimental results show that the proposed image semantic segmentation optimization algorithm can reduce computational costs, improve the overall accuracy by 1.00%, and improve the mean intersection over union (IoU) by 1.20%. In addition, the proposed compression algorithm utilizes macroblock fusion, achieving an image compression rate of 80.64%. It has been proven that the proposed algorithm greatly reduces data storage and transmission, and enables fast image compression processing at the millisecond level.
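As a rough illustration of the macroblock-selection idea described in the abstract above (this is not code from the paper; the function name, block size, and 0/1 mask format are assumptions), a block is encoded only when the semantic segmentation mask marks it as containing an object of interest:

```python
def compress_macroblocks(image, mask, block=4):
    """Keep only macroblocks that overlap the segmentation mask.

    image: 2D list of pixel values; mask: 2D list of 0/1 labels.
    Returns {(row, col): block_pixels} for retained blocks only.
    """
    h, w = len(image), len(image[0])
    kept = {}
    for r in range(0, h, block):
        for c in range(0, w, block):
            rows = range(r, min(r + block, h))
            cols = range(c, min(c + block, w))
            # A block is retained only if the mask marks any pixel in it;
            # background-only blocks are simply not encoded.
            if any(mask[i][j] for i in rows for j in cols):
                kept[(r, c)] = [[image[i][j] for j in cols] for i in rows]
    return kept
```

Dropping background-only blocks before encoding is what yields the compression-rate and computational-cost savings the abstract reports; the real system applies this inside a full video codec rather than on raw pixel lists.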
Keywords: macroblock encoding; semantic segmentation; segmentation compression
StM:a benchmark for evaluating generalization in reinforcement learning
Authors: YUAN Kaizhao, ZHANG Rui, PAN Yansong, YI Qi, PENG Shaohui, GUO Jiaming, HE Wenkai, HU Xing. High Technology Letters, 2025, Issue 2, pp. 118-130 (13 pages).
The challenge of enhancing the generalization capacity of reinforcement learning (RL) agents remains a formidable obstacle. Existing RL methods, despite achieving superhuman performance on certain benchmarks, often struggle with this aspect. A potential reason is that the benchmarks used for training and evaluation may not adequately offer a diverse set of transferable tasks. Although recent studies have developed benchmarking environments to address this shortcoming, they typically fall short in providing tasks that both ensure a solid foundation for generalization and exhibit significant variability. To overcome these limitations, this work introduces the concept that 'objects are composed of more fundamental components' in environment design, as implemented in the proposed environment called summon the magic (StM). This environment generates tasks where objects are derived from extensible and shareable basic components, facilitating strategy reuse and enhancing generalization. Furthermore, two new metrics, adaptation sensitivity range (ASR) and parameter correlation coefficient (PCC), are proposed to better capture and evaluate the generalization process of RL agents. Experimental results show that increasing the number of basic components of the object reduces the proximal policy optimization (PPO) agent's training-testing gap by 60.9% (in episode reward), significantly alleviating overfitting. Additionally, linear variations in other environmental factors, such as the training monster set proportion and the total number of basic components, uniformly decrease the gap by at least 32.1%. These results highlight StM's effectiveness in benchmarking and probing the generalization capabilities of RL algorithms.
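The training-testing gap in episode reward that the StM experiments report can be sketched as follows (a minimal illustration under assumed definitions; these helper names are invented here and are not the paper's ASR/PCC metrics):

```python
def generalization_gap(train_rewards, test_rewards):
    """Mean training-episode reward minus mean testing-episode reward.

    A smaller gap suggests the policy transfers better to unseen tasks.
    """
    return (sum(train_rewards) / len(train_rewards)
            - sum(test_rewards) / len(test_rewards))


def gap_reduction(gap_before, gap_after):
    """Relative reduction of the train-test gap, e.g. 0.609 for 60.9%."""
    return (gap_before - gap_after) / gap_before
```

Under these definitions, the reported "reduces the gap by 60.9%" corresponds to `gap_reduction` evaluating to 0.609 between two environment configurations.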
Keywords: reinforcement learning (RL); generalization; benchmark; environment
Cambricon-QR:a sparse and bitwise reproducible quantized training accelerator
Authors: 李楠, ZHAO Yongwei, ZHI Tian, LIU Chang, DU Zidong, HU Xing, LI Wei, ZHANG Xishan, LI Ling, SUN Guangzhong. High Technology Letters (EI, CAS), 2024, Issue 1, pp. 52-60 (9 pages).
Quantized training has been proven to be a prominent method for achieving deep neural network training under limited computational resources. It uses low bit-width arithmetic with a proper scaling factor to achieve negligible accuracy loss. Cambricon-Q is an ASIC design proposed to efficiently support quantized training, and it achieves significant performance improvement. However, there are still two caveats in the design. First, Cambricon-Q with different hardware specifications may lead to different numerical errors, resulting in non-reproducible behaviors, which may become a major concern in critical applications. Second, Cambricon-Q cannot leverage data sparsity, from which considerable cycles could still be squeezed out. To address these caveats, the acceleration core of Cambricon-Q is redesigned to support fine-grained irregular data processing. The new design not only enables acceleration on sparse data, but also enables performing local dynamic quantization by contiguous value ranges (which is hardware independent) instead of contiguous addresses (which depend on hardware factors). Experimental results show that the accuracy loss of the method remains negligible, and the accelerator achieves a 1.61x performance improvement over Cambricon-Q, with about a 10% energy increase.
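The distinction between grouping by value range and grouping by address can be sketched numerically (a toy interpretation, not the Cambricon-QR hardware algorithm; the group count, rounding, and scale choice here are assumptions). Partitioning the sorted values into contiguous magnitude ranges gives each range its own scale, and the partition does not depend on how values happen to be laid out in memory:

```python
def quantize_by_value_ranges(values, n_groups=2, bits=8):
    """Toy local dynamic quantization: partition values into contiguous
    magnitude ranges and pick one scale per range.

    Grouping by value (not by memory address) keeps the grouping
    independent of hardware layout factors such as lane or buffer width.
    """
    qmax = (1 << (bits - 1)) - 1
    # Order indices by magnitude so each group covers a contiguous value range.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    size = -(-len(values) // n_groups)  # ceiling division
    quantized = [0] * len(values)
    scales = [0.0] * len(values)
    for g in range(n_groups):
        group = order[g * size:(g + 1) * size]
        if not group:
            continue
        scale = max(abs(values[i]) for i in group) / qmax or 1.0
        for i in group:
            quantized[i] = round(values[i] / scale)
            scales[i] = scale
    return quantized, scales
```

Small and large values no longer share one scale, so the small ones keep more relative precision than a single global scale would allow.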
Keywords: quantized training; sparse accelerator; Cambricon-QR
A Survey of Hardware-Assisted Intra-Address Space Protections
Authors: Yue Jin, Tian-Yi Huang, Si-Yuan Zeng, Yi-Bin Xu, Han Wang, Tian-Yue Lu, Ming-Yu Chen. Journal of Computer Science & Technology, 2025, Issue 5, pp. 1347-1367 (21 pages).
With a similar threat model, conventional software mechanisms aimed at various levels of security can be categorized as intra-address space protection (IASP), including memory safety, control-flow integrity, syscall filtering, and isolation. When enhancing security, software-only IASP methods result in an expanded trusted computing base (TCB) and can lead to performance slowdowns, making it challenging to strike a balance between security and performance. Recent studies indicate that hardware-assisted methods enhance efficiency by encapsulating hardware primitives and utilizing specialized microarchitecture designs. They also enhance security by reducing the trusted computing base's attack surface. However, there has been limited discussion regarding the key challenges in current hardware-assisted IASP studies. This paper conducts a comprehensive survey of hardware-assisted IASP and discusses critical design issues, such as metadata management strategies, protection comprehensiveness, protection granularity, and processor complexity. Through a qualitative analysis of existing methods, this paper summarizes the research trends in hardware-assisted IASP technologies and emphasizes the importance of isolation models, access control strategies, and cross-compartment switching in future hardware-assisted IASP designs.
Keywords: software security; hardware-assisted security; intra-address space protection (IASP)
10-Million Atoms Simulation of First-Principle Package LS3DF (cited by 1)
Authors: 严昱瑾, 李海波, 赵曈, 汪林望, 石林, 刘涛, 谭光明, 贾伟乐, 孙凝晖. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2024, Issue 1, pp. 45-62 (18 pages).
The growing demand for semiconductor device simulation poses a big challenge for large-scale electronic structure calculations. Among various methods, the linearly scaling three-dimensional fragment (LS3DF) method exhibits excellent scalability in large-scale simulations. Based on algorithmic and system-level optimizations, we propose a highly scalable and highly efficient implementation of LS3DF on a domestic heterogeneous supercomputer equipped with accelerators. In terms of algorithmic optimizations, the original all-band conjugate gradient algorithm is refined to achieve faster convergence, and mixed-precision computing is adopted to increase overall efficiency. In terms of system-level optimizations, the original two-layer parallel structure is replaced by a coarse-grained parallel method. Optimization strategies such as multi-stream, kernel fusion, and redundant computation removal are proposed to further increase utilization of the computational power provided by the heterogeneous machines. As a result, our optimized LS3DF can scale to a 10-million silicon atom system, attaining a peak performance of 34.8 PFLOPS (21.2% of the peak). All the improvements can be adapted to next-generation supercomputers for larger simulations.
Keywords: single instruction multiple thread accelerator; electronic structure; high-performance computing; linearly scaling three-dimensional fragment (LS3DF)
A Survey of Non-Volatile Main Memory File Systems
Authors: 王盈, 贾文庆, 蒋德钧, 熊劲. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, Issue 2, pp. 348-372 (25 pages).
Non-volatile memories (NVMs) provide lower latency and higher bandwidth than block devices. Besides, NVMs are byte-addressable and provide persistence, so they can be used as memory-level storage devices (non-volatile main memory, NVMM). These features change the storage hierarchy and allow the CPU to access persistent data using load/store instructions. Thus, we can directly build a file system on NVMM. However, traditional file systems are designed for slow block devices. They use a deep and complex software stack to optimize file system performance. This design results in software overhead being the dominant factor affecting NVMM file systems. Besides, scalability, crash consistency, data protection, and cross-media storage should be reconsidered in NVMM file systems. We survey existing work on optimizing NVMM file systems. First, we analyze the problems that arise when directly using traditional file systems on NVMM, including heavy software overhead, limited scalability, inappropriate consistency guarantee techniques, etc. Second, we summarize the techniques of 30 typical NVMM file systems and analyze their advantages and disadvantages. Finally, we provide a few suggestions for designing a high-performance NVMM file system based on the real-hardware Optane DC persistent memory module. Specifically, we suggest applying various techniques to reduce software overheads, improving the scalability of the virtual file system (VFS), adopting highly concurrent data structures (e.g., locks and indexes), using memory protection keys (MPK) for data protection, and carefully designing data placement/migration for cross-media file systems.
Keywords: non-volatile main memory (NVMM); file system; performance; scalability; crash consistency; data protection; cross-media
AI Computing Systems for Large Language Models Training (cited by 1)
Authors: Zhen-Xing Zhang, Yuan-Bo Wen, Han-Qi Lyu, Chang Liu, Rui Zhang, Xia-Qing Li, Chao Wang, Zi-Dong Du, Qi Guo, Ling Li, Xue-Hai Zhou, Yun-Ji Chen. Journal of Computer Science & Technology, 2025, Issue 1, pp. 6-41 (36 pages).
In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for large language model (LLM) training. The rapid advancement of LLMs in recent years, coupled with the widespread adoption of algorithms and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to emphasize their substantial need for computational resources. These operations depend heavily on AI-specific accelerators like GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the increasing complexity of LLMs and the current capabilities of accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to manage the growing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical role of distributed computing strategies, memory management enhancements, and boosting computational efficiency. This paper clarifies the complex relationship between algorithm design, hardware infrastructure, and software optimization, and provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, offering insights into the challenges and potential avenues for future development and deployment.
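The encoder-only versus decoder-only distinction the survey draws can be made concrete with the attention mask each family uses (a generic illustration, not code from the paper): encoder-only models such as BERT attend bidirectionally, while decoder-only models use a causal mask so each position sees only earlier positions.

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len visibility mask.

    mask[i][j] == 1 means position j is visible to position i.
    causal=False: bidirectional attention (encoder-only models).
    causal=True: lower-triangular causal mask (decoder-only models).
    """
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]
```

The causal mask is why decoder-only training can compute a loss at every token position in one pass, which in turn shapes the distributed execution and memory-management strategies the survey discusses.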
Keywords: artificial intelligence (AI) chip; large language model (LLM); AI computing system; accelerator
FuHsi:Shifting Base-Calling Closer to Sequencer via In-Cache Acceleration
Authors: Ye-Wen Li, Guang-Ming Tan, Xue-Qi Li. Journal of Computer Science & Technology, 2025, Issue 2, pp. 482-499 (18 pages).
Base-calling is an essential step in the analysis of third-generation genome data. Many previous hardware efforts have aimed at accelerating processing in this workflow; however, an order-of-magnitude throughput gap still exists. In this paper, we propose FuHsi to improve the end-to-end throughput of the base-calling process. FuHsi is an in-cache accelerator that adds only three components to the traditional CPUs in the sequencer. We propose FuHsi Cache, which offloads the bottleneck operations to cache arithmetic. Specifically, we accelerate beam search, string conversion, and MAC (multiply-accumulate) operations using algorithm/hardware co-design. We also introduce FuHsi APIs and the FuHsi Controller to provide coarse-grained control over FuHsi Cache. Experimental results show that FuHsi achieves 45.7x, 113.1x, and 100x throughput-per-watt speedups compared with an NVIDIA Jetson baseline, an NVIDIA A100 GPU baseline, and the Helix accelerator, respectively. FuHsi can serve base-calling requests from up to 15 ONT sequencers simultaneously.
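The beam-search bottleneck mentioned above can be illustrated with a toy decoder (this is only the pruning pattern; a real base-caller decodes a CTC-style neural-network output, and the input format here is an assumption):

```python
def beam_search(step_probs, beam_width=2):
    """Toy beam search over per-position base probabilities.

    step_probs: list of dicts mapping base -> probability at each position.
    Returns the highest-probability base sequence under a fixed beam width.
    """
    beams = [("", 1.0)]  # (sequence so far, accumulated probability)
    for probs in step_probs:
        # Extend every surviving beam by every candidate base...
        candidates = [(seq + base, p * bp)
                      for seq, p in beams
                      for base, bp in probs.items()]
        # ...then keep only the beam_width most probable extensions.
        candidates.sort(key=lambda x: -x[1])
        beams = candidates[:beam_width]
    return beams[0][0]
```

The extend-score-prune loop runs once per signal position over long reads, which is why offloading it to in-cache arithmetic pays off.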
Keywords: genome base-calling; in-cache accelerator; domain-specific architecture; genome analysis; Nanopore sequencing
NapFS:A High-Performance Persistent Memory File System for Non-Uniform Memory Access Architectures
Authors: Wen-Qing Jia, De-Jun Jiang, Jin Xiong. Journal of Computer Science & Technology, 2025, Issue 4, pp. 1155-1171 (17 pages).
Persistent memory (PM) allows file systems to persist data directly on the memory bus. To increase the capacity of PM file systems, building a file system across sockets, each with attached PM, is attractive. However, accessing data across sockets incurs the impacts of the non-uniform memory access (NUMA) architecture, which leads to significant performance degradation. In this paper, we first use experiments to understand the NUMA impacts on building PM file systems. We then propose four design principles for building NapFS, a high-performance PM file system for the NUMA architecture. We architect NapFS with per-socket local PM file systems and per-socket dedicated IO thread pools. This not only allows applications to delegate data accesses to IO threads to avoid remote PM accesses, but also fully reuses existing single-socket PM file systems to reduce implementation complexity. Additionally, NapFS utilizes fast DRAM to accelerate performance by adding a global cache, and adopts a selective cache mechanism to eliminate the redundant double-copy overhead for synchronization operations. Lastly, we show that NapFS can adopt extended optimizations to improve scalability and the performance of critical requests. We evaluate NapFS against other multi-socket PM file systems. The evaluation results show that NapFS achieves 2.2x and 1.0x throughput improvements for Filebench and RocksDB, respectively.
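The IO-delegation pattern described above can be sketched with per-socket worker threads and queues (a toy model, not NapFS code; socket ownership is given explicitly here, whereas a real file system would derive it from data placement):

```python
import queue
import threading


def delegated_write(requests, n_sockets=2):
    """Toy IO delegation: each request is routed to the IO thread of the
    socket owning its data, so application threads never perform remote
    PM writes themselves.

    requests: list of (socket_id, payload).
    Returns the payloads written on each socket, in arrival order.
    """
    queues = [queue.Queue() for _ in range(n_sockets)]
    written = [[] for _ in range(n_sockets)]

    def io_worker(sid):
        while True:
            payload = queues[sid].get()
            if payload is None:  # shutdown sentinel
                return
            written[sid].append(payload)  # stands in for a local PM write

    threads = [threading.Thread(target=io_worker, args=(s,))
               for s in range(n_sockets)]
    for t in threads:
        t.start()
    for sid, payload in requests:
        queues[sid].put(payload)  # delegate instead of writing remotely
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return written
```

Only the enqueue crosses sockets; the actual write is always performed by a thread local to the PM it touches, which is the point of the delegation design.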
Keywords: non-uniform memory access (NUMA); persistent memory; file system; IO delegation
VastPipe:A High-Throughput Inference System via Adaptive Space-Division Multiplexing for Diverse Accelerators
Authors: Li-Xian Ma, Le-Ping Wang, En Shao, Rong-Yu Cao, Guang-Ming Tan. Journal of Computer Science & Technology, 2025, Issue 2, pp. 444-463 (20 pages).
The escalating demand for batched deep learning inference requires the concurrent deployment of multiple deep neural network (DNN) models on a shared accelerator, thereby enabling spatial multiplexing to enhance resource utilization. Spatial multiplexing for co-locating multiple model services on the same accelerator increases the complexity of scheduling within a cluster. The meticulous collaborative optimization of model co-location combinations and resource allocation in a cluster creates an extensive configuration space for scheduling. In this paper, we present VastPipe, a high-throughput inference system that schedules batch-oriented and heterogeneous requests on spatial-multiplexing-enabled computing clusters. VastPipe determines optimal scheduling configurations by jointly optimizing model co-location and resource allocation, using reinforcement learning to solve this combinatorial optimization problem. The experimental results demonstrate that on a large-scale cluster comprising 250 machine nodes with 1000 neural processing units (NPUs), VastPipe achieves average performance improvements of 2.2x, 1.3x, and 1.2x compared with the baseline systems, respectively. Furthermore, VastPipe is optimized and evaluated on mainstream GPUs. The results demonstrate that VastPipe achieves average throughput improvements of 2.7x on the NVIDIA A100 GPU and 1.9x on the AMD MI100 GPU. (The system name, dropped in the extracted text, is restored from the article title.)
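The combinatorial flavor of the co-location problem can be illustrated with a simple first-fit-decreasing packer (a greedy stand-in for the scheduling decision VastPipe learns with reinforcement learning; "resource units" here are an abstraction, and this is not the paper's algorithm):

```python
def colocate(models, capacity):
    """First-fit-decreasing co-location: pack model resource demands onto
    the fewest accelerators without exceeding each accelerator's capacity.

    models: dict of model name -> resource demand (abstract units).
    capacity: resource units available per accelerator.
    Returns a list of co-location groups, one per accelerator.
    """
    accelerators = []  # each entry: [remaining_capacity, [model names]]
    for name, demand in sorted(models.items(), key=lambda kv: -kv[1]):
        for acc in accelerators:
            if acc[0] >= demand:  # fits on an already-open accelerator
                acc[0] -= demand
                acc[1].append(name)
                break
        else:  # open a new accelerator for this model
            accelerators.append([capacity - demand, [name]])
    return [sorted(acc[1]) for acc in accelerators]
```

Even this one-dimensional toy has an exponential configuration space of alternative groupings, which motivates learning the policy rather than enumerating configurations.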
Keywords: cluster scheduling; resource management; reinforcement learning; DNN accelerator
AutoQNN: An End-to-End Framework for Automatically Quantizing Neural Networks
Authors: 龚成, 卢冶, 代素蓉, 邓倩, 杜承昆, 李涛. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2024, Issue 2, pp. 401-420 (20 pages).
Exploring the expected quantizing scheme with a suitable mixed-precision policy is the key to compressing deep neural networks (DNNs) with high efficiency and accuracy. This exploration implies heavy workloads for domain experts, so an automatic compression method is needed. However, the huge search space of the automatic method introduces a large computing budget that makes the automatic process challenging to apply in real scenarios. In this paper, we propose an end-to-end framework named AutoQNN for automatically quantizing different layers utilizing different schemes and bitwidths without any human labor. AutoQNN can efficiently seek desirable quantizing schemes and mixed-precision policies for mainstream DNN models by involving three techniques: quantizing scheme search (QSS), quantizing precision learning (QPL), and quantized architecture generation (QAG). QSS introduces five quantizing schemes and defines three new schemes as a candidate set for scheme search, and then uses the Differentiable Neural Architecture Search (DNAS) algorithm to seek the layer- or model-desired scheme from the set. QPL is, to the best of our knowledge, the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes. QPL efficiently optimizes both the classification loss and the precision loss of DNNs, and obtains a relatively optimal mixed-precision model within a limited model size and memory footprint. QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention, to facilitate end-to-end neural network quantization. We have implemented AutoQNN and integrated it into Keras. Extensive experiments demonstrate that AutoQNN can consistently outperform state-of-the-art quantization methods. For 2-bit weights and activations of AlexNet and ResNet18, AutoQNN achieves accuracy results of 59.75% and 68.86%, respectively, obtaining accuracy improvements of up to 1.65% and 1.74%, respectively, compared with state-of-the-art methods. Notably, compared with the full-precision AlexNet and ResNet18, the 2-bit models incur only slight accuracy degradations of 0.26% and 0.76%, respectively, which can fulfill practical application demands.
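The reparameterization idea behind QPL, making the discrete bitwidth choice learnable, can be sketched as a softmax-weighted expectation over candidate bitwidths (an illustrative reduction; the candidate set and this exact form are assumptions, and the real method jointly optimizes it with classification and precision losses):

```python
import math


def expected_bitwidth(logits, candidates=(2, 4, 8)):
    """Soft (differentiable) bitwidth: a softmax over candidate bitwidths
    turns the discrete precision choice into a learnable expectation.

    logits: one learnable score per candidate bitwidth.
    """
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * b for w, b in zip(weights, candidates))
```

Because the expectation is smooth in the logits, gradient descent can push a layer toward low precision where accuracy permits; at deployment the highest-weight candidate would be taken as the layer's bitwidth.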
Keywords: automatic quantization; mixed precision; quantizing scheme search; quantizing precision learning; quantized architecture generation
Automatic Target Description File Generation
Authors: 耿洪娜, 吕方, 钟茗, 崔慧敏, 薛景玲, 冯晓兵. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, Issue 6, pp. 1339-1355 (17 pages).
Agile hardware design is gaining increasing momentum and bringing new chips to the market faster and in larger quantities. However, it also poses new challenges for compiler developers, who must retarget existing compilers to these new chips in less time than ever before. Currently, retargeting a compiler backend, e.g., an LLVM backend, to a new target requires compiler developers to manually write a set of target description files (totalling 10,300+ lines of code (LOC) for RISC-V in LLVM), which is error-prone and time-consuming. In this paper, we introduce a new approach, Automatic Target Description File Generation (ATG), which accelerates the generation of a compiler backend for a new target by generating its target description files automatically. Given a new target, ATG proceeds in two stages. First, ATG synthesizes a small list of target-specific properties and a list of code-layout templates from the target description files of a set of existing targets with similar instruction set architectures (ISAs). Second, ATG requests compiler developers to fill in the information for each instruction in the new target in tabular form according to the synthesized list of target-specific properties, and then generates its target description files automatically according to the synthesized list of code-layout templates. The first stage can often be reused by different new targets sharing similar ISAs. We evaluate ATG using nine RISC-V instruction sets drawn from a total of 1029 instructions in LLVM 12.0. ATG enables compiler developers to generate compiler backends for these ISAs that emit the same assembly code as the existing compiler backends for RISC-V, but with significantly less development effort (by specifying each instruction in terms of up to 61 target-specific properties only).
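The tabular fill-in-then-expand workflow above amounts to template expansion over a property table, which can be sketched as follows (property names and the template string are invented for illustration, not LLVM TableGen syntax or ATG's actual templates):

```python
def emit_target_description(instructions, template):
    """Toy table-driven backend generation: developers supply one row of
    target-specific properties per instruction, and each row is expanded
    through a shared code-layout template into description text.
    """
    return "\n".join(template.format(**row) for row in instructions)
```

A hypothetical usage, with made-up RISC-V-flavored rows:

```python
rows = [{"name": "ADD", "opcode": "0b0110011", "operands": "rd, rs1, rs2"},
        {"name": "SUB", "opcode": "0b0110011", "operands": "rd, rs1, rs2"}]
tmpl = "def {name} : Inst<{opcode}> {{ operands = ({operands}); }}"
print(emit_target_description(rows, tmpl))
```

The point mirrors ATG's split: the template (stage one) is shared across similar ISAs, while only the per-instruction rows (stage two) are target-specific.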
Keywords: retargetability; compiler; target description; target backend; automatic generator
Hardware Acceleration for SLAM in Mobile Systems
Authors: 樊哲, 郝一帆, 支天, 郭崎, 杜子东. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, Issue 6, pp. 1300-1322 (23 pages).
The emerging mobile robot industry has spurred a flurry of interest in solving the simultaneous localization and mapping (SLAM) problem. However, existing SLAM platforms have difficulty meeting the real-time and low-power requirements imposed by mobile systems. Though specialized hardware is promising for achieving high performance and lowering power consumption, designing an efficient accelerator for SLAM is severely hindered by the wide variety of SLAM algorithms. Based on our detailed analysis of representative SLAM algorithms, we observe that SLAM algorithms pose two challenges for designing efficient hardware accelerators: a large number of computational primitives and irregular control flows. To address these two challenges, we propose a hardware accelerator that features composable computation units classified as matrix, vector, scalar, and control units. In addition, we design a hierarchical instruction set for coping with a broad range of SLAM algorithms with irregular control flows. Experimental results show that, compared against an Intel x86 processor, on average, our accelerator with an area of 7.41 mm^2 achieves 10.52x and 112.62x better performance and energy savings, respectively, across different datasets. Compared against a more energy-efficient ARM Cortex processor, our accelerator still achieves 33.03x and 62.64x better performance and energy savings, respectively.
Keywords: hardware accelerator; instruction set; mobile system; simultaneous localization and mapping (SLAM) algorithm
DyPipe: A Holistic Approach to Accelerating Dynamic Neural Networks with Dynamic Pipelining
Authors: 庄毅敏, 胡杏, 陈小兵, 支天. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, Issue 4, pp. 899-910 (12 pages).
Dynamic neural network (NN) techniques are increasingly important because they facilitate deep learning techniques with more complex network architectures. However, existing studies, which predominantly optimize static computational graphs by static scheduling methods, usually focus on optimizing static neural networks in deep neural network (DNN) accelerators. We analyze the execution process of dynamic neural networks and observe that dynamic features introduce challenges for efficient scheduling and pipelining in existing DNN accelerators. We propose DyPipe, a holistic approach to optimizing dynamic neural network inference in enhanced DNN accelerators. DyPipe achieves significant performance improvements for dynamic neural networks while introducing negligible overhead for static neural networks. Our evaluation demonstrates that DyPipe achieves a 1.7x speedup on dynamic neural networks and maintains more than 96% of performance for static neural networks.
Keywords: dynamic neural network (NN); deep neural network (DNN) accelerator; dynamic pipelining