The superconducting rapid single flux quantum (RSFQ) integrated circuit is a promising solution for overcoming speed and power bottlenecks in high-performance computing systems in the post-Moore era. This paper presents an architecture designed to address the speed and power limitations of high-performance computing systems using superconducting technology. Since superconducting microprocessors, which operate at cryogenic temperatures, require support from semiconductor circuits, the proposed design utilizes the von Neumann architecture with a superconducting RSFQ microprocessor, cryogenic semiconductor memory, a room-temperature field programmable gate array (FPGA) controller, and a host computer for input/output. Additionally, the paper introduces two key circuit designs: a start/stop controllable superconducting clock generator and an asynchronous communication interface between the RSFQ and semiconductor chips used to implement the control system. Experimental results demonstrate that the proposed design is feasible and effective, providing valuable insights for future superconducting computer systems.
In recent years, video coding has been widely applied in the field of video image processing to remove redundant information and improve data transmission efficiency. However, during the video coding process, irrelevant objects such as background elements are often encoded due to environmental disturbances, wasting computational resources. Existing research on video coding efficiency primarily focuses on optimizing encoding units during intra-frame or inter-frame prediction after the coding units have been generated, neglecting the optimization of video images before coding unit generation. To address this challenge, this work proposes an image semantic segmentation compression algorithm based on macroblock encoding (ISSC-ME), which consists of three modules. (1) The semantic label generation module generates labels for objects of interest using a grid-based approach to reduce redundant coding of consecutive frames. (2) The image segmentation network module generates a semantic segmentation image using U-Net. (3) The macroblock coding module is a block segmentation-based video encoding and decoding algorithm used to compress images and improve video transmission efficiency. Experimental results show that the proposed image semantic segmentation optimization algorithm reduces computational costs and improves the overall accuracy by 1.00% and the mean intersection over union (IoU) by 1.20%. In addition, the proposed compression algorithm utilizes macroblock fusion, achieving an image compression rate of 80.64%. The results show that the proposed algorithm greatly reduces data storage and transmission, and enables fast image compression at the millisecond level.
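The two segmentation metrics reported above, overall accuracy and mean IoU, can be illustrated with a minimal sketch. The function names and the toy label lists are assumptions made for illustration; this is the standard definition of the metrics, not the evaluation code of ISSC-ME.

```python
def overall_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground truth."""
    return sum(1 for p, t in zip(pred, target) if p == t) / len(target)


def mean_iou(pred, target, num_classes):
    """Mean intersection over union, averaged over classes that occur
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened label maps (hypothetical data): 0 = background, 1 = object.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]

print(overall_accuracy(pred, target))
print(mean_iou(pred, target, num_classes=2))
```

Mean IoU penalizes both missed and spurious pixels per class, which is why it is usually reported alongside overall accuracy rather than instead of it.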
The challenge of enhancing the generalization capacity of reinforcement learning (RL) agents remains a formidable obstacle. Existing RL methods, despite achieving superhuman performance on certain benchmarks, often struggle with this aspect. A potential reason is that the benchmarks used for training and evaluation may not adequately offer a diverse set of transferable tasks. Although recent studies have developed benchmarking environments to address this shortcoming, they typically fall short in providing tasks that both ensure a solid foundation for generalization and exhibit significant variability. To overcome these limitations, this work introduces the concept that 'objects are composed of more fundamental components' in environment design, as implemented in the proposed environment called Summon the Magic (StM). This environment generates tasks where objects are derived from extensible and shareable basic components, facilitating strategy reuse and enhancing generalization. Furthermore, two new metrics, adaptation sensitivity range (ASR) and parameter correlation coefficient (PCC), are proposed to better capture and evaluate the generalization process of RL agents. Experimental results show that increasing the number of basic components of the object reduces the proximal policy optimization (PPO) agent's training-testing gap by 60.9% (in episode reward), significantly alleviating overfitting. Additionally, linear variations in other environmental factors, such as the training monster set proportion and the total number of basic components, uniformly decrease the gap by at least 32.1%. These results highlight StM's effectiveness in benchmarking and probing the generalization capabilities of RL algorithms.
Quantized training has proven to be a prominent method for training deep neural networks under limited computational resources. It uses low bit-width arithmetic with a proper scaling factor to achieve negligible accuracy loss. Cambricon-Q is an ASIC design proposed to efficiently support quantized training, and it achieves significant performance improvement. However, there are still two caveats in the design. First, Cambricon-Q with different hardware specifications may produce different numerical errors, resulting in non-reproducible behaviors, which may become a major concern in critical applications. Second, Cambricon-Q cannot leverage data sparsity, where considerable cycles could still be squeezed out. To address these caveats, the acceleration core of Cambricon-Q is redesigned to support fine-grained irregular data processing. The new design not only enables acceleration on sparse data, but also enables local dynamic quantization by contiguous value ranges (which is hardware independent) instead of contiguous addresses (which depends on hardware factors). Experimental results show that the accuracy loss of the method remains negligible, and the accelerator achieves a 1.61× performance improvement over Cambricon-Q, with about a 10% energy increase.
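The contrast between grouping by contiguous addresses and grouping by contiguous value ranges can be sketched as follows. This is only an illustration of the general idea under simple assumptions (symmetric per-group scaling, a hypothetical `block_size` standing in for a hardware factor, and hypothetical magnitude `boundaries`); it is not the scheme implemented in the redesigned accelerator.

```python
def quantize_group(values, bits=8):
    """Symmetric quantization of one group with its own scale factor."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def quantize_by_address(data, block_size, bits=8):
    """Groups are fixed-size blocks in memory order, so the result
    depends on block_size (a stand-in for a hardware parameter)."""
    out = []
    for i in range(0, len(data), block_size):
        out.extend(quantize_group(data[i:i + block_size], bits))
    return out


def quantize_by_value_range(data, boundaries, bits=8):
    """Groups are defined by magnitude ranges, so the result depends
    only on the values, not on where they sit in memory."""
    groups = {}
    for idx, v in enumerate(data):
        bucket = sum(1 for b in boundaries if abs(v) >= b)
        groups.setdefault(bucket, []).append(idx)
    out = [0.0] * len(data)
    for idxs in groups.values():
        quantized = quantize_group([data[i] for i in idxs], bits)
        for i, q in zip(idxs, quantized):
            out[i] = q
    return out


# Hypothetical data mixing small and large magnitudes.
data = [0.01, 0.02, 5.0, 0.03, 4.0, 0.015, 0.025, 3.0]

print(quantize_by_address(data, block_size=2, bits=4))
print(quantize_by_address(data, block_size=4, bits=4))
print(quantize_by_value_range(data, boundaries=[1.0], bits=4))
```

In this sketch, changing `block_size` changes which small values share a scale with a large one, so the address-based output varies, whereas the value-range output is unaffected by data layout.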
With a similar threat model, conventional software mechanisms aimed at various levels of security can be categorized as intra-address space protection (IASP), including memory safety, control-flow integrity, syscall filtering, and isolation. When enhancing security, software-only IASP methods result in an expanded trusted computing base (TCB) and can lead to performance slowdowns, making it challenging to strike a balance between security and performance. Recent studies indicate that hardware-assisted methods enhance efficiency by encapsulating hardware primitives and utilizing specialized microarchitecture designs. They also enhance security by reducing the attack surface of the TCB. However, there has been limited discussion of the key challenges in current hardware-assisted IASP studies. This paper conducts a comprehensive survey of hardware-assisted IASP and discusses critical design issues, such as metadata management strategies, protection comprehensiveness, protection granularity, and processor complexity. Through a qualitative analysis of existing methods, this paper summarizes the research trends in hardware-assisted IASP technologies and emphasizes the importance of isolation models, access control strategies, and cross-compartment switching in future hardware-assisted IASP designs.
The growing demand for semiconductor device simulation poses a big challenge for large-scale electronic structure calculations. Among various methods, the linearly scaling three-dimensional fragment (LS3DF) method exhibits excellent scalability in large-scale simulations. Based on algorithmic and system-level optimizations, we propose a highly scalable and highly efficient implementation of LS3DF on a domestic heterogeneous supercomputer equipped with accelerators. In terms of algorithmic optimizations, the original all-band conjugate gradient algorithm is refined to achieve faster convergence, and mixed precision computing is adopted to increase overall efficiency. In terms of system-level optimizations, the original two-layer parallel structure is replaced by a coarse-grained parallel method. Optimization strategies such as multi-stream, kernel fusion, and redundant computation removal are proposed to further increase utilization of the computational power provided by the heterogeneous machines. As a result, our optimized LS3DF can scale to a system of 10 million silicon atoms, attaining a peak performance of 34.8 PFLOPS (21.2% of the peak). All the improvements can be adapted to next-generation supercomputers for larger simulations.
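As a quick check of the efficiency figure, the machine peak implied by "34.8 PFLOPS (21.2% of the peak)" can be back-computed. The helper name below is an assumption, and the resulting value is derived from the two rounded numbers in the abstract rather than stated by the authors.

```python
def implied_peak(achieved_pflops, fraction_of_peak):
    """Machine peak implied by an achieved rate and its share of peak."""
    return achieved_pflops / fraction_of_peak


# Figures quoted in the abstract: 34.8 PFLOPS at 21.2% of the peak.
peak = implied_peak(34.8, 0.212)
print(f"implied peak: about {peak:.1f} PFLOPS")
```

Because both inputs are rounded, the result should be read as an approximation of roughly 164 PFLOPS, not as a specification of the machine.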
Non-volatile memories (NVMs) provide lower latency and higher bandwidth than block devices. Besides, NVMs are byte-addressable and provide persistence, so they can be used as memory-level storage devices (non-volatile main memory, NVMM). These features change the storage hierarchy and allow the CPU to access persistent data using load/store instructions. Thus, we can directly build a file system on NVMM. However, traditional file systems are designed for slow block devices. They use a deep and complex software stack to optimize file system performance. This design makes software overhead the dominant factor affecting NVMM file systems. Besides, scalability, crash consistency, data protection, and cross-media storage should be reconsidered in NVMM file systems. We survey existing work on optimizing NVMM file systems. First, we analyze the problems of directly using traditional file systems on NVMM, including heavy software overhead, limited scalability, inappropriate consistency guarantee techniques, etc. Second, we summarize the techniques of 30 typical NVMM file systems and analyze their advantages and disadvantages. Finally, we provide a few suggestions for designing a high-performance NVMM file system based on real hardware, the Optane DC persistent memory module. Specifically, we suggest applying various techniques to reduce software overheads, improving the scalability of the virtual file system (VFS), adopting highly concurrent data structures (e.g., lock and index), using memory protection keys (MPK) for data protection, and carefully designing data placement/migration for cross-media file systems.
In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for large language model (LLM) training. The rapid advancement of LLMs in recent years, coupled with the widespread adoption of algorithms and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to emphasize their substantial need for computational resources. These operations depend heavily on AI-specific accelerators like GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the increasing complexity of LLMs and the current capabilities of accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to manage the growing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical role of distributed computing strategies, memory management enhancements, and computational efficiency improvements. This paper clarifies the complex relationship between algorithm design, hardware infrastructure, and software optimization, and provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, offering insights into the challenges and potential avenues for future development and deployment.
Base-calling is an essential step in the analysis of third-generation genome data. Many previous hardware efforts aimed at enhancing processing in the workflow. However, an order-of-magnitude throughput gap still exists. In this paper, we propose FuHsi to improve the end-to-end throughput of the base-calling process. FuHsi is an in-cache accelerator that introduces only three components to the traditional CPUs in the sequencer. We propose FuHsi Cache, which offloads the bottleneck operations to cache arithmetic. Specifically, we accelerate beam search, string conversion, and MAC (multiply-accumulate) using algorithm/hardware co-design. We also introduce FuHsi APIs and FuHsi Controller to provide coarse-grained control for FuHsi Cache. Experimental results show that FuHsi can achieve 45.7x, 113.1x, and 100x throughput-per-watt speedup compared with an NVIDIA Jetson baseline, an NVIDIA A100 GPU baseline, and the Helix accelerator, respectively. FuHsi can serve base-calling requests for up to 15 ONT sequencers simultaneously.
Persistent memory (PM) allows file systems to directly persist data on the memory bus. To increase the capacity of PM file systems, building a file system across sockets, each with its attached PM, is attractive. However, accessing data across sockets is subject to the non-uniform memory access (NUMA) architecture, which leads to significant performance degradation. In this paper, we first use experiments to understand the NUMA impacts on building PM file systems. We then propose four design principles for building NapFS, a high-performance PM file system for the NUMA architecture. We architect NapFS with per-socket local PM file systems and per-socket dedicated IO thread pools. This not only allows applications to delegate data accesses to IO threads to avoid remote PM accesses, but also fully reuses existing single-socket PM file systems to reduce implementation complexity. Additionally, NapFS utilizes fast DRAM to accelerate performance by adding a global cache, and adopts a selective cache mechanism to eliminate the redundant double-copy overhead for synchronization operations. Lastly, we show that NapFS can adopt extended optimizations to improve scalability and the performance of critical requests. We evaluate NapFS against other multi-socket PM file systems. The evaluation results show that NapFS achieves 2.2x and 1.0x throughput improvement for Filebench and RocksDB, respectively.
The escalating demand for batched deep learning inference requires concurrent deployment of multiple deep neural network (DNN) models on a shared accelerator, thereby enabling spatial multiplexing to enhance resource utilization. Spatial multiplexing for co-locating multiple model services on the same accelerator increases the complexity of scheduling within a cluster. The collaborative optimization of model co-location combinations and resource allocation in a cluster creates an extensive configuration space for scheduling. In this paper, we present a high-throughput inference system that schedules batch-oriented and heterogeneous requests on spatial multiplexing-enabled computing clusters. The system determines optimal scheduling configurations by jointly optimizing model co-location and resource allocation, using reinforcement learning to solve this combinatorial optimization problem. The experimental results demonstrate that on a large-scale cluster comprising 250 machine nodes with 1000 neural processing units (NPUs), the system achieves average performance improvements of 2.2x, 1.3x, and 1.2x compared with the baseline systems, respectively. Furthermore, the system is optimized and evaluated on mainstream GPUs. The results demonstrate that it achieves average throughput improvements of 2.7x on the NVIDIA A100 GPU and 1.9x on the AMD MI100 GPU.
Finding a quantizing scheme with a suitable mixed-precision policy is the key to compressing deep neural networks (DNNs) with high efficiency and accuracy. This exploration implies heavy workloads for domain experts, so an automatic compression method is needed. However, the huge search space of the automatic method demands a large computing budget, which makes the automatic process challenging to apply in real scenarios. In this paper, we propose an end-to-end framework named AutoQNN for automatically quantizing different layers with different schemes and bitwidths without any human labor. AutoQNN can efficiently seek desirable quantizing schemes and mixed-precision policies for mainstream DNN models by involving three techniques: quantizing scheme search (QSS), quantizing precision learning (QPL), and quantized architecture generation (QAG). QSS introduces five quantizing schemes and defines three new schemes as a candidate set for scheme search, and then uses the Differentiable Neural Architecture Search (DNAS) algorithm to seek the layer- or model-desired scheme from the set. To the best of our knowledge, QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes. QPL efficiently optimizes both the classification loss and the precision loss of DNNs and obtains a relatively optimal mixed-precision model within a limited model size and memory footprint. QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention, to facilitate end-to-end neural network quantization. We have implemented AutoQNN and integrated it into Keras. Extensive experiments demonstrate that AutoQNN consistently outperforms state-of-the-art quantization. For 2-bit weight and activation of AlexNet and ResNet18, AutoQNN achieves accuracies of 59.75% and 68.86%, respectively, improving accuracy by up to 1.65% and 1.74%, respectively, compared with state-of-the-art methods. In particular, compared with the full-precision AlexNet and ResNet18, the 2-bit models incur only slight accuracy degradation of 0.26% and 0.76%, respectively, which can fulfill practical application demands.
Agile hardware design is gaining increasing momentum and bringing new chips in larger quantities to the market faster. However, it also poses new challenges for compiler developers, who must retarget existing compilers to these new chips in a shorter time than ever before. Currently, retargeting a compiler backend, e.g., an LLVM backend to a new target, requires compiler developers to manually write a set of target description files (totalling 10300+ lines of code (LOC) for RISC-V in LLVM), which is error-prone and time-consuming. In this paper, we introduce a new approach, Automatic Target Description File Generation (ATG), which accelerates the generation of a compiler backend for a new target by generating its target description files automatically. Given a new target, ATG proceeds in two stages. First, ATG synthesizes a small list of target-specific properties and a list of code-layout templates from the target description files of a set of existing targets with similar instruction set architectures (ISAs). Second, ATG requests compiler developers to fill in the information for each instruction in the new target in tabular form according to the list of synthesized target-specific properties, and then generates its target description files automatically according to the list of synthesized code-layout templates. The first stage can often be reused by different new targets sharing similar ISAs. We evaluate ATG using nine RISC-V instruction sets drawn from a total of 1029 instructions in LLVM 12.0. ATG enables compiler developers to generate compiler backends for these ISAs that emit the same assembly code as the existing compiler backends for RISC-V but with significantly less development effort (by specifying each instruction in terms of up to 61 target-specific properties only).
The emerging mobile robot industry has spurred a flurry of interest in solving the simultaneous localization and mapping (SLAM) problem. However, existing SLAM platforms have difficulty meeting the real-time and low-power requirements imposed by mobile systems. Though specialized hardware is promising for achieving high performance and lowering power, designing an efficient accelerator for SLAM is severely hindered by the wide variety of SLAM algorithms. Based on our detailed analysis of representative SLAM algorithms, we observe that SLAM algorithms pose two challenges for designing efficient hardware accelerators: the large number of computational primitives and irregular control flows. To address these two challenges, we propose a hardware accelerator that features composable computation units classified as the matrix, vector, scalar, and control units. In addition, we design a hierarchical instruction set for coping with a broad range of SLAM algorithms with irregular control flows. Experimental results show that, compared against an Intel x86 processor, our accelerator with an area of 7.41 mm^2 achieves on average 10.52x and 112.62x better performance and energy savings, respectively, across different datasets. Compared against a more energy-efficient ARM Cortex processor, our accelerator still achieves 33.03x and 62.64x better performance and energy savings, respectively.
Dynamic neural network (NN) techniques are increasingly important because they facilitate deep learning techniques with more complex network architectures. However, existing studies, which predominantly optimize static computational graphs with static scheduling methods, usually focus on optimizing static neural networks in deep neural network (DNN) accelerators. We analyze the execution process of dynamic neural networks and observe that dynamic features introduce challenges for efficient scheduling and pipelining in existing DNN accelerators. We propose DyPipe, a holistic approach to optimizing dynamic neural network inference in enhanced DNN accelerators. DyPipe achieves significant performance improvements for dynamic neural networks while introducing negligible overhead for static neural networks. Our evaluation demonstrates that DyPipe achieves a 1.7x speedup on dynamic neural networks and maintains more than 96% performance for static neural networks.
Funding (superconducting RSFQ architecture paper): the Strategic Priority Research Program of Chinese Academy of Sciences (No. XDA18000000); the National Natural Science Foundation of China (No. 61732018, 61872335).
Funding (StM reinforcement learning benchmark paper): Supported by the National Key R&D Program of China (No. 2023YFB4502200); the National Natural Science Foundation of China (No. U22A2028, 61925208, 62222214, 62341411, 62102398, 62102399, U20A20227, 62302478, 62302482, 62302483, 62302480, 62302481); the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB0660300, XDB0660301, XDB0660302); the Chinese Academy of Sciences Project for Young Scientists in Basic Research (No. YSBR-029); the Youth Innovation Promotion Association of Chinese Academy of Sciences; and the Xplore Prize.
Funding (Cambricon-Q quantized training paper): the National Key Research and Development Program of China (No. 2022YFB4501601); the National Natural Science Foundation of China (No. 62102398, U20A20227, 62222214, 62002338, U22A2028, U19B2019); the Chinese Academy of Sciences Project for Young Scientists in Basic Research (YSBR-029); the Youth Innovation Promotion Association of Chinese Academy of Sciences.
Funding (hardware-assisted IASP survey): supported in part by the Strategic Priority Research Program of Chinese Academy of Sciences (CAS) under Grant Nos. XDA0320000 and XDA0320300.
Abstract: With a similar threat model, conventional software mechanisms aimed at various levels of security can be categorized as intra-address space protection (IASP), including memory safety, control-flow integrity, syscall filtering, and isolation. When enhancing security, software-only IASP methods result in an expanded trusted computing base (TCB) and can lead to performance slowdowns, making it challenging to strike a balance between security and performance. Recent studies indicate that hardware-assisted methods enhance efficiency by encapsulating hardware primitives and utilizing specialized microarchitecture designs. They also enhance security by reducing the attack surface of the TCB. However, there has been limited discussion regarding the key challenges in current hardware-assisted IASP studies. This paper conducts a comprehensive survey of hardware-assisted IASP and discusses critical design issues, such as metadata management strategies, protection comprehensiveness, protection granularity, and processor complexity. Through a qualitative analysis of existing methods, this paper summarizes the research trends in hardware-assisted IASP technologies and emphasizes the importance of isolation models, access control strategies, and cross-compartment switching in future hardware-assisted IASP designs.
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2021YFB0300600, the National Natural Science Foundation of China under Grant Nos. 92270206, T2125013, 62032023, 61972377, T2293702, and 12274360, the Chinese Academy of Sciences Project for Young Scientists in Basic Research under Grant No. YSBR-005, the Network Information Project of Chinese Academy of Sciences under Grant No. CASWX2021SF-0103, and the Key Research Program of Chinese Academy of Sciences under Grant No. ZDBSSSW-WHC002.
Abstract: The growing demand for semiconductor device simulation poses a big challenge for large-scale electronic structure calculations. Among various methods, the linearly scaling three-dimensional fragment (LS3DF) method exhibits excellent scalability in large-scale simulations. Based on algorithmic and system-level optimizations, we propose a highly scalable and highly efficient implementation of LS3DF on a domestic heterogeneous supercomputer equipped with accelerators. In terms of algorithmic optimizations, the original all-band conjugate gradient algorithm is refined to achieve faster convergence, and mixed-precision computing is adopted to increase overall efficiency. In terms of system-level optimizations, the original two-layer parallel structure is replaced by a coarse-grained parallel method. Optimization strategies such as multi-stream execution, kernel fusion, and redundant computation removal are proposed to further increase utilization of the computational power provided by the heterogeneous machines. As a result, our optimized LS3DF can scale to a system of 10 million silicon atoms, attaining a peak performance of 34.8 PFLOPS (21.2% of the peak). All the improvements can be adapted to next-generation supercomputers for larger simulations.
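As a quick arithmetic check on the efficiency figure, the short Python sketch below derives the machine peak implied by the two numbers in the abstract. The function name is hypothetical, and the implied peak is only a back-of-the-envelope inference from 34.8 PFLOPS and 21.2%, not a figure stated by the authors.

```python
def implied_peak(achieved_pflops, fraction_of_peak):
    """Return the theoretical peak implied by an achieved rate and its efficiency."""
    if not 0 < fraction_of_peak <= 1:
        raise ValueError("fraction_of_peak must be in (0, 1]")
    return achieved_pflops / fraction_of_peak


# 34.8 PFLOPS at 21.2% efficiency implies a peak of roughly 164 PFLOPS.
print(round(implied_peak(34.8, 0.212), 1))
```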
Funding: Supported by the Major Research Plan of the National Natural Science Foundation of China under Grant No. 92270202 and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDB44030200.
Abstract: Non-volatile memories (NVMs) provide lower latency and higher bandwidth than block devices. Besides, NVMs are byte-addressable and provide persistence, so they can be used as memory-level storage devices (non-volatile main memory, NVMM). These features change the storage hierarchy and allow the CPU to access persistent data using load/store instructions. Thus, we can directly build a file system on NVMM. However, traditional file systems are designed for slow block devices. They use a deep and complex software stack to optimize file system performance. This design makes software overhead the dominant factor affecting NVMM file systems. Besides, scalability, crash consistency, data protection, and cross-media storage should be reconsidered in NVMM file systems. We survey existing work on optimizing NVMM file systems. First, we analyze the problems of directly using traditional file systems on NVMM, including heavy software overhead, limited scalability, inappropriate consistency guarantee techniques, etc. Second, we summarize the techniques of 30 typical NVMM file systems and analyze their advantages and disadvantages. Finally, we provide a few suggestions for designing a high-performance NVMM file system based on real hardware, the Optane DC persistent memory module. Specifically, we suggest applying various techniques to reduce software overhead, improving the scalability of the virtual file system (VFS), adopting highly concurrent data structures (e.g., locks and indexes), using memory protection keys (MPK) for data protection, and carefully designing data placement/migration for cross-media file systems.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61925208, U22A2028, 62302483, 62222214, 62341411, 62102399, and 62372436, the Chinese Academy of Sciences (CAS) Project for Young Scientists in Basic Research under Grant No. YSBR-029, the Youth Innovation Promotion Association of CAS, and Xplore Prize.
Abstract: In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for large language model (LLM) training. The rapid advancement of LLMs in recent years, coupled with the widespread adoption of algorithms and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to emphasize their substantial need for computational resources. These operations depend heavily on AI-specific accelerators like GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the increasing complexity of LLMs and the current capabilities of accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to manage the growing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical role of distributed computing strategies, memory management enhancements, and improvements in computational efficiency. This paper clarifies the complex relationship between algorithm design, hardware infrastructure, and software optimization, and provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, offering insights into the challenges and potential avenues for future development and deployment.
Funding: Supported in part by the National Key Research and Development Program of China under Grant Nos. 2021YFB0300202 and 2022YFB4500403 and the National Natural Science Foundation of China under Grant Nos. 62202454, 62032023, and T2125013.
Abstract: Base-calling is an essential step in the analysis of third-generation genome data. Many previous hardware efforts aimed at enhancing processing in the workflow. However, an order-of-magnitude throughput gap still exists. In this paper, we propose FuHsi to improve the end-to-end throughput of the base-calling process. FuHsi is an in-cache accelerator that introduces only three components to the traditional CPUs in the sequencer. We propose FuHsi Cache, which offloads the bottleneck operations to cache arithmetic. Specifically, we accelerate beam search, string conversion, and MAC (multiply-accumulate) using algorithm/hardware co-design. We also introduce FuHsi APIs and FuHsi Controller to provide coarse-grained control for FuHsi Cache. Experimental results show that FuHsi can achieve 45.7x, 113.1x, and 100x throughput-per-watt speedup compared with an NVIDIA Jetson baseline, an NVIDIA A100 GPU baseline, and the Helix accelerator, respectively. FuHsi can serve base-calling requests for up to 15 ONT sequencers simultaneously.
Funding: Supported by the Major Research Plan of the National Natural Science Foundation of China under Grant No. 92270202 and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDB44030200.
Abstract: Persistent memory (PM) allows file systems to directly persist data on the memory bus. To increase the capacity of PM file systems, building a file system across sockets, each with attached PM, is attractive. However, accessing data across sockets is subject to the effects of the non-uniform memory access (NUMA) architecture, which lead to significant performance degradation. In this paper, we first use experiments to understand the NUMA impacts on building PM file systems. We then propose four design principles for building NapFS, a high-performance PM file system for the NUMA architecture. We architect NapFS with per-socket local PM file systems and per-socket dedicated IO thread pools. This not only allows applications to delegate data accesses to IO threads to avoid remote PM accesses, but also fully reuses existing single-socket PM file systems to reduce implementation complexity. Additionally, NapFS utilizes fast DRAM to accelerate performance by adding a global cache and adopts a selective cache mechanism to eliminate the redundant double-copy overhead for synchronization operations. Lastly, we show that NapFS can adopt extended optimizations to improve scalability and the performance of critical requests. We evaluate NapFS against other multi-socket PM file systems. The evaluation results show that NapFS achieves 2.2x and 1.0x throughput improvement for Filebench and RocksDB, respectively.
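The delegation idea (per-socket local file systems served by per-socket dedicated IO thread pools) can be sketched in Python as follows. This is only an illustration under stated assumptions: `DelegatingStore`, the byte-sum placement rule, and the in-memory dictionaries standing in for local PM file systems are all hypothetical, and Python threads are not actually pinned to sockets.

```python
from concurrent.futures import ThreadPoolExecutor


class DelegatingStore:
    """Toy model: each 'socket' owns a local store and a dedicated thread pool."""

    def __init__(self, num_sockets, threads_per_socket=2):
        self.num_sockets = num_sockets
        # One dict per socket stands in for a single-socket PM file system.
        self.local_fs = [dict() for _ in range(num_sockets)]
        # One pool per socket stands in for that socket's IO threads.
        self.pools = [
            ThreadPoolExecutor(max_workers=threads_per_socket)
            for _ in range(num_sockets)
        ]

    def _home_socket(self, path):
        # Hypothetical deterministic placement: byte sum modulo socket count.
        return sum(path.encode()) % self.num_sockets

    def write(self, path, data):
        # Delegate to the pool of the socket that owns the data, so the
        # access itself would be local to that socket.
        s = self._home_socket(path)
        return self.pools[s].submit(self.local_fs[s].__setitem__, path, data)

    def read(self, path):
        s = self._home_socket(path)
        return self.pools[s].submit(self.local_fs[s].get, path)

    def close(self):
        for pool in self.pools:
            pool.shutdown(wait=True)


store = DelegatingStore(num_sockets=2)
store.write("/data/a", b"hello").result()   # wait for the delegated write
print(store.read("/data/a").result())
store.close()
```

The caller never touches a remote store directly; it only submits a request to the owning socket's pool and waits on the returned future.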
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2021YFB0300202, the National Natural Science Foundation of China under Grant Nos. 62032023, T2125013, and 62102396, the Beijing Nova Program under Grant No. Z211100002121143, the Youth Innovation Promotion Association of Chinese Academy of Sciences under Grant No. 2021099, the Innovation Funding of Institute of Computing Technology, Chinese Academy of Sciences under Grant No. E461030, and the Tianjin Science and Technology Plan Project under Grant No. 24ZXKJGX00060.
Abstract: The escalating demand for batched deep learning inference requires concurrent deployment of multiple deep neural network (DNN) models on a shared accelerator, thereby enabling spatial multiplexing to enhance resource utilization. Spatial multiplexing for co-locating multiple model services on the same accelerator increases the complexity of scheduling within a cluster. The meticulous collaborative optimization of model co-location combinations and resource allocation in a cluster creates an extensive configuration space for scheduling. In this paper, we present a high-throughput inference system that schedules batch-oriented and heterogeneous requests on spatial multiplexing-enabled computing clusters. The system determines optimal scheduling configurations by jointly optimizing model co-location and resource allocation, using reinforcement learning to solve this combinatorial optimization problem. The experimental results demonstrate that on a large-scale cluster comprising 250 machine nodes with 1000 neural processing units (NPUs), the system achieves average performance improvements of 2.2x, 1.3x, and 1.2x compared with the baseline systems, respectively. Furthermore, the system is optimized and evaluated on mainstream GPUs. The results demonstrate that it achieves average throughput improvements of 2.7x on the NVIDIA A100 GPU and 1.9x on the AMD MI100 GPU.
Funding: Supported by the China Postdoctoral Science Foundation under Grant No. 2022M721707, the National Natural Science Foundation of China under Grant Nos. 62002175 and 62272248, the Special Funding for Excellent Enterprise Technology Correspondent of Tianjin under Grant No. 21YDTPJC00380, and the Open Project Foundation of Information Security Evaluation Center of Civil Aviation, Civil Aviation University of China, under Grant No. ISECCA-202102.
Abstract: Finding a suitable quantizing scheme and mixed-precision policy is the key to compressing deep neural networks (DNNs) with high efficiency and accuracy. This exploration implies heavy workloads for domain experts, so an automatic compression method is needed. However, the huge search space of the automatic method incurs a large computing cost, which makes the automatic process hard to apply in real scenarios. In this paper, we propose an end-to-end framework named AutoQNN for automatically quantizing different layers with different schemes and bitwidths without any human labor. AutoQNN can efficiently seek desirable quantizing schemes and mixed-precision policies for mainstream DNN models by involving three techniques: quantizing scheme search (QSS), quantizing precision learning (QPL), and quantized architecture generation (QAG). QSS introduces five quantizing schemes and defines three new schemes as a candidate set for scheme search, and then uses the Differentiable Neural Architecture Search (DNAS) algorithm to seek the layer- or model-desired scheme from the set. To the best of our knowledge, QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes. QPL efficiently optimizes both the classification loss and the precision loss of DNNs and obtains a relatively optimal mixed-precision model within a limited model size and memory footprint. QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention, to facilitate end-to-end neural network quantization. We have implemented AutoQNN and integrated it into Keras. Extensive experiments demonstrate that AutoQNN can consistently outperform state-of-the-art quantization. For 2-bit weights and activations of AlexNet and ResNet18, AutoQNN achieves accuracies of 59.75% and 68.86%, respectively, and obtains accuracy improvements of up to 1.65% and 1.74%, respectively, compared with state-of-the-art methods. In particular, compared with the full-precision AlexNet and ResNet18, the 2-bit models incur only slight accuracy degradation of 0.26% and 0.76%, respectively, which can fulfill practical application demands.
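To make the notions of a low-bit quantizing scheme and a mixed-precision policy concrete, here is a minimal Python sketch. It uses a plain uniform quantizer, which is a swapped-in textbook scheme and not one of AutoQNN's searched schemes; the function names and the example layer sizes are hypothetical.

```python
def uniform_quantize(weights, bits):
    """Snap each weight to the nearest of 2**bits evenly spaced levels in [min, max]."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)          # a constant tensor needs no quantization
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def model_size_bits(layer_param_counts, layer_bitwidths):
    """Total weight storage under a per-layer (mixed-precision) bitwidth policy."""
    return sum(n * b for n, b in zip(layer_param_counts, layer_bitwidths))


weights = [-1.0, -0.2, 0.3, 1.0]
print(uniform_quantize(weights, bits=2))        # at most 4 distinct values
print(model_size_bits([100, 200], [2, 4]))      # 100*2 + 200*4 bits
```

A mixed-precision policy in this sketch is just the list of per-layer bitwidths; a learned policy would choose that list to trade accuracy against the size computed by `model_size_bits`.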
Funding: Supported by the Strategic Pilot Science and Technology Project of Chinese Academy of Sciences (Category C) under Grant No. XDC05000000 and the Youth Program of National Natural Science Foundation of China under Grant No. 61802368.
Abstract: Agile hardware design is gaining increasing momentum and bringing new chips to the market faster and in larger quantities. However, it also poses new challenges for compiler developers, who must retarget existing compilers to these new chips in shorter time than ever before. Currently, retargeting a compiler backend, e.g., an LLVM backend, to a new target requires compiler developers to manually write a set of target description files (totalling 10300+ lines of code (LOC) for RISC-V in LLVM), which is error-prone and time-consuming. In this paper, we introduce a new approach, Automatic Target Description File Generation (ATG), which accelerates the generation of a compiler backend for a new target by generating its target description files automatically. Given a new target, ATG proceeds in two stages. First, ATG synthesizes a small list of target-specific properties and a list of code-layout templates from the target description files of a set of existing targets with similar instruction set architectures (ISAs). Second, ATG requests compiler developers to fill in the information for each instruction in the new target in tabular form according to the list of synthesized target-specific properties, and then generates its target description files automatically according to the list of synthesized code-layout templates. The first stage can often be reused by different new targets sharing similar ISAs. We evaluate ATG using nine RISC-V instruction sets drawn from a total of 1029 instructions in LLVM 12.0. ATG enables compiler developers to generate compiler backends for these ISAs that emit the same assembly code as the existing compiler backends for RISC-V but with significantly less development effort (by specifying each instruction in terms of up to 61 target-specific properties only).
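The second stage (a table of per-instruction properties expanded through code-layout templates) can be sketched in Python as follows. This is an illustrative assumption, not ATG's implementation: the `RECORD` template, the property names `name`, `opcode`, and `operands`, and the output format are hypothetical and only loosely resemble a target description record; they are not valid LLVM TableGen.

```python
from string import Template

# Hypothetical code-layout template; in the approach described above such
# templates would be synthesized from existing targets, not written by hand.
RECORD = Template("def $name : Inst<opcode=$opcode, operands=($operands)>;")


def generate_descriptions(rows):
    """Expand each table row (one instruction) through the layout template."""
    lines = []
    for row in rows:
        lines.append(
            RECORD.substitute(
                name=row["name"],
                opcode=row["opcode"],
                operands=", ".join(row["operands"]),
            )
        )
    return "\n".join(lines)


# Developer-filled table: one row per instruction, one column per property.
table = [
    {"name": "ADD", "opcode": "0b0110011", "operands": ["rd", "rs1", "rs2"]},
    {"name": "ADDI", "opcode": "0b0010011", "operands": ["rd", "rs1", "imm12"]},
]
print(generate_descriptions(table))
```

Because the template is separate from the table, the same template list can be reused for another target with a similar ISA by supplying a different table.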
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61925208, 61906179, U19B2019, and U20A20227, the Strategic Priority Research Program of Chinese Academy of Sciences under Grant No. XDB32050200, Beijing Academy of Artificial Intelligence (BAAI), the Chinese Academy of Sciences (CAS) Project for Young Scientists in Basic Research (YSBR-029), and the Youth Innovation Promotion Association CAS.
Abstract: The emerging mobile robot industry has spurred a flurry of interest in solving the simultaneous localization and mapping (SLAM) problem. However, existing SLAM platforms have difficulty in meeting the real-time and low-power requirements imposed by mobile systems. Though specialized hardware is promising with regard to achieving high performance and lowering the power, designing an efficient accelerator for SLAM is severely hindered by the wide variety of SLAM algorithms. Based on our detailed analysis of representative SLAM algorithms, we observe that SLAM algorithms pose two challenges for designing efficient hardware accelerators: the large number of computational primitives and irregular control flows. To address these two challenges, we propose a hardware accelerator that features composable computation units classified as matrix, vector, scalar, and control units. In addition, we design a hierarchical instruction set for coping with a broad range of SLAM algorithms with irregular control flows. Experimental results show that, compared against an Intel x86 processor, our accelerator with an area of 7.41 mm² achieves on average 10.52x and 112.62x better performance and energy savings, respectively, across different datasets. Compared against a more energy-efficient ARM Cortex processor, our accelerator still achieves 33.03x and 62.64x better performance and energy savings, respectively.
Funding: Supported by the Beijing Natural Science Foundation under Grant No. JQ18013, the National Natural Science Foundation of China under Grant Nos. 61925208, 61732007, 61732002, and 61906179, the Strategic Priority Research Program of Chinese Academy of Sciences (CAS) under Grant No. XDB32050200, the Youth Innovation Promotion Association CAS, Beijing Academy of Artificial Intelligence (BAAI), and Xplore Prize.
Abstract: Dynamic neural network (NN) techniques are increasingly important because they facilitate deep learning techniques with more complex network architectures. However, existing studies, which predominantly optimize static computational graphs by static scheduling methods, usually focus on optimizing static neural networks in deep neural network (DNN) accelerators. We analyze the execution process of dynamic neural networks and observe that dynamic features introduce challenges for efficient scheduling and pipelining in existing DNN accelerators. We propose DyPipe, a holistic approach to optimizing dynamic neural network inference in enhanced DNN accelerators. DyPipe achieves significant performance improvements for dynamic neural networks while introducing negligible overhead for static neural networks. Our evaluation demonstrates that DyPipe achieves a 1.7x speedup on dynamic neural networks and maintains more than 96% performance for static neural networks.