With the rapid development of big data and artificial intelligence (AI), cloud platform architectures are continually being developed, optimized, and improved. New applications such as deep computing and high-performance computing therefore require more computing power. To meet this requirement, a non-uniform memory access (NUMA) configuration method is proposed for cloud computing systems, based on the affinity, adaptability, and availability of the NUMA-architecture processor platform. The proposed method is verified in a test environment built on a domestic central processing unit (CPU).
Due to the recent trend of software intelligence in the Fourth Industrial Revolution, deep learning has become a mainstream workload for modern computer systems. As the data size of deep learning keeps growing, managing limited memory capacity efficiently for deep learning workloads becomes important. In this paper, we analyze memory accesses in deep learning workloads and identify several characteristics that distinguish them from traditional workloads. First, when comparing instruction and data accesses, data access accounts for 96%–99% of total memory accesses in deep learning workloads, which is quite different from traditional workloads. Second, when comparing read and write accesses, write access dominates, accounting for 64%–80% of total memory accesses. Third, although write access makes up the majority of memory accesses, it shows a low access bias, with a Zipf parameter of 0.3. Fourth, in predicting re-access, recency is important for read access, but frequency provides more accurate information for write access. Based on these observations, we introduce a Non-Volatile Random Access Memory (NVRAM)-accelerated memory architecture for deep learning workloads and present a new memory management policy for this architecture. By considering the memory access characteristics of deep learning workloads, the proposed policy improves memory performance by 64.3% on average compared to the CLOCK policy.
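For readers unfamiliar with the CLOCK baseline named in the abstract above, the following is a minimal sketch of the textbook CLOCK (second-chance) replacement policy. It is not the policy proposed in the paper. The class name `ClockCache`, the `access` method, and the choice to set the reference bit when a page is first loaded are illustrative assumptions.

```python
class ClockCache:
    """Fixed-size page cache using the textbook CLOCK (second-chance) policy.

    Hypothetical illustration only; names and details are assumptions.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = [None] * capacity   # page id stored in each frame
        self.ref = [False] * capacity    # reference bit of each frame
        self.index = {}                  # page id -> frame number
        self.hand = 0                    # clock hand position

    def access(self, page):
        """Touch a page. Return True on a hit, False on a miss."""
        frame = self.index.get(page)
        if frame is not None:
            self.ref[frame] = True       # hit: give the page a second chance
            return True
        # Miss: sweep the hand, clearing reference bits, until a frame
        # whose bit is already clear is found. That frame is the victim.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.pages[self.hand]
        if victim is not None:
            del self.index[victim]
        self.pages[self.hand] = page
        self.ref[self.hand] = True       # assumption: bit is set on load
        self.index[page] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False


cache = ClockCache(3)
hits = [cache.access(p) for p in [1, 2, 3, 1, 4, 2, 5, 3]]
```

CLOCK tracks only recency, through a single reference bit per frame. The abstract's observation that frequency predicts write re-access better than recency suggests why a policy tailored to these workloads could outperform it.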
Most transactional memory (TM) research has focused on multi-core processors, and some has investigated clusters, leaving non-uniform memory access (NUMA) systems largely unexplored. Existing TM implementations suffer significant performance degradation on NUMA systems because they ignore the slower remote memory access. To solve this problem, a latency-based conflict detection method and a forecasting-based conflict prevention method are proposed. Using these techniques, a NUMA-aware TM system is presented. Experimental results show that, by reducing remote memory accesses and the transaction abort rate, the NUMA-aware strategies deliver good practical TM performance on NUMA systems.
Fast switching structures for memory access in clusters are studied, and three kinds of fast switching structures (FS, LR2 SS, and LAPS) are proposed. A mixed simulation test bench is constructed and used to collect statistics on data access delay for the three structures in various cases. Finally, the structures are implemented on a Xilinx FPGA development board, and the DCT, FFT, SAD, IME, FME, and de-blocking filtering algorithms are mapped onto them. Compared with existing architectures, the proposed structures have lower data access delay and smaller area.
Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel Data Layout strategy, aimed at restructuring the arrangement of data in memory to significantly enhance data access locality. The technique is applied to the dynamic programming algorithm for chained matrix multiplication, a critical operation across various domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT), where it enables more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique's broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.
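The dynamic programming algorithm that the abstract above targets can be sketched as follows. This is the standard sequential matrix-chain ordering recurrence, shown only to make the memory access pattern concrete. It is not the paper's CUDA code or its Data Layout strategy, and the function name `matrix_chain_cost` is an assumption.

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications needed to multiply a chain of matrices.

    Matrix i has shape dims[i] x dims[i + 1]. Standard textbook recurrence;
    hypothetical illustration, not the paper's implementation.
    """
    n = len(dims) - 1                    # number of matrices in the chain
    if n < 1:
        return 0
    cost = [[0] * n for _ in range(n)]   # cost[i][j]: best cost for i..j
    # Fill the table diagonal by diagonal: every sub-chain of a given
    # length depends only on strictly shorter sub-chains.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]


best = matrix_chain_cost([10, 30, 5, 60])
```

Each entry reads one partial row and one partial column of the table, and entries on the same diagonal are independent. In a plain row-major table those accesses are strided, which is one plausible reason a different data layout could improve locality on a GPU.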
Persistent memory (PM) allows file systems to directly persist data on the memory bus. To increase the capacity of PM file systems, building a file system across sockets, each with its own attached PM, is attractive. However, accessing data across sockets is subject to the effects of the non-uniform memory access (NUMA) architecture, which leads to significant performance degradation. In this paper, we first use experiments to understand the NUMA impacts on building PM file systems. We then propose four design principles for building NapFS, a high-performance PM file system for the NUMA architecture. We architect NapFS with per-socket local PM file systems and per-socket dedicated IO thread pools. This not only allows applications to delegate data accesses to IO threads to avoid remote PM accesses, but also fully reuses existing single-socket PM file systems to reduce implementation complexity. Additionally, NapFS uses fast DRAM to accelerate performance by adding a global cache, and adopts a selective cache mechanism to eliminate the redundant double-copy overhead for synchronization operations. Lastly, we show that NapFS can adopt extended optimizations to improve scalability and the performance of critical requests. We evaluate NapFS against other multi-socket PM file systems. The evaluation results show that NapFS achieves 2.2x and 1.0x throughput improvement for Filebench and RocksDB, respectively.
Magnetoresistive random access memory (MRAM) is a promising non-volatile memory technology that can be used as an energy- and space-efficient storage and computing solution, particularly for cache functions within circuits. Although MRAM has reached mass production, its manufacturing process remains challenging, with the result that only a few semiconductor companies dominate its production. In this review, we delve into the materials, processes, and devices used in MRAM, focusing on both the widely adopted spin-transfer torque MRAM and the next-generation spin-orbit torque MRAM. We provide an overview of their operational mechanisms and manufacturing technologies. Furthermore, we outline the major hurdles faced in MRAM manufacturing and propose potential solutions in detail. The applications of MRAM in artificial intelligence hardware are then introduced. Finally, we present an outlook on the future development and applications of MRAM.
Graph convolutional neural networks (GCNs) have emerged as an effective approach to extending deep learning to graph data analytics, but they are computationally challenging given the irregularity of graphs and the large number of nodes in a graph. GCNs involve chained sparse-dense matrix multiplications with six loops, which results in a large design space for GCN accelerators. Prior work on GCN acceleration either employs limited loop optimization techniques or determines the design variables by random sampling, which can hardly exploit data reuse efficiently and thus degrades system efficiency. To overcome this limitation, this paper proposes GShuttle, a GCN acceleration scheme that maximizes memory access efficiency to achieve high performance and energy efficiency. GShuttle systematically explores loop optimization techniques for GCN acceleration and quantitatively analyzes the design objectives (e.g., required DRAM accesses and SRAM accesses) by analytical calculation based on multiple design variables. GShuttle further employs two approaches, pruned search space sweeping and greedy search, to find the optimal design variables under given design constraints. We demonstrate the efficacy of GShuttle by evaluation on five widely used graph datasets. The experimental simulations show that GShuttle reduces the number of DRAM accesses by a factor of 1.5 and saves energy by a factor of 1.7 compared with state-of-the-art approaches.
As one of the most notorious classes of programming errors, memory access errors still harm modern software security. In particular, they are hidden deep in important software systems written in memory-unsafe languages such as C/C++. Much work has been proposed to detect bugs that lead to memory access errors. However, existing work cannot handle two challenges. First, it cannot tackle fine-grained memory access errors, e.g., data overflow inside one data structure. These errors are usually overlooked for a long time because they happen inside one memory block and do not crash the program. Second, most existing work relies on source code or debugging information to recover memory boundary information, so it cannot be applied directly to detecting memory access errors in binary code. Yet searching for memory access errors in binary code is a very common scenario in software vulnerability detection and exploitation. To overcome these challenges, we propose Memory Access Integrity (MAI), a dynamic method to detect fine-grained memory access errors in off-the-shelf binary executables. The core idea is to recover a fine-grained accessing policy between memory access behaviors and memory ranges, and then detect memory access errors based on that policy. The key insight of our work is that memory accessing patterns reveal information for recovering the boundaries of memory objects and the accessing policy. Based on this recovered information, our method maintains a new memory model to simulate the life cycle of memory objects and reports an error when any accessing policy is violated. We evaluate our tool on popular CTF datasets and real-world software. Compared with the state-of-the-art detection tool, the evaluation results demonstrate that our tool can detect fine-grained memory access errors effectively and efficiently. As a practical impact, our tool has detected three 0-day memory access errors in an audio decoder.
The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor. From the perspective of experimental safety and feasibility, Molecular Dynamics (MD) is an ideal method in the materials field for simulating the radiation damage of structural materials. Crystal-MD is a massively parallel MD simulation software package based on the key material characteristics of reactors. Compared with the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) and ITAP Molecular Dynamics (IMD) software, Crystal-MD reduces the memory required for operation to a certain extent, but it is very time-consuming. Moreover, the calculation results of Crystal-MD show large deviations, and problems such as memory limitation and frequent communication arise during its migration and optimization. In this paper, to solve these problems, the memory access mode of the Crystal-MD software is studied. Based on the memory access mode, a memory access optimization strategy is proposed for the unique architecture of China's Sunway TaihuLight supercomputer. The proposed strategy is verified experimentally, and the results show that it significantly increases the running speed of Crystal-MD.
General-purpose graphics processing units (GPGPUs) can improve computing performance considerably for regular applications. However, irregular memory access exists in many applications, and the benefits of graphics processing units (GPUs) are less substantial for irregular applications. In recent years, several studies have presented solutions to remove static irregular memory access, but eliminating dynamic irregular memory access in software remains a serious challenge. A pure software solution, without hardware extensions or offline profiling, is proposed to eliminate dynamic irregular memory access, especially indirect memory access. Data reordering and index redirection are suggested to reduce the number of memory transactions, thereby improving the performance of GPU kernels. To improve the efficiency of data reordering, the reordering operation is offloaded to the GPU, which reduces its overhead and the associated data transfer. By concurrently executing the compute unified device architecture (CUDA) streams of the data reordering and data processing kernels, the overhead of data reordering can be further reduced. After these optimizations, the volume of memory transactions is reduced by 16.7%–50% compared with CUSPARSE-based benchmarks, and the performance of irregular kernels is improved by 9.64%–34.9% on an NVIDIA Tesla P4 GPU.
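The idea of data reordering with index redirection mentioned in the abstract above can be illustrated for a simple gather `out[i] = data[idx[i]]`. The sketch below is a host-side Python illustration under assumed semantics (reorder elements by first use, then rewrite the indices). It is not the paper's GPU-offloaded implementation, and the function name `reorder_and_redirect` is hypothetical.

```python
def reorder_and_redirect(data, idx):
    """Reorder `data` by first use in `idx` and rewrite the indices.

    Returns (reordered, redirected) such that
    reordered[redirected[i]] == data[idx[i]] for every i.
    Hypothetical illustration; assumed semantics, not the paper's code.
    """
    new_pos = {}        # old position in data -> position in reordered
    reordered = []      # elements laid out in order of first access
    redirected = []     # indices into the reordered array
    for old in idx:
        if old not in new_pos:
            new_pos[old] = len(reordered)
            reordered.append(data[old])
        redirected.append(new_pos[old])
    return reordered, redirected


data = [10, 20, 30, 40, 50]
idx = [4, 0, 4, 2]
reordered, redirected = reorder_and_redirect(data, idx)
gathered = [reordered[i] for i in redirected]
```

After redirection, first-time accesses walk the reordered array front to back, so neighbouring threads tend to touch neighbouring addresses. That is the property that lets a GPU serve a group of accesses with fewer memory transactions.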
The Computational Fluid Dynamics-Discrete Element Method is used to model gas-solid systems in several applications in the energy, pharmaceutical, and petrochemical industries. Computational performance bottlenecks often limit the problem sizes that can be simulated at industrial scale. The data structures used to store several million particles in such large-scale simulations have a large memory footprint that does not fit into the processor cache hierarchies of current high-performance computing platforms, leading to reduced computational performance. This paper specifically addresses memory access bottlenecks in industrial-scale simulations. The use of space-filling curves to improve memory access patterns is described, and their impact on computational performance is quantified in both shared- and distributed-memory parallelization paradigms. The Morton space-filling curve applied to uniform grids and k-dimensional tree partitions is used to reorder the particle data structure, thus improving spatial and temporal locality in memory. The performance impact of these techniques on two benchmark problems, the homogeneous cooling system and a fluidized bed, is presented. These optimization techniques lead to an approximately two-fold performance improvement in particle-focused operations such as neighbor-list creation and data exchange, with ~1.5 times overall improvement in a fluidization simulation with 1.27 million particles.
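The Morton reordering described in the abstract above can be sketched as follows for particles on a uniform grid. This is a minimal illustration that assumes non-negative coordinates and a 10-bit cell index per axis. The function names `morton3d` and `morton_order` and the `cell_size` parameter are hypothetical and not taken from the paper.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three cell indices into one key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)        # x bit goes to position 3b
        key |= ((iy >> b) & 1) << (3 * b + 1)    # y bit goes to position 3b+1
        key |= ((iz >> b) & 1) << (3 * b + 2)    # z bit goes to position 3b+2
    return key


def morton_order(positions, cell_size):
    """Return particle indices sorted by the Morton key of their grid cell.

    Assumes all coordinates are non-negative (illustrative assumption).
    """
    def key(i):
        x, y, z = positions[i]
        return morton3d(int(x // cell_size),
                        int(y // cell_size),
                        int(z // cell_size))
    return sorted(range(len(positions)), key=key)


positions = [(3.5, 0.0, 0.0), (0.2, 0.1, 0.3), (1.5, 0.5, 0.5), (0.5, 1.5, 0.2)]
order = morton_order(positions, cell_size=1.0)
sorted_positions = [positions[i] for i in order]
```

Storing particle arrays in this order places particles from nearby grid cells close together in memory, so operations that visit spatial neighbours, such as neighbor-list creation, tend to reuse cache lines that are already loaded.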
Using three-dimensional (3D) technology computer-aided design simulation, the recovery mechanism of single event upset and the effects of spacing and hit angle on the recovery are studied. It is found that multi-node charge collection plays a key role in recovery: when charge sharing is shielded by adding guard rings, the recovery effect does not appear. It is also shown that the upset linear energy transfer (LET) threshold stays constant while the recovery LET threshold increases as the spacing increases. Additionally, the effect of incident angle on recovery is analysed, and it is shown that a larger angle brings about a stronger charge sharing effect, thus strengthening the recovery ability.
The era of information explosion is coming, and information needs to be continuously stored and randomly accessed over long periods, which constitutes an insurmountable challenge for existing data centers. At present, computing devices use the von Neumann architecture with separate computing and memory units, which exposes the shortcoming known as the "memory bottleneck". A nonvolatile memristor can realize data storage and in-memory computing at the same time and promises to overcome this bottleneck. Phase-change random access memory (PCRAM) is regarded as one of the best solutions for next-generation non-volatile memory. Owing to its high speed, good data retention, high density, and low power consumption, PCRAM has broad commercial prospects in in-memory computing applications. In this review, the research progress of phase-change materials and device structures for PCRAM is reviewed, along with the most critical performance metrics for a universal memory, such as speed, capacity, and power consumption. By comparing the advantages and disadvantages of phase-change optical disks and PCRAM, a new concept of optoelectronic hybrid storage based on phase-change materials is proposed. Furthermore, its feasibility as a universal memory to replace existing memory technologies is discussed.
An optimized device structure with a blade-type like (BTL) phase change layer is proposed for reducing the RESET current of phase-change random access memory (PCRAM). The electrical-thermal behavior of the BTL cell and of the blade heater contactor structure during the RESET operation is analyzed by three-dimensional finite element modeling and compared. The simulation results show that the programming region of the phase change layer in the BTL cell is much smaller, and that the thermal and electrical distributions of the BTL cell are more concentrated at the TiN/GST interface. The results indicate that the BTL cell offers higher heating efficiency, lower power consumption, and a RESET current reduced from 0.67 mA to 0.32 mA. The BTL cell is therefore suitable for high-performance PCRAM devices with lower power consumption and lower RESET current.
In this letter, Ta/HfO/BN/TiN resistive switching devices are fabricated, and they exhibit low power consumption and high uniformity. The reset current of the HfO/BN bilayer device is reduced compared with that of the Ta/HfO/TiN structure. Furthermore, the reset current decreases with increasing BN thickness. The HfO layer is the dominant switching layer, while the low-permittivity, high-resistivity BN layer acts as a barrier to electron injection into the TiN electrode. The current conduction mechanism of the low resistance state in the HfO/BN bilayer device is space-charge-limited current (SCLC), whereas it is Ohmic conduction in the HfO device.
This paper investigates the phase change material Si1Sb2Te3 for application in chalcogenide random access memory. Current-voltage measurements were conducted to determine the threshold current of the phase change from the amorphous phase to the polycrystalline phase. The film has a threshold current of about 0.155 mA, which is smaller than the value of 0.31 mA for Ge2Sb2Te5 film. Amorphous Si1Sb2Te3 changes to a face-centred-cubic structure at ~180 °C and to a hexagonal structure at ~270 °C. The dependence of the electrical resistivity of Si1Sb2Te3 film on annealing temperature was studied by the four-point probe method. Data retention of the films was characterized as well.
Recent progress in magnetic tunnel junctions with perpendicular magnetic anisotropy (PMA) is reviewed and summarized. First, the concept and origin of PMA are introduced. Next, a historical overview is given of PMA materials used as magnetic electrodes, such as the RE-TM alloys TbFeCo and GdFeCo, novel tetragonal manganese alloys Mn-Ga, L10-ordered (Co, Fe)/Pt alloy, multilayer films [Co, Fe, CoFe/Pt, Pd, Ni, Au]N, and ultra-thin magnetic metal/oxidized barrier structures. The remainder of the article focuses on the optimization and fabrication of CoFeB/MgO/CoFeB p-MTJs, which are thought to have high potential to meet the main demands of non-volatile magnetic random access memory.
Metal phthalocyanine is considered one of the most promising candidates for the design and fabrication of flexible resistive random access memory (RRAM) devices due to its intrinsic flexibility and excellent functionality. However, performance degradation and the lack of multi-level capability, which can directly expand the storage capacity of one memory cell without sacrificing additional layout area, are the primary obstacles to the use of metal phthalocyanine RRAMs in information storage. Here, a flexible RRAM with pristine nickel phthalocyanine (NiPc) as the resistive layer is reported for multi-level data storage. Due to its high trap concentration, the charge transport behavior of the device agrees well with classical trap-controlled space-charge-limited conduction, leading to excellent performance, including a high on-off current ratio of 10^7, long-term retention of 10^6 s, reproducible endurance over 6000 cycles, long-term flexibility at a bending strain of 0.6%, a write speed of 50 ns under sequential bias pulses, and the capability of multi-level data storage with reliable retention and uniformity.
Funding (NUMA configuration method for cloud computing systems): National Key Research and Development Program of China (No. 2017YFC0212100); National High-tech R&D Program of China (No. 2015AA015308).
Funding (memory access analysis of deep learning workloads): supported in part by the National Research Foundation of Korea (NRF) grant (No. 2019R1A2C1009275) and by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub).
Funding (NUMA-aware transactional memory): Projects (61003075, 61170261) supported by the National Natural Science Foundation of China.
Funding (memory access fast switching structures): supported by the National Natural Science Foundation of China (61272120, 61634004, 61602377), the Shaanxi Provincial Co-ordination Innovation Project of Science and Technology (2016KTZDGY02-04-02), and the Scientific Research Program Funded by Shaanxi Provincial Education Department (17JK0689).
Funding: Supported by the Major Research Plan of the National Natural Science Foundation of China under Grant No. 92270202 and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDB44030200.
Abstract: Persistent memory (PM) allows file systems to persist data directly on the memory bus. To increase the capacity of PM file systems, building a file system across sockets, each with its own attached PM, is attractive. However, accessing data across sockets is subject to the effects of the non-uniform memory access (NUMA) architecture, which lead to significant performance degradation. In this paper, we first use experiments to understand the NUMA impacts on building PM file systems. We then propose four design principles for building NapFS, a high-performance PM file system for the NUMA architecture. We architect NapFS with per-socket local PM file systems and per-socket dedicated IO thread pools. This not only allows applications to delegate data accesses to IO threads to avoid remote PM accesses, but also fully reuses existing single-socket PM file systems to reduce implementation complexity. Additionally, NapFS uses fast DRAM to accelerate performance by adding a global cache, and adopts a selective cache mechanism to eliminate the redundant double-copy overhead for synchronization operations. Lastly, we show that NapFS can adopt extended optimizations to improve scalability and the performance of critical requests. We evaluate NapFS against other multi-socket PM file systems. The evaluation results show that NapFS achieves 2.2x and 1.0x throughput improvement for Filebench and RocksDB, respectively.
Funding: Supported in part by the Youth Innovation Promotion Association of Chinese Academy of Sciences (CAS) under Grant 2020118, the Beijing Nova Program under Grant 20230484358, and the Beijing Superstring Academy of Memory Technology under Grant No. E2DF06X003.
Abstract: Magnetoresistive random access memory (MRAM) is a promising non-volatile memory technology that can serve as an energy- and space-efficient storage and computing solution, particularly for cache functions within circuits. Although MRAM has reached mass production, its manufacturing process remains challenging, so only a few semiconductor companies dominate its production. In this review, we delve into the materials, processes, and devices used in MRAM, focusing on both the widely adopted spin transfer torque MRAM and the next-generation spin-orbit torque MRAM. We provide an overview of their operational mechanisms and manufacturing technologies. Furthermore, we outline the major hurdles faced in MRAM manufacturing and propose potential solutions in detail. Then, the applications of MRAM in artificial intelligence hardware are introduced. Finally, we present an outlook on the future development and applications of MRAM.
Funding: Supported by the U.S. National Science Foundation under Grant Nos. CCF-2131946, CCF-1953980, and CCF-1702980.
Abstract: Graph convolutional neural networks (GCNs) have emerged as an effective approach to extending deep learning for graph data analytics, but they are computationally challenging given the irregular graphs and the large number of nodes in a graph. GCNs involve chain sparse-dense matrix multiplications with six loops, which results in a large design space for GCN accelerators. Prior work on GCN acceleration either employs limited loop optimization techniques or determines the design variables based on random sampling, which can hardly exploit data reuse efficiently and thus degrades system efficiency. To overcome this limitation, this paper proposes GShuttle, a GCN acceleration scheme that maximizes memory access efficiency to achieve high performance and energy efficiency. GShuttle systematically explores loop optimization techniques for GCN acceleration, and quantitatively analyzes the design objectives (e.g., required DRAM accesses and SRAM accesses) by analytical calculation based on multiple design variables. GShuttle further employs two approaches, pruned search space sweeping and greedy search, to find the optimal design variables under certain design constraints. We demonstrate the efficacy of GShuttle by evaluation on five widely used graph datasets. The experimental simulations show that GShuttle reduces the number of DRAM accesses by a factor of 1.5 and saves energy by a factor of 1.7 compared with state-of-the-art approaches.
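The abstract above notes that a GCN layer chains sparse-dense matrix multiplications. As a rough illustration of why the execution order alone already matters, the sketch below counts multiply-accumulate operations for the two ways of evaluating the product A·X·W. This is not GShuttle's analytical model: the function name, parameters, and the assumption of a sparse adjacency matrix A with dense X and W are all introduced here for illustration.

```python
def gcn_layer_macs(nnz_a, n_nodes, f_in, f_out, order):
    """Multiply-accumulate count for one GCN layer computing A @ X @ W.

    Assumptions (illustrative only):
      A is n_nodes x n_nodes and sparse with nnz_a non-zeros,
      X is n_nodes x f_in and dense,
      W is f_in x f_out and dense.
    """
    if order == "(AX)W":
        # Sparse aggregation over f_in columns, then a dense product.
        return nnz_a * f_in + n_nodes * f_in * f_out
    if order == "A(XW)":
        # Dense feature transform first, then aggregation over f_out columns.
        return n_nodes * f_in * f_out + nnz_a * f_out
    raise ValueError("order must be '(AX)W' or 'A(XW)'")
```

When `f_out` is smaller than `f_in`, which is common for layers that shrink the feature dimension, `A(XW)` aggregates over fewer columns and therefore needs fewer operations under these assumptions. A full design-space exploration would also account for loop tiling and on-chip buffer reuse, which this count ignores.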
Abstract: As one of the most notorious classes of programming errors, memory access errors still hurt modern software security. In particular, they are hidden deep in important software systems written in memory-unsafe languages like C/C++. Much work has been proposed to detect bugs leading to memory access errors. However, existing work cannot handle two challenges. First, it cannot tackle fine-grained memory access errors, e.g., data overflow inside one data structure. These errors are usually overlooked for a long time since they happen inside one memory block and do not lead to a program crash. Second, most existing work relies on source code or debugging information to recover memory boundary information, so it cannot be directly applied to detecting memory access errors in binary code. However, searching for memory access errors in binary code is a very common scenario in software vulnerability detection and exploitation. To overcome these challenges, we propose Memory Access Integrity (MAI), a dynamic method to detect fine-grained memory access errors in off-the-shelf binary executables. The core idea is to recover a fine-grained accessing policy between memory access behaviors and memory ranges, and then detect memory access errors based on the policy. The key insight in our work is that memory accessing patterns reveal information for recovering the boundaries of memory objects and the accessing policy. Based on this recovered information, our method maintains a new memory model to simulate the life cycle of memory objects and reports errors when any accessing policy is violated. We evaluate our tool on popular CTF datasets and real-world software. Compared with the state-of-the-art detection tool, the evaluation results demonstrate that our tool can detect fine-grained memory access errors effectively and efficiently. As a practical impact, our tool has detected three 0-day memory access errors in an audio decoder.
Funding: Supported by the National Key R&D Program of China (No. 2017YFB0202003).
Abstract: The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor. From the perspective of experimental safety and feasibility, Molecular Dynamics (MD) in the materials field is an ideal method for simulating the radiation damage of structural materials. Crystal-MD is a massively parallel MD simulation software based on the key material characteristics of reactors. Compared with the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) and ITAP Molecular Dynamics (IMD) software, Crystal-MD reduces the memory required for software operation to a certain extent, but it is very time-consuming. Moreover, the calculation results of Crystal-MD have large deviations, and there are also problems such as memory limitation and frequent communication during its migration and optimization. In this paper, to solve the above problems, the memory access mode of the Crystal-MD software is studied. Based on the memory access mode, a memory access optimization strategy is proposed for the unique architecture of China's Sunway TaihuLight supercomputer. The proposed optimization strategy is verified by experiments, and the experimental results show that it significantly increases the running speed of Crystal-MD.
Funding: Project supported by the National Key Research and Development Program of China (No. 2018YFB1003500).
Abstract: General purpose graphics processing units (GPGPUs) can be used to improve computing performance considerably for regular applications. However, irregular memory access exists in many applications, and the benefits of graphics processing units (GPUs) are less substantial for irregular applications. In recent years, several studies have presented solutions to remove static irregular memory access. However, eliminating dynamic irregular memory access with software remains a serious challenge. A pure software solution without hardware extensions or offline profiling is proposed to eliminate dynamic irregular memory access, especially indirect memory access. Data reordering and index redirection are suggested to reduce the number of memory transactions, thereby improving the performance of GPU kernels. To improve the efficiency of data reordering, the reordering operation is offloaded to the GPU to reduce overhead and data transfer. By concurrently executing the compute unified device architecture (CUDA) streams of data reordering and the data processing kernel, the overhead of data reordering can be reduced. After these optimizations, the volume of memory transactions can be reduced by 16.7%-50% compared with cuSPARSE-based benchmarks, and the performance of irregular kernels can be improved by 9.64%-34.9% using an NVIDIA Tesla P4 GPU.
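The data reordering and index redirection mentioned in the abstract above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the paper's GPU implementation: data is copied in first-touch order of an indirect access `data[indices[t]]`, and the index array is rewritten to point into the copy. The function names and the simplified transaction model (one transaction per distinct aligned segment touched by a group of threads) are hypothetical.

```python
def reorder_and_redirect(data, indices):
    """Return (reordered, redirected) with
    reordered[redirected[t]] == data[indices[t]] for every t.

    Elements are copied in the order they are first accessed, so
    consecutive accesses tend to hit neighbouring addresses.
    Elements that are never referenced are not copied.
    """
    new_pos = {}      # old position -> position in the reordered copy
    reordered = []
    redirected = []
    for old in indices:
        if old not in new_pos:
            new_pos[old] = len(reordered)
            reordered.append(data[old])
        redirected.append(new_pos[old])
    return reordered, redirected


def count_transactions(indices, warp_size=32, segment_elems=32):
    """Simplified model: each group of warp_size accesses costs one
    transaction per distinct aligned segment of segment_elems elements."""
    total = 0
    for start in range(0, len(indices), warp_size):
        group = indices[start:start + warp_size]
        total += len({idx // segment_elems for idx in group})
    return total
```

Comparing `count_transactions(indices)` with `count_transactions(redirected)` gives a rough feel for how much a scattered access pattern can be compacted. On a real GPU, the cost of building the reordered copy has to be paid as well, which is why the abstract overlaps reordering with the processing kernel.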
Abstract: The Computational Fluid Dynamics-Discrete Element Method is used to model gas-solid systems in several applications in the energy, pharmaceutical, and petrochemical industries. Computational performance bottlenecks often limit the problem sizes that can be simulated at industrial scale. The data structures used to store several million particles in such large-scale simulations have a large memory footprint that does not fit into the processor cache hierarchies of current high-performance-computing platforms, leading to reduced computational performance. This paper specifically addresses this aspect of memory access bottlenecks in industrial-scale simulations. The use of space-filling curves to improve memory access patterns is described, and their impact on computational performance is quantified in both shared and distributed memory parallelization paradigms. The Morton space-filling curve applied to uniform grids and k-dimensional tree partitions is used to reorder the particle data structure, thus improving spatial and temporal locality in memory. The performance impact of these techniques when applied to two benchmark problems, namely the homogeneous cooling system and a fluidized bed, is presented. These optimization techniques lead to an approximately two-fold performance improvement in particle-focused operations such as neighbor-list creation and data exchange, with about 1.5 times overall improvement in a fluidization simulation with 1.27 million particles.
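The Morton-curve reordering on a uniform grid described in the abstract above can be sketched as follows. This is a minimal illustration, assuming non-negative particle coordinates and a grid with fewer than `2**bits` cells per axis. The names `morton3` and `morton_order` are hypothetical, and the k-dimensional tree variant is not shown.

```python
def morton3(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three cell indices into one key.

    Bit b of ix, iy, iz lands at positions 3b, 3b+1, 3b+2 of the result,
    so cells that are close in space tend to get close keys.
    """
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def morton_order(positions, cell_size, bits=10):
    """Return particle indices sorted by the Morton key of their grid cell.

    positions is a list of (x, y, z) tuples with non-negative coordinates
    (an assumption of this sketch); cell_size is the uniform grid spacing.
    """
    def key(p):
        x, y, z = positions[p]
        return morton3(int(x // cell_size),
                       int(y // cell_size),
                       int(z // cell_size),
                       bits)
    return sorted(range(len(positions)), key=key)
```

Applying the returned permutation to every per-particle array (positions, velocities, forces) places spatial neighbours near each other in memory, which is what helps neighbor-list construction stay within cache. Because particles move, the sort would need to be repeated periodically; how often is a tuning choice not covered here.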
Funding: Supported by the State Key Program of the National Natural Science Foundation of China (Grant No. 60836004) and the National Natural Science Foundation of China (Grant Nos. 61076025 and 61006070).
Abstract: Using computer-aided design three-dimensional (3D) simulation technology, the recovery mechanism of single event upset and the effects of spacing and hit angle on the recovery are studied. It is found that multi-node charge collection plays a key role in recovery: when charge sharing is shielded by adding guard rings, the recovery effect does not appear. It is also shown that the upset linear energy transfer (LET) threshold stays constant while the recovery LET threshold increases as the spacing increases. Additionally, the effect of incident angle on recovery is analysed, and it is shown that a larger angle brings about a stronger charge sharing effect, thus strengthening the recovery ability.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 21773291, 61904118, and 22002102), the Natural Science Foundation of Jiangsu Province, China (Grant Nos. BK20190935 and BK20190947), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant Nos. 19KJA210005, 19KJB510012, 19KJB120005, and 19KJB430034), the Fund from the Suzhou Key Laboratory for Nanophotonic and Nanoelectronic Materials and Its Devices (Grant No. SZS201812), the Science Fund from the Jiangsu Key Laboratory for Environment Functional Materials, and the State Key Laboratory of Transducer Technology, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences.
Abstract: The era of information explosion is coming, and information needs to be continuously stored and randomly accessed over long periods, which poses an insurmountable challenge for existing data centers. At present, computing devices use the von Neumann architecture with separate computing and memory units, which exposes the "memory bottleneck". Nonvolatile memristors can realize data storage and in-memory computing at the same time and promise to overcome this bottleneck. Phase-change random access memory (PCRAM) is regarded as one of the best solutions for next-generation non-volatile memory. Owing to its high speed, good data retention, high density, and low power consumption, PCRAM has broad commercial prospects in in-memory computing applications. In this review, research progress on phase-change materials and device structures for PCRAM is reviewed, together with the most critical performance metrics for a universal memory, such as speed, capacity, and power consumption. By comparing the advantages and disadvantages of phase-change optical disks and PCRAM, a new concept of optoelectronic hybrid storage based on phase-change material is proposed. Furthermore, its feasibility as a universal memory replacing existing memory technologies is discussed.
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA09020402, the National Integrate Circuit Research Program of China under Grant No. 2009ZX02023-003, the National Natural Science Foundation of China under Grant Nos. 61261160500, 61376006, 61401444, and 61504157, and the Science and Technology Council of Shanghai under Grant Nos. 14DZ2294900, 15DZ2270900, and 14ZR1447500.
Abstract: An optimized device structure with a blade-type like (BTL) phase change layer is proposed for reducing the RESET current of phase-change random access memory (PCRAM). The electrical-thermal behavior of the BTL cell and of the blade heater contactor structure during the RESET operation is compared using three-dimensional finite element modeling. The simulation results show that the programming region of the phase change layer in the BTL cell is much smaller, and that the thermal and electrical distributions of the BTL cell are more concentrated on the TiN/GST interface. The results indicate that the BTL cell increases the heating efficiency, decreases the power consumption, and reduces the RESET current from 0.67 mA to 0.32 mA. Therefore, the BTL cell is appropriate for high-performance PCRAM devices with lower power consumption and lower RESET current.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61274113, 11204212, 61404091, 51502203, and 51502204), the Tianjin Natural Science Foundation, China (Grant Nos. 14JCZDJC31500 and 14JCQNJC00800), and the Tianjin Science and Technology Developmental Funds of Universities and Colleges, China (Grant No. 20130701).
Abstract: In this letter, Ta/HfO/BN/TiN resistive switching devices are fabricated, and they exhibit low power consumption and high uniformity. The reset current of the HfO/BN bilayer device is reduced compared with that of the Ta/HfO/TiN structure. Furthermore, the reset current decreases with increasing BN thickness. The HfO layer is the dominant switching layer, while the low-permittivity and high-resistivity BN layer acts as a barrier to electron injection into the TiN electrode. The current conduction mechanism of the low resistance state in the HfO/BN bilayer device is space-charge-limited current (SCLC), while it is Ohmic conduction in the HfO device.
Abstract: This paper investigates the phase change material Si1Sb2Te3 for application in chalcogenide random access memory. Current-voltage measurements were conducted to determine the threshold current of the phase change from the amorphous phase to the polycrystalline phase. The film has a threshold current of about 0.155 mA, which is smaller than the value of 0.31 mA for Ge2Sb2Te5 film. Amorphous Si1Sb2Te3 changes to a face-centred-cubic structure at about 180℃ and to a hexagonal structure at about 270℃. The dependence of the electric resistivity of Si1Sb2Te3 film on annealing temperature was studied by the four-point probe method. Data retention of the films was characterized as well.
Funding: Supported by the State Key Project of Fundamental Research of Ministry of Science and Technology, China (Grant No. 2010CB934400) and the National Natural Science Foundation of China (Grant Nos. 51229101 and 11374351).
Abstract: Recent progress in magnetic tunnel junctions with perpendicular magnetic anisotropy (PMA) is reviewed and summarized. First, the concept and source of PMA are introduced. Next, a historical overview is offered of PMA materials used as magnetic electrodes, such as the RE-TM alloys TbFeCo and GdFeCo, novel tetragonal manganese alloys Mn-Ga, L10-ordered (Co, Fe)/Pt alloy, multilayer film [Co, Fe, CoFe/Pt, Pd, Ni, Au]N, and ultra-thin magnetic metal/oxidized barrier. The rest of the article focuses on the optimization and fabrication of CoFeB/MgO/CoFeB p-MTJs, which are thought to have high potential to meet the main demands of non-volatile magnetic random access memory.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61574143, 61704175, 51502304), the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDB30000000), the Key Research Program of Frontier Sciences of the Chinese Academy of Sciences (No. ZDBS-LY-JSC027), the Liaoning Revitalization Talents Program (No. XLYC1807109), and the National Key Research and Development Program of China (2016YFB0401104).
Abstract: Metal phthalocyanine is considered one of the most promising candidates for the design and fabrication of flexible resistive random access memory (RRAM) devices due to its intrinsic flexibility and excellent functionality. However, performance degradation and the lack of multi-level capability, which can directly expand the storage capacity of one memory cell without sacrificing additional layout area, are the primary obstacles to the use of metal phthalocyanine RRAMs in information storage. Here, a flexible RRAM with pristine nickel phthalocyanine (NiPc) as the resistive layer is reported for multi-level data storage. Due to its high trap concentration, the charge transport behavior of the device agrees well with classical space-charge-limited conduction controlled by traps, leading to excellent performance, including a high on-off current ratio of 10^7, a long-term retention of 10^6 s, a reproducible endurance of over 6000 cycles, long-term flexibility at a bending strain of 0.6%, a write speed of 50 ns under sequential bias pulses, and the capability of multi-level data storage with reliable retention and uniformity.