An in-memory storage system provides submillisecond latency and improves the concurrency of user applications by caching data into memory from external storage. Fault tolerance of in-memory storage systems is essential, as the loss of cached data requires access to data from external storage, which markedly increases the response latency. Typically, replication and erasure coding (EC) are two fault-tolerant schemes that pose different trade-offs between access performance and storage usage. To help make the best performance and space trade-off, we design ElasticMem, a hybrid fault-tolerant distributed in-memory storage system that supports elastic redundancy transition to dynamically change the fault-tolerant scheme. ElasticMem exploits a novel EC-oriented replication (EOR) that carefully designs the data placement of replication according to the future data layout of EC to enhance the I/O efficiency of redundancy transition. ElasticMem solves the consistency problem caused by concurrent data accesses via a lightweight table-based scheme combined with data bypassing: it detects correlated read and write requests and serves subsequent read requests with local data. We implement a prototype of ElasticMem based on Memcached. Experiments show that ElasticMem remarkably reduces the time of redundancy transition, the overall latency of correlated concurrent data accesses, and the latency of individual data accesses among them.
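The idea behind EC-oriented replication is that replicas are placed where the future EC stripe will need them, so the transition becomes a local re-encode rather than a network shuffle. Below is a minimal placement sketch of that idea; the node-selection rule, stripe parameters, and function names are illustrative assumptions, not ElasticMem's exact policy.

# Sketch: place replicas of each block so a future RS(k, m) stripe needs no data movement.
# Mapping (data block i of a stripe -> node i, backups co-located with future parity nodes)
# is an illustrative assumption, not the paper's exact placement.

def eor_placement(stripe_id, k=4, m=2, num_nodes=8):
    """Return (primary, backup) node ids for each of the k blocks in a stripe."""
    placement = {}
    base = (stripe_id * (k + m)) % num_nodes
    for i in range(k):
        primary = (base + i) % num_nodes           # where the EC data block will live
        backup = (base + k + (i % m)) % num_nodes  # co-located with a future parity node
        placement[f"stripe{stripe_id}-blk{i}"] = (primary, backup)
    return placement

def transition_to_ec(placement):
    """On transition, primaries already hold the EC data layout; only parity must be computed."""
    return {"data_nodes": sorted({p for p, _ in placement.values()}),
            "parity_nodes": sorted({b for _, b in placement.values()})}

if __name__ == "__main__":
    p = eor_placement(stripe_id=3)
    print(p)
    print(transition_to_ec(p))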
The conventional computing architecture faces substantial challenges, including high latency and energy consumption between memory and processing units. In response, in-memory computing has emerged as a promising alternative architecture, enabling computing operations within memory arrays to overcome these limitations. Memristive devices have gained significant attention as key components for in-memory computing due to their high-density arrays, rapid response times, and ability to emulate biological synapses. Among these devices, two-dimensional (2D) material-based memristor and memtransistor arrays have emerged as particularly promising candidates for next-generation in-memory computing, thanks to their exceptional performance driven by the unique properties of 2D materials, such as layered structures, mechanical flexibility, and the capability to form heterojunctions. This review delves into the state-of-the-art research on 2D material-based memristive arrays, encompassing critical aspects such as material selection, device performance metrics, array structures, and potential applications. Furthermore, it provides a comprehensive overview of the current challenges and limitations associated with these arrays, along with potential solutions. The primary objective of this review is to serve as a significant milestone in realizing next-generation in-memory computing utilizing 2D materials and bridge the gap from single-device characterization to array-level and system-level implementations of neuromorphic computing, leveraging the potential of 2D material-based memristive devices.
Artificial intelligence (AI) processes data-centric applications with minimal effort. However, it poses new challenges to system design in terms of computational speed and energy efficiency. The traditional von Neumann architecture cannot meet the requirements of heavily data-centric applications due to the separation of computation and storage. The emergence of computing in-memory (CIM) is significant in circumventing the von Neumann bottleneck. A commercialized memory architecture, static random-access memory (SRAM), is fast and robust, consumes less power, and is compatible with state-of-the-art technology. This study investigates the research progress of SRAM-based CIM technology at three levels: circuit, function, and application. It also outlines the problems, challenges, and prospects of SRAM-based CIM macros.
Facing the computing demands of the Internet of Things (IoT) and artificial intelligence (AI), the cost induced by moving data between the central processing unit (CPU) and memory is the key problem, and a chip featuring flexible structural units, ultra-low power consumption, and huge parallelism will be needed. In-memory computing, a non-von Neumann architecture fusing memory units and computing units, can eliminate the data transfer time and energy consumption while performing massive parallel computations. Prototype in-memory computing schemes modified from different memory technologies have shown orders-of-magnitude improvements in computing efficiency, leading it to be regarded as the ultimate computing paradigm. Here we review the state-of-the-art memory device technologies with potential for in-memory computing, summarize their versatile applications in neural networks, stochastic generation, and hybrid-precision digital computing, with promising solutions for unprecedented computing tasks, and also discuss the challenges of stability and integration for general in-memory computing.
The “memory wall” of traditional von Neumann computing systems severely restricts the efficiency of data-intensive task execution, while in-memory computing (IMC) architecture is a promising approach to breaking the bottleneck. Although variations and instability in ultra-scaled memory cells seriously degrade the calculation accuracy in IMC architectures, stochastic computing (SC) can compensate for these shortcomings due to its low sensitivity to cell disturbances. Furthermore, massive parallel computing can be processed to improve the speed and efficiency of the system. In this paper, by designing logic functions in NOR flash arrays, SC in IMC for image edge detection is realized, demonstrating ultra-low computational complexity and power consumption (25.5 fJ/pixel at 2-bit sequence length). More impressively, the noise immunity is 6 times higher than that of the traditional binary method, showing good tolerance to cell variation and reliability degradation when implementing massive parallel computation in the array.
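Stochastic computing encodes a value as the probability of a 1 in a bitstream, so multiplication reduces to a bitwise AND of two streams, which is why it tolerates individual cell errors. A small software sketch of that encoding and multiplication step follows; the stream length and input values are arbitrary, and the NOR-flash mapping from the paper is not modeled.

import random

def to_stream(p, length=256, rng=random.Random(0)):
    """Encode a probability p in [0, 1] as a random bitstream of the given length."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(sa, sb):
    """Unipolar SC multiplication: bitwise AND; the result's 1-density approximates a*b."""
    prod = [x & y for x, y in zip(sa, sb)]
    return sum(prod) / len(prod)

a, b = 0.75, 0.5
approx = sc_multiply(to_stream(a), to_stream(b))
print(f"exact {a*b:.3f}  stochastic {approx:.3f}")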
In this paper, a Distributed In-Memory Database (DIMDB) system is proposed to improve processing efficiency in mass data applications. The system uses an enhanced language similar to Structured Query Language (SQL) with a key-value storage schema. The design goals of the DIMDB system are described and its system architecture is discussed. The operation flow and the enhanced SQL-like language are also discussed, and experimental results are used to test the validity of the system.
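An SQL-like front end over a key-value schema boils down to mapping point queries onto gets over composite keys and inserts onto puts. The toy mapping below is an assumption for illustration only; the key encoding and class names are not DIMDB's actual language or storage layout.

class KVTable:
    """Toy key-value backing store: rows keyed by 'table:pk', values are column dicts."""
    def __init__(self):
        self.store = {}

    def put(self, table, pk, row):
        self.store[f"{table}:{pk}"] = row

    def select(self, table, pk, columns=None):
        row = self.store.get(f"{table}:{pk}")
        if row is None or columns is None:
            return row
        return {c: row[c] for c in columns if c in row}

db = KVTable()
db.put("orders", "1001", {"user": "alice", "amount": 42})
# Roughly: SELECT user FROM orders WHERE id = 1001
print(db.select("orders", "1001", columns=["user"]))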
The in-memory computing (IMC) paradigm emerges as an effective solution to break the bottlenecks of the conventional von Neumann architecture. In the current work, an approximate multiplier in a spin-orbit torque magnetoresistive random access memory (SOT-MRAM) based true IMC (STIMC) architecture was presented, where computations were performed natively within the cell array instead of in peripheral circuits. First, basic Boolean logic operations were realized by utilizing the features of the unipolar SOT device. Two majority-gate-based imprecise compressors and an ultra-efficient approximate multiplier were then built to reduce the energy and latency. An optimized data mapping strategy facilitating bit-serial operations with an extensive degree of parallelism was also adopted. Finally, the performance enhancements obtained by applying our approximate multiplier to image smoothing were demonstrated. Detailed simulation results show that the proposed 8×8 approximate multiplier could reduce the energy and latency by at least 74.2% and 44.4%, respectively, compared with existing designs. Moreover, the scheme could achieve improved peak signal-to-noise ratio (PSNR) and structural similarity index metric (SSIM), ensuring high-quality image processing outcomes.
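A majority gate computes the carry of a full adder exactly (carry = MAJ(a, b, c)), and an imprecise compressor can reuse it to approximate the sum bit, trading a small error rate for fewer gates. The toy sketch below checks that trade-off over all input patterns; the specific sum approximation is an assumption, not the paper's compressor design.

from itertools import product

def maj(a, b, c):
    """3-input majority gate: 1 if at least two inputs are 1 (exact full-adder carry)."""
    return (a & b) | (b & c) | (a & c)

def approx_compressor(a, b, c):
    """Imprecise 3:2 compressor: exact carry, sum approximated as NOT(carry).
    The approximation is wrong only for inputs 000 and 111."""
    carry = maj(a, b, c)
    return carry, 1 - carry

errors = 0
for a, b, c in product((0, 1), repeat=3):
    exact_sum = (a + b + c) & 1
    _, s = approx_compressor(a, b, c)
    errors += (s != exact_sum)
print(f"error cases: {errors}/8")  # 2 of 8 input patterns are wrong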
Combining the logical function and memory characteristics of transistors is an ideal strategy for enhancing the computational efficiency of transistor devices. Here, we rationally design a tri-gate two-dimensional (2D) ferroelectric van der Waals heterostructure device based on copper indium thiophosphate (CuInP_(2)S_(6)) and few-layer tungsten disulfide (WS_(2)), and demonstrate its multi-functional applications in multi-valued data states, non-volatile storage, and logic operation. By co-regulating the input signals across the tri-gate, we show that the device can switch functions flexibly at a low supply voltage of 6 V, giving rise to an ultra-high current switching ratio of 10^7 and a low subthreshold swing of 53.9 mV/dec. These findings offer perspectives on designing smart 2D devices with excellent functions based on ferroelectric van der Waals heterostructures.
The Rowhammer bug is a novel micro-architectural security threat, enabling powerful privilege-escalation attacks on various mainstream platforms. It works by actively flipping bits in Dynamic Random Access Memory (DRAM) cells with unprivileged instructions. In order to set up Rowhammer against binaries in the Linux page cache, the Waylaying algorithm has previously been proposed. The Waylaying method stealthily relocates binaries onto exploitable physical addresses without exhausting system memory. However, the proof-of-concept Waylaying algorithm can be easily detected during page cache eviction because of its high disk I/O overhead and long running time. This paper proposes the more advanced Memway algorithm, which improves on Waylaying in terms of both I/O overhead and speed. Running time and disk I/O overhead are reduced by 90% by utilizing Linux tmpfs and in-memory swapping to manage eviction files. Furthermore, by combining Memway with the unprivileged posix_fadvise API, the binary relocation step is made 100 times faster. Equipped with our Memway+fadvise relocation scheme, we demonstrate practical Rowhammer attacks that take only 15–200 minutes to covertly relocate a victim binary, and less than 3 seconds to flip the target instruction bit.
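The relocation step leans on the unprivileged posix_fadvise(POSIX_FADV_DONTNEED) hint, which asks the kernel to drop a file's pages from the page cache so that the binary is read back in (and possibly placed on different physical frames) on its next access. A hedged, Linux-only sketch of that single step is shown below; the file path is a placeholder, and this does not implement the Memway relocation or the attack.

import os

def drop_from_page_cache(path):
    """Hint the kernel to evict this file's cached pages (Linux, unprivileged).
    On the next access the file is read back in and may land on new physical frames."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 covers the whole file
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

if __name__ == "__main__":
    drop_from_page_cache("/tmp/victim-binary")  # placeholder path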
Driven by the increasing requirements of high-performance computing applications, supercomputers are prone to contain more and more computing nodes. Applications running on such a large-scale computing system are likely to spawn millions of parallel processes, which usually generate a burst of I/O requests, introducing a great challenge to the metadata management of underlying parallel file systems. The traditional method used to overcome such a challenge is to adopt multiple metadata servers in a scale-out manner, which inevitably confronts serious network and consistency problems. This work instead pursues enhancing metadata performance in a scale-up manner. Specifically, we propose to improve the performance of each individual metadata server by employing the GPU to handle metadata requests in parallel. Our proposal designs a novel metadata server architecture, which employs the CPU to interact with file system clients while offloading the computing tasks about metadata onto the GPU. To take full advantage of the parallelism existing in the GPU, we redesign the in-memory data structure for the namespace of file systems. The new data structure fits well with the memory architecture of the GPU, and thus helps exploit the large number of parallel threads within the GPU to serve bursty metadata requests concurrently. We implement a prototype based on BeeGFS and conduct extensive experiments to evaluate our proposal, and the experimental results demonstrate that our GPU-based solution outperforms the CPU-based scheme by more than 50% under typical metadata operations. The superiority is strengthened further in highly concurrent scenarios, e.g., high-performance computing systems supporting millions of parallel threads.
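A namespace that keeps thousands of GPU threads busy favors a flat, array-based layout over a pointer-chasing tree, so each lookup thread can probe an open-addressed slot independently. The CPU-side Python model below sketches such a flat namespace table; the hashing scheme and entry layout are illustrative assumptions, not the paper's or BeeGFS's actual structure.

class FlatNamespace:
    """Open-addressed hash table over (parent_id, name) -> inode id.
    A flat array of fixed-size slots maps naturally onto GPU global memory,
    letting one thread per lookup probe independently."""
    def __init__(self, capacity=1 << 16):
        self.capacity = capacity
        self.slots = [None] * capacity            # each slot: (parent_id, name, inode)

    def _probe(self, parent_id, name):
        h = hash((parent_id, name)) % self.capacity
        while True:
            yield h
            h = (h + 1) % self.capacity           # linear probing

    def insert(self, parent_id, name, inode):
        for idx in self._probe(parent_id, name):
            if self.slots[idx] is None or self.slots[idx][:2] == (parent_id, name):
                self.slots[idx] = (parent_id, name, inode)
                return

    def lookup(self, parent_id, name):
        for idx in self._probe(parent_id, name):
            slot = self.slots[idx]
            if slot is None:
                return None
            if slot[:2] == (parent_id, name):
                return slot[2]

ns = FlatNamespace()
ns.insert(parent_id=1, name="data", inode=42)       # /data
ns.insert(parent_id=42, name="run.log", inode=43)   # /data/run.log
print(ns.lookup(42, "run.log"))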
With the rapid growth of computer science and big data, the traditional von Neumann architecture suffers from aggravating data communication costs due to the separated structure of processing units and memories. The memristive in-memory computing paradigm is considered a prominent candidate to address these issues, and plentiful applications have been demonstrated and verified. These applications can be broadly categorized into two major types: soft computing, which can tolerate uncertain and imprecise results, and hard computing, which emphasizes explicit and precise numerical results for each task, leading to different requirements on computational accuracy and the corresponding hardware solutions. In this review, we conduct a thorough survey of the recent advances in memristive in-memory computing applications, covering both the soft computing type, which focuses on artificial neural networks and other machine learning algorithms, and the hard computing type, which includes scientific computing and digital image processing. At the end of the review, we discuss the remaining challenges and future opportunities of memristive in-memory computing in the incoming Artificial Intelligence of Things era.
Similarity search, that is, finding similar items in massive data, is a fundamental computing problem in many fields such as data mining and information retrieval. However, for large-scale and high-dimension data, it suffers from high computational complexity, requiring tremendous computation resources. Here, based on low-power self-selective memristors, for the first time, we propose an in-memory search (IMS) system with two innovative designs. First, by exploiting the natural distribution law of the devices' resistance, a hardware locality-sensitive hashing encoder has been designed to transform real-valued vectors into more efficient binary codes. Second, a compact memristive ternary content-addressable memory is developed to calculate the Hamming distances between the binary codes in parallel. Our IMS system demonstrated a 168× energy efficiency improvement over all-transistor counterparts in clustering and classification tasks, while achieving software-comparable accuracy, thus providing a low-complexity and low-power solution for in-memory data mining applications.
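Random-hyperplane locality-sensitive hashing turns a real-valued vector into a short binary code whose Hamming distance tracks the original similarity, which is what the memristive TCAM then compares in parallel. A software sketch of the encoding and distance step follows; the hyperplanes here are plain Gaussian draws, whereas the paper derives them from the devices' resistance distribution.

import numpy as np

rng = np.random.default_rng(0)

def lsh_encode(vectors, planes):
    """Sign of the projection onto each random hyperplane gives one code bit."""
    return (vectors @ planes.T > 0).astype(np.uint8)

def hamming(code_a, code_b):
    """Number of differing bits; the memristive TCAM computes this in parallel in hardware."""
    return int(np.count_nonzero(code_a != code_b))

dim, bits = 64, 32
planes = rng.normal(size=(bits, dim))      # hardware: device-resistance distribution instead
x = rng.normal(size=dim)
y = x + 0.1 * rng.normal(size=dim)         # a near neighbor of x
z = rng.normal(size=dim)                   # an unrelated vector

cx, cy, cz = (lsh_encode(v[None, :], planes)[0] for v in (x, y, z))
print("near :", hamming(cx, cy))
print("far  :", hamming(cx, cz))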
Exploring materials with multiple properties that can endow a simple device with integrated functionalities has attracted enormous attention in the microelectronic field. One reason is the imperious demand for processors with continuously higher performance and totally new architectures. Combining ferroelectric with semiconducting properties is a promising solution. Here, we show that logic, in-memory computing, and optoelectrical logic and non-volatile computing functionalities can be integrated into a single transistor with ferroelectric semiconducting α-In2Se3 as the channel. Two-input AND, OR, and nonvolatile NOR and NAND logic operations with current on/off ratios reaching up to five orders of magnitude, good endurance (1000 operation cycles), and fast operating speed (10 μs) are realized. In addition, optoelectrical OR logic and non-volatile implication (IMP) operations, as well as ternary-input optoelectrical logic and in-memory computing functions, are achieved by introducing light as an additional input signal. Our work highlights the potential of integrating complex logic functions and new types of computing into a simple device based on emerging ferroelectric semiconductors.
In-memory systems with erasure coding (EC) enabled are widely used to achieve high performance and data availability. However, as the scale of clusters grows, the server-level fail-slow problem is becoming increasingly frequent, which can create long tail latency. The influence of long tail latency is further amplified in EC-based systems due to the synchronous nature of multiple EC sub-operations. In this paper, we propose an EC-enabled in-memory storage system called ShortTail, which can achieve consistent performance and low latency for both reads and writes. First, ShortTail uses a lightweight request monitor to track the performance of each memory node and identify any fail-slow node. Second, ShortTail selectively performs degraded reads and redirected writes to avoid accessing fail-slow nodes. Finally, ShortTail employs an adaptive write strategy to reduce the write amplification of small writes. We implement ShortTail on top of Memcached and compare it with two baseline systems. The experimental results show that ShortTail can reduce the P99 tail latency by up to 63.77%; it also brings significant improvements in median latency and average latency.
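The per-request decision is the interesting part: if the node holding the requested chunk is flagged fail-slow, the read is served as a degraded read by reconstructing the chunk from any k healthy chunks of the same stripe instead of waiting on the slow node. The sketch below shows that selection rule only; the monitor interface and stripe layout are assumptions, and real reconstruction would use the RS decoder.

def choose_read_plan(block_node, stripe_nodes, slow_nodes, k):
    """Return the nodes to contact for one read under an EC(k, m) stripe.

    block_node   -- node holding the requested data chunk
    stripe_nodes -- all nodes holding chunks of this stripe (data + parity)
    slow_nodes   -- nodes currently flagged fail-slow by the request monitor
    k            -- number of chunks needed to reconstruct any block
    """
    if block_node not in slow_nodes:
        return {"mode": "normal", "nodes": [block_node]}
    healthy = [n for n in stripe_nodes if n != block_node and n not in slow_nodes]
    if len(healthy) < k:
        # Not enough healthy peers: fall back to the slow node rather than fail.
        return {"mode": "normal", "nodes": [block_node]}
    return {"mode": "degraded", "nodes": healthy[:k]}

plan = choose_read_plan(block_node="n3",
                        stripe_nodes=["n1", "n2", "n3", "n4", "n5", "n6"],
                        slow_nodes={"n3"}, k=4)
print(plan)  # degraded read served from four healthy peers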
We have witnessed exciting development of RAM technology in the past decade. The memory size grows rapidly and the price continues to decrease, so that it is feasible to deploy large amounts of RAM in a computer system. Several companies and research institutions have devoted a lot of resources to developing in-memory databases (IMDB) that execute queries after loading data into (virtual) memory in advance. The bloom of various in-memory databases prompts us to test and evaluate their performance objectively and fairly. Although existing database benchmarks like the Wisconsin benchmark and the TPC-X series have achieved great success, they do not suit in-memory databases due to the lack of consideration of the unique characteristics of an IMDB. In this study, we propose MemTest, a novel benchmark that concerns some major characteristics of an in-memory database. This benchmark constructs particular metrics, which cover the processing time, compression ratio, minimal memory space, and column strength of an in-memory database. We design a data model based on inter-bank transaction applications, and a data generator to support uniform and skewed data distributions. The MemTest workload includes a set of queries and transactions against the metrics and data model. Finally, we illustrate the efficacy of MemTest through implementations on two different in-memory databases.
Large-scale key-value stores are widely used in many Web-based systems to store huge amounts of data as (key, value) pairs. In order to reduce the latency of accessing such (key, value) pairs, an in-memory cache system is usually deployed between the front-end Web system and the back-end database system. In practice, a cache system may consist of a number of server nodes, and fault tolerance is a critical feature for maintaining the latency Service-Level Agreements (SLAs). In this paper, we present the design, implementation, analysis, and evaluation of R-Memcached, a reliable in-memory key-value cache system that is built on top of the popular Memcached software. R-Memcached exploits coding techniques to achieve reliability, and can tolerate up to two node failures. Our experimental results show that R-Memcached can maintain very good latency and throughput performance even during the period of node failures.
In-memory computing is an alternative method to effectively accelerate the massive data-computing tasks of artificial intelligence (AI) and break the memory wall. In this work, we propose a 2T1C DRAM structure for in-memory computing. It integrates a monolayer graphene transistor, a monolayer MoS_(2) transistor, and a capacitor in a two-transistor-one-capacitor (2T1C) configuration. In this structure, the storage node is in a similar position to that of one-transistor-one-capacitor (1T1C) dynamic random-access memory (DRAM), while an additional graphene transistor is used to accomplish nondestructive readout of the stored information. Furthermore, the ultralow leakage current of the MoS_(2) transistor enables the storage of multi-level voltages on the capacitor with a long retention time. The stored charges can effectively tune the channel conductance of the graphene transistor thanks to its excellent linearity, so that linear analog multiplication can be realized. Because of the almost unlimited cycling endurance of DRAM, our 2T1C DRAM has great potential for in situ training and recognition, which can significantly improve the recognition accuracy of neural networks.
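Because the graphene channel conductance varies roughly linearly with the stored gate voltage, the read current is, to first order, the product of the stored weight and the applied read voltage, which is the analog multiply used for in-memory multiply-accumulate. The toy numeric sketch below illustrates that relation only; the linear conductance model and the parameter values are illustrative assumptions, not measured device parameters.

def read_current(v_stored, v_read, g0=1e-6, alpha=2e-6):
    """Channel conductance assumed linear in the stored voltage:
    G = g0 + alpha * v_stored, so I = G * v_read realizes an analog multiply."""
    return (g0 + alpha * v_stored) * v_read

weights = [0.0, 0.5, 1.0, 1.5]   # multi-level voltages stored on the capacitors
inputs  = [0.1, 0.2, 0.1, 0.3]   # read voltages applied to each cell
mac = sum(read_current(w, v) for w, v in zip(weights, inputs))
print(f"accumulated current: {mac:.3e} A")  # analog multiply-accumulate along a column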
Ferroelectrics have great potential in the field of nonvolatile memory due to polarization states that can be programmed by an external electric field in a nonvolatile manner. However, complementary metal oxide semiconductor compatibility and the uniformity of ferroelectric performance after size scaling have always been two thorny issues hindering the practical application of ferroelectric memory devices. The emerging ferroelectricity of wurtzite-structure nitrides offers opportunities to circumvent this dilemma. This review covers the mechanism of ferroelectricity and domain dynamics in ferroelectric AlScN films. The performance optimization of AlScN films grown by different techniques is summarized, and their applications for memories and emerging in-memory computing are illustrated. Finally, the challenges and perspectives regarding the commercial avenue of ferroelectric AlScN are discussed.
Efficient cache management plays a vital role in in-memory data-parallel systems such as Spark, Tez, Storm, and HANA. Recent research, notably on the Least Reference Count (LRC) and Most Reference Distance (MRD) policies, has shown that dependency-aware cache management practices that consider the application's directed acyclic graph (DAG) perform well in Spark. However, these practices ignore the further relationship between RDDs and cache some redundant RDDs with the same child RDDs, which degrades memory performance. Hence, in memory-constrained situations, systems may encounter a performance bottleneck due to frequent data block replacement. In addition, the prefetch mechanisms in some cache management policies, such as MRD, are hard to trigger. In this paper, we propose a new cache management method called RDE (Redundant Data Eviction) that can fully utilize an application's DAG information to optimize the management result. By considering both the RDDs' dependencies and the reference sequence, we effectively evict RDDs with redundant features and free memory for incoming data blocks. Experiments show that RDE improves performance by an average of 55% compared to LRU, and by up to 48% and 20% compared to LRC and MRD, respectively. RDE also shows less sensitivity to memory bottlenecks, which means better availability in memory-constrained environments.
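The redundancy check is a DAG question: a cached RDD becomes a candidate for eviction once every child that would read it can still be produced from another cached RDD, so evicting it loses nothing. The toy sketch below expresses that rule; the DAG encoding and the "covered" test are simplified assumptions, not Spark's lineage API or RDE's full policy (which also weighs the reference sequence).

def redundant_rdds(dag, cached):
    """dag: {rdd: [children]}. A cached RDD is redundant if each of its children is
    either already cached or a direct child of some other cached RDD."""
    def covered(child, without):
        if child in cached and child != without:
            return True
        return any(child in dag.get(other, ()) for other in cached if other != without)

    return [r for r in cached
            if all(covered(c, without=r) for c in dag.get(r, ()))]

# rdd_a and rdd_b are both cached and feed the same child, so each makes the other redundant;
# an actual policy would evict only one of them.
dag = {"rdd_a": ["rdd_c"], "rdd_b": ["rdd_c"], "rdd_c": []}
cached = {"rdd_a", "rdd_b"}
print(redundant_rdds(dag, cached))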