A notable portion of cachelines in real-world workloads exhibits inner non-uniform access behaviors.However,modern cache management rarely considers this fine-grained feature,which impacts the effective cache capacity...A notable portion of cachelines in real-world workloads exhibits inner non-uniform access behaviors.However,modern cache management rarely considers this fine-grained feature,which impacts the effective cache capacity of contemporary high-performance spacecraft processors.To harness these non-uniform access behaviors,an efficient cache replacement framework featuring an auxiliary cache specifically designed to retain evicted hot data was proposed.This framework reconstructs the cache replacement policy,facilitating data migration between the main cache and the auxiliary cache.Unlike traditional cacheline-granularity policies,the approach excels at identifying and evicting infrequently used data,thereby optimizing cache utilization.The evaluation shows impressive performance improvement,especially on workloads with irregular access patterns.Benefiting from fine granularity,the proposal achieves superior storage efficiency compared with commonly used cache management schemes,providing a potential optimization opportunity for modern resource-constrained processors,such as spacecraft processors.Furthermore,the framework complements existing modern cache replacement policies and can be seamlessly integrated with minimal modifications,enhancing their overall efficacy.展开更多
The historical significance of the Stern–Gerlach(SG)experiment lies in its provision of the initial evidence for space quantization.Over time,its sequential form has evolved into an elegant paradigm that effectively ...The historical significance of the Stern–Gerlach(SG)experiment lies in its provision of the initial evidence for space quantization.Over time,its sequential form has evolved into an elegant paradigm that effectively illustrates the fundamental principles of quantum theory.To date,the practical implementation of the sequential SG experiment has not been fully achieved.In this study,we demonstrate the capability of programmable quantum processors to simulate the sequential SG experiment.The specific parametric shallow quantum circuits,which are suitable for the limitations of current noisy quantum hardware,are given to replicate the functionality of SG devices with the ability to perform measurements in different directions.Surprisingly,it has been demonstrated that Wigner’s SG interferometer can be readily implemented in our sequential quantum circuit.With the utilization of the identical circuits,it is also feasible to implement Wheeler’s delayed-choice experiment.We propose the utilization of cross-shaped programmable quantum processors to showcase sequential experiments,and the simulation results demonstrate a strong alignment with theoretical predictions.With the rapid advancement of cloud-based quantum computing,such as BAQIS Quafu,it is our belief that the proposed solution is well-suited for deployment on the cloud,allowing for public accessibility.Our findings not only expand the potential applications of quantum computers,but also contribute to a deeper comprehension of the fundamental principles underlying quantum theory.展开更多
The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks ...The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks with millions, or more, of vertices. The MATLAB language, with its mass of statistical functions, is a good choice to rapidly realize an algorithm prototype of complex networks. The performance of the MATLAB codes can be further improved by using graphic processor units (GPU). This paper presents the strategies and performance of the GPU implementation of a complex networks package, and the Jacket toolbox of MATLAB is used. Compared with some commercially available CPU implementations, GPU can achieve a speedup of, on average, 11.3x. The experimental result proves that the GPU platform combined with the MATLAB language is a good combination for complex network research.展开更多
Virtualization is the key technology of cloud computing. Network virtualization plays an important role in this field. Its performance is very relevant to network virtualizing. Nowadays its implementations are mainly ...Virtualization is the key technology of cloud computing. Network virtualization plays an important role in this field. Its performance is very relevant to network virtualizing. Nowadays its implementations are mainly based on the idea of Software Define Network (SDN). Open vSwitch is a sort of software virtual switch, which conforms to the OpenFlow protocol standard. It is basically deployed in the Linux kernel hypervisor. This leads to its performance relatively poor because of the limited system resource. In turn, the packet process throughput is very low.In this paper, we present a Cavium-based Open vSwitch implementation. The Cavium platform features with multi cores and couples of hard ac-celerators. It supports zero-copy of packets and handles packet more quickly. We also carry some experiments on the platform. It indicates that we can use it in the enterprise network or campus network as convergence layer and core layer device.展开更多
Network processors (NPs) are widely used for programmable and high-performance networks;however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. ...Network processors (NPs) are widely used for programmable and high-performance networks;however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. To solve these problems, this paper proposes an open, high-level, and portable programming language called “Phonepl”, which is independent from vendor-specific proprietary hardware and software but can be translated into an NP program with high performance especially in the memory use. A common NP hardware feature is that a whole packet is stored in DRAM, but the header is cached in SRAM. Phonepl has a hardware-independent abstraction of this feature so that it allows programmers mostly unconscious of this hardware feature. To implement the abstraction, four representations of packet data type that cover all the packet operations (including substring, concatenation, input, and output) are introduced. Phonepl have been implemented on Octeon NPs used in plug-ins for a network-virtualization environment called the VNode Infrastructure, and several packet-handling programs were evaluated. As for the evaluation result, the conversion throughput is close to the wire rate, i.e., 10 Gbps, and no packet loss (by cache miss) occurs when the packet size is 256 bytes or larger.展开更多
There are varieties of embedded systems in the world. It is a big challenge to optimize the instruction sets of System on Chips (SoCs) according to different systems' working environments. The idea of programmable...There are varieties of embedded systems in the world. It is a big challenge to optimize the instruction sets of System on Chips (SoCs) according to different systems' working environments. The idea of programmable instruction set is an effective method to gain embedded system's re-configurability. This letter presents a logic module for Java processor to be capable of using programmable instruction set. Cost (area, power, and timing) of the module is trivial. Such module is also reusable for other embedded system solutions besides Java systems.展开更多
Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t...Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.展开更多
Processors have been playing important roles in both communication infrastructure systems and terminals.In this paper,both application specific and general purpose processors for communications are discussed including...Processors have been playing important roles in both communication infrastructure systems and terminals.In this paper,both application specific and general purpose processors for communications are discussed including the roles,the history,the current situations,and the trends.One trend is that ASIPs(Application Specific Instruction-set Processors) are taking over ASICs(Application Specific Integrated Circuits) because of the increasing needs both on performance and compatibility of multi-modes.The trend opened opportunities for researchers crossing the boundary between communications and computer architecture.Another trend is the serverlization,i.e.,more infrastructure equipments are replaced by servers.The trend opened opportunities for researchers working towards high performance computing for communication,such as research on communication algorithm kernels and real time programming methods on servers.展开更多
In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware m...In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware multithread is discussed and a novel dynamic multithreaded architecture is proposed. The proposed architecture saves the energy wasted by removing idle threads without manipulation on the original architecture, fulfills a seamless switching mechanism which protects active threads and avoids pipeline stall during power mode switching. The report of an implemented dynamic multithreaded processor with 45 nm process from synthesis tool indicates that the area of dynamic multithreaded architecture is only 2.27% higher than the static one in achieving dynamic power dissipation, and consumes 1.3% more power in the same peak performance.展开更多
We present a broadband and polarization-insensitive unidirectional imager that operates at the visible part of the spectrum,where image formation occurs in one direction,while in the opposite direction,it is blocked.T...We present a broadband and polarization-insensitive unidirectional imager that operates at the visible part of the spectrum,where image formation occurs in one direction,while in the opposite direction,it is blocked.This approach is enabled by deep learning-driven diffractive optical design with wafer-scale nano-fabrication using high-purity fused silica to ensure optical transparency and thermal stability.Our design achieves unidirectional imaging across three visible wavelengths(covering red,green,and blue parts of the spectrum),and we experimentally validated this broadband unidirectional imager by creating high-fidelity images in the forward direction and generating weak,distorted output patterns in the backward direction,in alignment with our numerical simulations.This work demonstrates wafer-scale production of diffractive optical processors,featuring 16 levels of nanoscale phase features distributed across two axially aligned diffractive layers for visible unidirectional imaging.This approach facilitates mass-scale production of~0.5 billion nanoscale phase features per wafer,supporting high-throughput manufacturing of hundreds to thousands of multi-layer diffractive processors suitable for large apertures and parallel processing of multiple tasks.Beyond broadband unidirectional imaging in the visible spectrum,this study establishes a pathway for artificial-intelligence-enabled diffractive optics with versatile applications,signaling a new era in optical device functionality with industrial-level,massively scalable fabrication.展开更多
As superconducting quantum processors scale,a key challenge is maintaining high coherence times and fidelity control over numerous qubits.We propose an automatic frequency allocation method for frequency-tunable qubit...As superconducting quantum processors scale,a key challenge is maintaining high coherence times and fidelity control over numerous qubits.We propose an automatic frequency allocation method for frequency-tunable qubits that equally considers coherence-limited fidelity and crosstalk-induced control errors during the allocation process.By employing a weighted average of the objective functions for coherence time and crosstalk,we numerically calculate gate fidelity to establish an open-loop optimization for determining suitable weight factors.This results in an efficient objective function for frequency optimization.We apply our method to frequency-tunable transmon qubits with tunable couplers,both theoretically and experimentally.The numerical results demonstrate significant advantages,including substantial reductions in gate errors and faster operation times,especially at higher qubit counts.Experimentally,our approach successfully achieves approximately 99.9%single-qubit fidelity on a nine-qubit chip.展开更多
In this work,we present a parallel implementation of radiation hydrodynamics coupled with particle transport,utilizing software infrastructure JASMIN(J Adaptive Structured Meshes applications INfrastructure)which enca...In this work,we present a parallel implementation of radiation hydrodynamics coupled with particle transport,utilizing software infrastructure JASMIN(J Adaptive Structured Meshes applications INfrastructure)which encapsulates high-performance technology for the numerical simulation of complex applications.Two serial codes,radiation hydrodynamics RH2D and particle transport Sn2D,have been integrated into RHSn2D on JASMIN infrastructure,which can efficiently use thousands of processors to simulate the complex multi-physics phenomena.Moreover,the non-conforming processors strategy has ensured RHSn2D against the serious load imbalance between radiation hydrodynamics and particle transport for large scale parallel simulations.Numerical results show that RHSn2D achieves a parallel efficiency of 17.1%using 90720 cells on 8192 processors compared with 256 processors in the same problem.展开更多
It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to sim...It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to simple, and new innovative architecture will emerge to utilize the continuously increasing transistor budgets. The growing importance of wire delays, changing workloads, power consumption, and design/verification complexity will drive the forthcoming era of Chip Multiprocessors (CMPs). Furthermore, typical CMP projects both from industries and from academics are investigated. Through going into depths for some primary theoretical and implementation problems of CMPs, the great challenges and opportunities to future CMPs are presented and discussed. Finally, the Godson series microprocessors designed in China are introduced.展开更多
The nonpreemptive assignment of independent tasks to a system of m uniform processors isexamined with the objective of minimizing the makespan. Using r_m, the ratio of the fasest speed tothe slowest speed of the syste...The nonpreemptive assignment of independent tasks to a system of m uniform processors isexamined with the objective of minimizing the makespan. Using r_m, the ratio of the fasest speed tothe slowest speed of the system, as a parameter, we assess the performance of LPT (largestprocessing time) schedule with respect to optimal schedules. It is shown thet the worst-case boundfor the ratio of the two schedule lengths is between展开更多
As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performan...As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve N-core processors’ yield by providing M spare cores. In such architecture, topology is an important factor because it greatly affects the processors’ performance. The concept of logical topology and a topology reconfiguration problem are introduced, which is able to transparently provide target topology with lowest performance degradation as the presence of faulty cores on-chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that PRCS can give solutions with average 13.8% degradation with negligible computing time.展开更多
Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support,...Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of the state-of-the-art prefetching techniques, and proposes a taxonomy that classifies various design concerns in developing a prefetching strategy, especially for multicore processors. We compare various existing methods through analysis as well.展开更多
Due to the increasing power consumption in modern computing systems, energy management has become an important research area in the last decade. Recently, multicore has emerged to be an energy efficient architecture t...Due to the increasing power consumption in modern computing systems, energy management has become an important research area in the last decade. Recently, multicore has emerged to be an energy efficient architecture that exploits parallelisms in modern applications. However, as the number of cores on a single chip continues to increase, it has been a grand challenge on how to effectively manage the energy efficiency of multicore-based systems. In this paper, based on the voltage island and dynamic voltage and frequency scaling (DVFS) techniques, we investigate the energy efficiency of block-partitioned multieore processors, where cores are grouped into blocks with the cores on one block sharing a DVFS- enabled power supply. Depending on the number of cores on each block, we study both symmetric and asymmetric block configurations. We develop a system-level power model (which can support various power management techniques) and derive both block- and system-wide energy-efficient frequencies for systems with block-partitioned multieore processors. Based on the power model, we prove that, for embarrassingly parallel applications, having all cores on a single block can achieve the same energy savings as that of the individual block configuration (where each core forms a single block and has its own power supply). However, for applications with limited degrees of parallelism, we show the superiority of the buddy-asymmetric block configuration, where the number of required blocks (and power supplies) is logarithmically related to the number of cores on the chip, in that it can achieve the same amount of energy savings as that of the individual block configuration. The energy efficiency of different block configurations is further evaluated through extensive simulations with both synthetic as well as a real life application.展开更多
文摘A notable portion of cachelines in real-world workloads exhibits inner non-uniform access behaviors.However,modern cache management rarely considers this fine-grained feature,which impacts the effective cache capacity of contemporary high-performance spacecraft processors.To harness these non-uniform access behaviors,an efficient cache replacement framework featuring an auxiliary cache specifically designed to retain evicted hot data was proposed.This framework reconstructs the cache replacement policy,facilitating data migration between the main cache and the auxiliary cache.Unlike traditional cacheline-granularity policies,the approach excels at identifying and evicting infrequently used data,thereby optimizing cache utilization.The evaluation shows impressive performance improvement,especially on workloads with irregular access patterns.Benefiting from fine granularity,the proposal achieves superior storage efficiency compared with commonly used cache management schemes,providing a potential optimization opportunity for modern resource-constrained processors,such as spacecraft processors.Furthermore,the framework complements existing modern cache replacement policies and can be seamlessly integrated with minimal modifications,enhancing their overall efficacy.
基金supported by Beijing Academy of Quantum Information Sciencessupported by the State Key Laboratory of Low Dimensional Quantum Physics+2 种基金the Start-up Fund provided by Tsinghua Universitythe financial support provided by the National Natural Science Foundation of China(Grant No.92065113)the Anhui Initiative in Quantum Information Technologies。
文摘The historical significance of the Stern–Gerlach(SG)experiment lies in its provision of the initial evidence for space quantization.Over time,its sequential form has evolved into an elegant paradigm that effectively illustrates the fundamental principles of quantum theory.To date,the practical implementation of the sequential SG experiment has not been fully achieved.In this study,we demonstrate the capability of programmable quantum processors to simulate the sequential SG experiment.The specific parametric shallow quantum circuits,which are suitable for the limitations of current noisy quantum hardware,are given to replicate the functionality of SG devices with the ability to perform measurements in different directions.Surprisingly,it has been demonstrated that Wigner’s SG interferometer can be readily implemented in our sequential quantum circuit.With the utilization of the identical circuits,it is also feasible to implement Wheeler’s delayed-choice experiment.We propose the utilization of cross-shaped programmable quantum processors to showcase sequential experiments,and the simulation results demonstrate a strong alignment with theoretical predictions.With the rapid advancement of cloud-based quantum computing,such as BAQIS Quafu,it is our belief that the proposed solution is well-suited for deployment on the cloud,allowing for public accessibility.Our findings not only expand the potential applications of quantum computers,but also contribute to a deeper comprehension of the fundamental principles underlying quantum theory.
基金Project supported by the Science Fund for Creative Research Groups of the National Natural Science Foundation of China (Grant No.60921062)the National Natural Science Foundation of China (Grant No.60873014)the Young Scientists Fund of the National Natural Science Foundation of China (Grant Nos.61003082 and 60903059)
文摘The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks with millions, or more, of vertices. The MATLAB language, with its mass of statistical functions, is a good choice to rapidly realize an algorithm prototype of complex networks. The performance of the MATLAB codes can be further improved by using graphic processor units (GPU). This paper presents the strategies and performance of the GPU implementation of a complex networks package, and the Jacket toolbox of MATLAB is used. Compared with some commercially available CPU implementations, GPU can achieve a speedup of, on average, 11.3x. The experimental result proves that the GPU platform combined with the MATLAB language is a good combination for complex network research.
文摘Virtualization is the key technology of cloud computing. Network virtualization plays an important role in this field. Its performance is very relevant to network virtualizing. Nowadays its implementations are mainly based on the idea of Software Define Network (SDN). Open vSwitch is a sort of software virtual switch, which conforms to the OpenFlow protocol standard. It is basically deployed in the Linux kernel hypervisor. This leads to its performance relatively poor because of the limited system resource. In turn, the packet process throughput is very low.In this paper, we present a Cavium-based Open vSwitch implementation. The Cavium platform features with multi cores and couples of hard ac-celerators. It supports zero-copy of packets and handles packet more quickly. We also carry some experiments on the platform. It indicates that we can use it in the enterprise network or campus network as convergence layer and core layer device.
文摘Network processors (NPs) are widely used for programmable and high-performance networks;however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. To solve these problems, this paper proposes an open, high-level, and portable programming language called “Phonepl”, which is independent from vendor-specific proprietary hardware and software but can be translated into an NP program with high performance especially in the memory use. A common NP hardware feature is that a whole packet is stored in DRAM, but the header is cached in SRAM. Phonepl has a hardware-independent abstraction of this feature so that it allows programmers mostly unconscious of this hardware feature. To implement the abstraction, four representations of packet data type that cover all the packet operations (including substring, concatenation, input, and output) are introduced. Phonepl have been implemented on Octeon NPs used in plug-ins for a network-virtualization environment called the VNode Infrastructure, and several packet-handling programs were evaluated. As for the evaluation result, the conversion throughput is close to the wire rate, i.e., 10 Gbps, and no packet loss (by cache miss) occurs when the packet size is 256 bytes or larger.
基金Supported by the Guangzhou Key Technology R&D Program (No.2007Z2-D0011)
文摘There are varieties of embedded systems in the world. It is a big challenge to optimize the instruction sets of System on Chips (SoCs) according to different systems' working environments. The idea of programmable instruction set is an effective method to gain embedded system's re-configurability. This letter presents a logic module for Java processor to be capable of using programmable instruction set. Cost (area, power, and timing) of the module is trivial. Such module is also reusable for other embedded system solutions besides Java systems.
基金Project(2008AA01A201) supported the National High-tech Research and Development Program of ChinaProjects(60833004, 60633050) supported by the National Natural Science Foundation of China
文摘Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.
基金The National High-Tech Research and Development Program of China(863 Program)2014AA01A705
文摘Processors have been playing important roles in both communication infrastructure systems and terminals.In this paper,both application specific and general purpose processors for communications are discussed including the roles,the history,the current situations,and the trends.One trend is that ASIPs(Application Specific Instruction-set Processors) are taking over ASICs(Application Specific Integrated Circuits) because of the increasing needs both on performance and compatibility of multi-modes.The trend opened opportunities for researchers crossing the boundary between communications and computer architecture.Another trend is the serverlization,i.e.,more infrastructure equipments are replaced by servers.The trend opened opportunities for researchers working towards high performance computing for communication,such as research on communication algorithm kernels and real time programming methods on servers.
基金supported partially by the National High Technical Research and Development Program of China (863 Program) under Grants No. 2011AA040101, No. 2008AA01Z134the National Natural Science Foundation of China under Grants No. 61003251, No. 61172049, No. 61173150+2 种基金the Doctoral Fund of Ministry of Education of China under Grant No. 20100006110015Beijing Municipal Natural Science Foundation under Grant No. Z111100054011078the 2012 Ladder Plan Project of Beijing Key Laboratory of Knowledge Engineering for Materials Science under Grant No. Z121101002812005
文摘In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware multithread is discussed and a novel dynamic multithreaded architecture is proposed. The proposed architecture saves the energy wasted by removing idle threads without manipulation on the original architecture, fulfills a seamless switching mechanism which protects active threads and avoids pipeline stall during power mode switching. The report of an implemented dynamic multithreaded processor with 45 nm process from synthesis tool indicates that the area of dynamic multithreaded architecture is only 2.27% higher than the static one in achieving dynamic power dissipation, and consumes 1.3% more power in the same peak performance.
基金Ozcan Lab at UCLA acknowledges the U.S.Department of Energy(DOE),Office of Basic Energy Sciences,Division of Materials Sciences and Engineering under award no.DE-SC0023088.
文摘We present a broadband and polarization-insensitive unidirectional imager that operates at the visible part of the spectrum,where image formation occurs in one direction,while in the opposite direction,it is blocked.This approach is enabled by deep learning-driven diffractive optical design with wafer-scale nano-fabrication using high-purity fused silica to ensure optical transparency and thermal stability.Our design achieves unidirectional imaging across three visible wavelengths(covering red,green,and blue parts of the spectrum),and we experimentally validated this broadband unidirectional imager by creating high-fidelity images in the forward direction and generating weak,distorted output patterns in the backward direction,in alignment with our numerical simulations.This work demonstrates wafer-scale production of diffractive optical processors,featuring 16 levels of nanoscale phase features distributed across two axially aligned diffractive layers for visible unidirectional imaging.This approach facilitates mass-scale production of~0.5 billion nanoscale phase features per wafer,supporting high-throughput manufacturing of hundreds to thousands of multi-layer diffractive processors suitable for large apertures and parallel processing of multiple tasks.Beyond broadband unidirectional imaging in the visible spectrum,this study establishes a pathway for artificial-intelligence-enabled diffractive optics with versatile applications,signaling a new era in optical device functionality with industrial-level,massively scalable fabrication.
文摘As superconducting quantum processors scale,a key challenge is maintaining high coherence times and fidelity control over numerous qubits.We propose an automatic frequency allocation method for frequency-tunable qubits that equally considers coherence-limited fidelity and crosstalk-induced control errors during the allocation process.By employing a weighted average of the objective functions for coherence time and crosstalk,we numerically calculate gate fidelity to establish an open-loop optimization for determining suitable weight factors.This results in an efficient objective function for frequency optimization.We apply our method to frequency-tunable transmon qubits with tunable couplers,both theoretically and experimentally.The numerical results demonstrate significant advantages,including substantial reductions in gate errors and faster operation times,especially at higher qubit counts.Experimentally,our approach successfully achieves approximately 99.9%single-qubit fidelity on a nine-qubit chip.
基金National Natural Science Foundation of China(12471367)。
文摘In this work,we present a parallel implementation of radiation hydrodynamics coupled with particle transport,utilizing software infrastructure JASMIN(J Adaptive Structured Meshes applications INfrastructure)which encapsulates high-performance technology for the numerical simulation of complex applications.Two serial codes,radiation hydrodynamics RH2D and particle transport Sn2D,have been integrated into RHSn2D on JASMIN infrastructure,which can efficiently use thousands of processors to simulate the complex multi-physics phenomena.Moreover,the non-conforming processors strategy has ensured RHSn2D against the serious load imbalance between radiation hydrodynamics and particle transport for large scale parallel simulations.Numerical results show that RHSn2D achieves a parallel efficiency of 17.1%using 90720 cells on 8192 processors compared with 256 processors in the same problem.
基金Supported by the National Natural Science Foundation of China for Distinguished Young Scholar under Grant No. 60325205 the National High Technology Development 863 Program of China under Grants No. 2002AA110010, No. 2005AA110010 No. 2005AA119020, and the National Grand Fundamental Research 973 Program of China under Grant No. 2005CB321600.
文摘It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to simple, and new innovative architecture will emerge to utilize the continuously increasing transistor budgets. The growing importance of wire delays, changing workloads, power consumption, and design/verification complexity will drive the forthcoming era of Chip Multiprocessors (CMPs). Furthermore, typical CMP projects both from industries and from academics are investigated. Through going into depths for some primary theoretical and implementation problems of CMPs, the great challenges and opportunities to future CMPs are presented and discussed. Finally, the Godson series microprocessors designed in China are introduced.
文摘The nonpreemptive assignment of independent tasks to a system of m uniform processors isexamined with the objective of minimizing the makespan. Using r_m, the ratio of the fasest speed tothe slowest speed of the system, as a parameter, we assess the performance of LPT (largestprocessing time) schedule with respect to optimal schedules. It is shown thet the worst-case boundfor the ratio of the two schedule lengths is between
基金the National Natural Science Foundation of China (Nos. 60633060, 60606008, and 60576031)the National Key Basic Research and Development (973) Program of China (973)(Nos. 2005CB321604 and 2005CB321605)the fund of Chinese Academy of Sciences (No. 20074010) due to the President Scholarship
文摘As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve N-core processors’ yield by providing M spare cores. In such architecture, topology is an important factor because it greatly affects the processors’ performance. The concept of logical topology and a topology reconfiguration problem are introduced, which is able to transparently provide target topology with lowest performance degradation as the presence of faulty cores on-chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that PRCS can give solutions with average 13.8% degradation with negligible computing time.
基金supported in part by the National Science Foundation of USA under Grant Nos.EIA-0224377,CNS-0406328,CNS-0509118,and CCF-0621435.
文摘Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of the state-of-the-art prefetching techniques, and proposes a taxonomy that classifies various design concerns in developing a prefetching strategy, especially for multicore processors. We compare various existing methods through analysis as well.
基金supported in part by NSF Awards of USA under Grant Nos. CNS-0855247,CNS-1016974,and NSF CAREER Award of USA under Grant No. CNS-0953005
文摘Due to the increasing power consumption in modern computing systems, energy management has become an important research area in the last decade. Recently, multicore has emerged to be an energy efficient architecture that exploits parallelisms in modern applications. However, as the number of cores on a single chip continues to increase, it has been a grand challenge on how to effectively manage the energy efficiency of multicore-based systems. In this paper, based on the voltage island and dynamic voltage and frequency scaling (DVFS) techniques, we investigate the energy efficiency of block-partitioned multieore processors, where cores are grouped into blocks with the cores on one block sharing a DVFS- enabled power supply. Depending on the number of cores on each block, we study both symmetric and asymmetric block configurations. We develop a system-level power model (which can support various power management techniques) and derive both block- and system-wide energy-efficient frequencies for systems with block-partitioned multieore processors. Based on the power model, we prove that, for embarrassingly parallel applications, having all cores on a single block can achieve the same energy savings as that of the individual block configuration (where each core forms a single block and has its own power supply). However, for applications with limited degrees of parallelism, we show the superiority of the buddy-asymmetric block configuration, where the number of required blocks (and power supplies) is logarithmically related to the number of cores on the chip, in that it can achieve the same amount of energy savings as that of the individual block configuration. The energy efficiency of different block configurations is further evaluated through extensive simulations with both synthetic as well as a real life application.