Row fixation is a parallel algorithm based on MPI that can be implemented on high-performance computer systems. It preserves the characteristics of matrices, since row computations are fixed on different nodes. The locality of computation is therefore realized effectively, and a good speedup ratio is obtained for large-scale parallel computations such as solving linear equations by Gaussian elimination, LU decomposition of matrices, and the m-th power of matrices.
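The row-fixation idea can be sketched as a cyclic row distribution: each row is pinned to one node for the whole computation, so only the pivot row needs to be communicated at each elimination step. A minimal sketch (the function names are illustrative, not from the paper):

```python
# Cyclic row distribution for row-fixed parallel Gaussian elimination:
# each row lives permanently on one node; only the pivot row is broadcast.

def owner_of_row(i, n_nodes):
    """Node that permanently holds row i under cyclic distribution."""
    return i % n_nodes

def local_rows(rank, n_rows, n_nodes):
    """Indices of the rows fixed on a given node; this data never migrates."""
    return [i for i in range(n_rows) if i % n_nodes == rank]
```

In an MPI implementation, at step k the owner of row k broadcasts it, and every node eliminates only within its own fixed rows, which is what keeps the computation local.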
The distributed nonconvex optimization problem of minimizing a global cost function formed by a sum of n local cost functions using local information exchange is considered. This problem is an important component of many machine learning techniques with data parallelism, such as deep learning and federated learning. We propose a distributed primal-dual stochastic gradient descent (SGD) algorithm, suitable for arbitrarily connected communication networks and any smooth (possibly nonconvex) cost functions. We show that the proposed algorithm achieves the linear-speedup convergence rate O(1/√(nT)) for general nonconvex cost functions, and the linear-speedup convergence rate O(1/(nT)) when the global cost function satisfies the Polyak-Łojasiewicz (P-L) condition, where T is the total number of iterations. We also show that the output of the proposed algorithm with constant parameters linearly converges to a neighborhood of a global optimum. We demonstrate through numerical experiments the efficiency of our algorithm in comparison with the baseline centralized SGD and recently proposed distributed SGD algorithms.
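The data-parallel setting can be illustrated with a toy consensus-plus-gradient update: n agents on a ring each hold a local quadratic cost and mix their iterates with neighbors before taking a gradient step. This is a deliberately simplified sketch (deterministic gradients, plain mixing, no dual variables), not the paper's primal-dual SGD update:

```python
# Toy decentralized gradient sketch: n agents on a ring, each with a local
# quadratic cost f_i(x) = (x - a_i)^2 / 2, so the global optimum is the
# mean of the a_i. Simplified consensus illustration, not the paper's
# exact primal-dual SGD algorithm.

def decentralized_gd(a, steps=200, lr=0.1):
    n = len(a)
    x = [0.0] * n                                  # each agent's local copy
    for _ in range(steps):
        # mix with ring neighbours (doubly stochastic uniform weights)
        mixed = [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3.0
                 for i in range(n)]
        # local gradient step on f_i: f_i'(x) = x - a_i
        x = [m - lr * (m - ai) for m, ai in zip(mixed, a)]
    return x
```

With a constant stepsize the agents converge to a neighborhood of the global optimum (here, the mean of the a_i), mirroring the constant-parameter result stated in the abstract.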
Parallel computing is a promising approach to alleviate the computational demand of conducting large-scale finite element analyses. This paper presents a numerical modeling approach for earthquake ground response and liquefaction using the parallel nonlinear finite element program ParCYCLIC, designed for distributed-memory message-passing parallel computer systems. In ParCYCLIC, finite elements are employed within an incremental plasticity, coupled solid-fluid formulation. A constitutive model calibrated by physical tests represents the salient characteristics of sand liquefaction and the associated accumulation of shear deformations. Key elements of the computational strategy employed in ParCYCLIC include the development of a parallel sparse direct solver, the deployment of an automatic domain decomposer, and the use of the multilevel nested dissection algorithm for ordering of the finite element nodes. Simulation results of centrifuge test models using ParCYCLIC are presented. Performance results from grid models and geotechnical simulations show that ParCYCLIC is efficiently scalable to a large number of processors.
We investigate the quantum speed limit time (QSLT) of a two-level atom under quantum-jump-based feedback control or homodyne-based feedback control. Our results show that the two different feedback control schemes have different influences on the evolutionary speed. By adjusting the feedback parameters, the quantum-jump-based feedback control can induce speedup of the atomic evolution from an excited state, but the homodyne-based feedback control cannot change the evolutionary speed. Additionally, the QSLT for the whole dynamical process is explored. Under the quantum-jump-based feedback control, the QSLT displays oscillatory behavior, which implies multiple speed-up and speed-down processes during the evolution. In contrast, the homodyne-based feedback control can accelerate the speed-up process and improve the uniform speed in the uniform evolution process.
With the rise of image data and increased complexity of tasks in edge detection, conventional artificial intelligence techniques have been severely impacted. To be able to solve even greater problems of the future, learning algorithms must maintain high speed and accuracy through economical means. Traditional edge detection approaches cannot detect edges in images in a timely manner due to memory and computational time constraints. In this work, a novel parallelized ant colony optimization technique in a distributed framework provided by the Hadoop/Map-Reduce infrastructure is proposed to improve the edge detection capabilities. Moreover, a filtering technique is applied to reduce the noisy background of images to achieve significant improvement in the accuracy of edge detection. Close examinations of the implementation of the proposed algorithm are discussed and demonstrated through experiments. Results reveal high classification accuracy and significant improvements in speedup, scaleup and sizeup compared to the standard algorithms.
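The map/reduce split can be illustrated with a toy per-tile mapper and a summing reducer. The mapper below stands in for the paper's ant-colony edge detector with a plain horizontal-gradient threshold (the threshold value and function names are illustrative):

```python
# Minimal map/reduce-style sketch of tiling an image for parallel edge
# detection. The per-tile mapper here uses a simple gradient threshold as
# a stand-in for the paper's ant colony optimizer.

def map_tile(tile):
    """Mapper: count edge pixels in one row-tile via horizontal gradient."""
    edges = 0
    for row in tile:
        for a, b in zip(row, row[1:]):
            if abs(a - b) > 10:        # assumed intensity threshold
                edges += 1
    return edges

def reduce_counts(counts):
    """Reducer: combine the per-tile partial results."""
    return sum(counts)
```

In the Hadoop setting, each mapper would process its tile independently on a different node, and the reducer would merge the partial edge maps; only the tiling/merging pattern is shown here.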
Supersonic viscous flows past blunt bodies are calculated with a TVD difference scheme and the implicit Lower-Upper Symmetric Gauss-Seidel (LU-SGS) method. The message-passing parallel programming platform PVM is used to distribute a large task, according to certain partitioning strategies, to a large number of processors in the network, which then accomplish the task together. The marked improvement of computational efficiency in networks, especially in MPP systems, demonstrates the potential vitality of CFD in engineering design.
An optimal algorithmic approach to task scheduling for the triplet-based architecture (TriBA) is proposed in this paper. TriBA is considered to be a high-performance, distributed parallel computing architecture. It consists of a 2D grid of small, programmable processing units, each physically connected to its three neighbors. In a parallel or distributed environment, an efficient assignment of tasks to the processing elements is imperative to achieve fast job turnaround time. Moreover, the sojourn time experienced by each individual job should be minimized. The arriving jobs are parallel applications, each consisting of multiple independent tasks that must be assigned to processor queues immediately as they arrive. The processors independently and concurrently service these tasks. The key scheduling issue is that, when some queue backlogs are small, an incoming job should first spread its tasks to those lightly loaded queues in order to take advantage of the parallel processing gain. Our algorithmic approach achieves optimality in task scheduling by assigning consecutive tasks to a triplet of processors, exploiting locality in tasks. The experimental results show that task allocation to triplets of processing elements is efficient and optimal. A comparison with the well-accepted 2D mesh interconnection strategy demonstrates the effectiveness of our algorithmic approach for TriBA. Finally, we conclude that TriBA can be an efficient interconnection strategy for computation-intensive applications if task assignment is carried out optimally using the algorithmic approach.
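The locality-exploiting assignment of consecutive tasks to processor triplets can be sketched as follows (an illustrative block-of-three, round-robin policy, not necessarily the paper's exact scheduler):

```python
# Sketch of assigning consecutive tasks to triplets of processors: blocks
# of three related tasks stay on one triplet, and triplets are chosen
# round-robin so queue backlogs stay balanced. Illustrative policy only.

def assign_to_triplets(n_tasks, n_triplets):
    """Map each task to a triplet: blocks of 3 consecutive tasks share one."""
    schedule = {t: [] for t in range(n_triplets)}
    for task in range(n_tasks):
        triplet = (task // 3) % n_triplets   # block of 3 -> one triplet
        schedule[triplet].append(task)
    return schedule
```

Keeping a block of three tasks on one triplet mirrors the paper's idea that consecutive tasks exhibit locality, while the round-robin choice spreads load across lightly loaded triplets.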
Satellite cloud-derived wind inversion is a large-scale, computation-intensive task, and its time-consuming serial inversion algorithm is hard to push past the efficiency bottleneck. We propose a parallel acceleration scheme for the cloud-derived wind inversion algorithm based on MPI cluster parallel techniques. Following a divide-and-conquer approach, wind-vector inversion tasks are assigned to computing units according to a certain strategy, and each unit executes its assigned tasks in parallel, reducing the efficiency bottleneck of long inversion times caused by serial time accumulation. Within this MPI-based scheme, an algorithm based on performance prediction is proposed to effectively implement load balancing across the cluster. Comparative analysis of experimental data shows that the parallel scheme noticeably accelerates the cloud-derived wind inversion algorithm: the speedup of the MPI-based parallel algorithm reaches 14.96, matching the expected estimate. We also propose an efficiency optimization algorithm for cloud-derived wind inversion; with minimal loss of wind-vector accuracy, the optimized algorithm runs up to 13 times faster.
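The performance-prediction-based load balancing can be sketched as a static partition proportional to each rank's predicted speed. In the paper the speeds would come from the prediction model; here they are simply given, and the function name is illustrative:

```python
# Sketch of performance-prediction-based static load balancing: split
# n_tasks inversion tasks across ranks in proportion to a predicted speed
# per rank, handing leftover tasks to the fastest ranks first.

def balanced_partition(n_tasks, speeds):
    total = sum(speeds)
    shares = [int(n_tasks * s / total) for s in speeds]
    leftover = n_tasks - sum(shares)
    for rank in sorted(range(len(speeds)), key=lambda r: -speeds[r])[:leftover]:
        shares[rank] += 1
    return shares
```

Each MPI rank then inverts only its share of the wind vectors, so a rank predicted to be three times faster receives roughly three times the work.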
This study presents an efficient Boussinesq-type wave model accelerated by a single Graphics Processing Unit (GPU). The model uses a hybrid finite volume and finite difference method to solve weakly dispersive and nonlinear Boussinesq equations in the horizontal plane, enabling the model to have the shock-capturing ability to deal with breaking waves and a moving shoreline properly. The code is written in CUDA C. To achieve better performance, the model uses the cyclic reduction technique to solve massive tridiagonal linear systems, and overlapped tiling/shared memory to reduce global memory access and enhance data reuse. Four numerical tests are conducted to validate the GPU implementation. The performance of the GPU model is evaluated by running a series of numerical simulations on two GPU platforms with different hardware configurations. Compared with the CPU version, the maximum speedup ratios for single-precision and double-precision calculations are 55.56 and 32.57, respectively.
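Cyclic reduction, which the model uses for its tridiagonal systems, halves the system size at every level by eliminating the odd-indexed unknowns; all eliminations within a level are independent, which is what makes it GPU-friendly. A serial reference sketch (pure Python, sizes restricted to 2^k − 1, with the convention a[0] = c[-1] = 0):

```python
# Serial reference for cyclic reduction on a tridiagonal system
#   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
# with n = 2^k - 1 and a[0] = c[-1] = 0. Each level eliminates the
# odd-indexed unknowns; on a GPU all eliminations in a level run in parallel.

def cyclic_reduction(a, b, c, d):
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    na, nb, nc, nd = [], [], [], []
    for i in range(1, n, 2):            # fold equations i-1 and i+1 into i
        alpha = a[i] / b[i - 1]
        beta = c[i] / b[i + 1]
        na.append(-alpha * a[i - 1])
        nb.append(b[i] - alpha * c[i - 1] - beta * a[i + 1])
        nc.append(-beta * c[i + 1])
        nd.append(d[i] - alpha * d[i - 1] - beta * d[i + 1])
    x_odd = cyclic_reduction(na, nb, nc, nd)   # recurse on half-size system
    x = [0.0] * n
    for j, i in enumerate(range(1, n, 2)):
        x[i] = x_odd[j]
    for i in range(0, n, 2):            # back-substitute even unknowns
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
    return x
```

The CUDA version in the paper additionally stages each level in shared memory; only the numerical recurrence is shown here.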
Interference has been measured by the visibility in two-level systems, which, however, does not work for multi-level systems. We generalize a measure of interference based on the decoherence process, consistent with the visibility in qubit systems. Taking cluster states as examples, we show in one-way quantum computation that the gate fidelity is proportional to the interference of the measured qubit and inversely proportional to the interference of all register qubits. We also find that the interference increases with the number of computation steps. We therefore conjecture that interference may be the source of the speedup of one-way quantum computation.
Glacier-related mass flows (GMFs) in the high-mountain cryosphere have become more frequent in the last decade, e.g., the 2018 Sedongpu (SDP) GMFs in the Himalayas. Seismic forcing, thermal perturbation, and heavy rainfall are common triggers of GMFs, but the exact role of seismic forcing in GMF formation is poorly known due to the scarcity of observational data from real cases. Here the evolution processes of the GMFs and the detachment of the trunk glacier in SDP are reconstructed using remote sensing techniques, including feature tracking of multi-source optical satellite imagery and visual interpretation. The reconstruction demonstrates that the high frequency of GMF events in SDP after the Milin earthquake on 18 November 2017 was mainly attributed to earthquake-induced glacial stress changes and destabilisation. The post-earthquake velocity of the trunk glacier is about three times that of December 2016 and December 2017. The median glacier-surface velocity rose to 0.32 m/d between November 2017 and June 2018, 14%-77% higher than before the earthquake; the acceleration was initiated by the seismic forcing and then aggravated by the additional loading of ice/rock avalanches, infiltration of liquid water, progressive crevassing of the glacier, and local compressional deformation. The ensuing surge motion of the trunk glacier resulted from high temperature and heavy precipitation between July and September 2018. We infer that the trunk glacier became more sensitive to thermal perturbation after the Milin earthquake, which is the predominant cause of the sudden surge movement. These findings reveal the comprehensive mechanisms of quake-induced, low-angle glacial detachment and multisource-driven GMFs in the Himalayas.
A new version of the Institute of Atmospheric Physics (IAP) 9-Layer (9L) atmospheric general circulation model (AGCM) suitable for Massively Parallel Processors (MPP) has been developed. This paper presents the principles of the parallel code design and examines its performance on a variety of state-of-the-art parallel computers in China. A domain decomposition strategy, implemented with the Message Passing Interface (MPI), is used to achieve parallelism. Only the one-dimensional domain decomposition algorithm is shown to scale favorably as the number of processors is increased.
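One-dimensional domain decomposition for such a grid model amounts to splitting the latitude rows into contiguous bands, one per MPI process. A generic sketch (the IAP 9L model's actual decomposition details are not given here, and the function name is illustrative):

```python
# One-dimensional domain decomposition: split n_lat latitude rows into
# contiguous, near-equal bands, one band per MPI process.

def latitude_bands(n_lat, n_procs):
    """Return [start, end) latitude ranges, one per process."""
    base, extra = divmod(n_lat, n_procs)
    bands, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)   # spread the remainder
        bands.append((start, start + size))
        start += size
    return bands
```

Each process then integrates only its band, exchanging halo rows with its two neighbors; with a 1D split, each process has at most two communication partners, which helps the favorable scaling reported above.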
A new randomized parallel branch-and-bound (B&B) algorithm is presented, based on the similarity between heuristic search and statistics, and tested on a transputer network. The test results show that the algorithm has a high speedup ratio, reliability, flexibility, and fault tolerance.
The MacCormack explicit scheme and the Baldwin-Lomax algebraic turbulence model are employed to solve the axisymmetric compressible Navier-Stokes equations for the numerical simulation of supersonic flow interacting with transverse injection at the base of a cone. A temperature switch function must be added to the artificial viscosity model suggested by Jameson et al. to enhance the scheme's ability to eliminate oscillations for some injection cases. Typical code optimization techniques for vectorization, together with some useful concepts and terminology for multiprocessing on the YH-2 parallel supercomputer, are given and explained with examples. After reconstruction and optimization, the code achieves a speedup of 5.973 on the YH-1 pipeline computer, and speedups of 1.886 on 2 processors and 3.545 on 4 processors on the YH-2 parallel supercomputer using the domain decomposition method.
The nonlinear multisplitting method is known as a parallel iterative method for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model in which the iteration is executed asynchronously: each processor calculates the solution of an individual nonlinear system belonging to its nonlinear multisplitting and can update the global approximation residing in the shared memory at any time. A local convergence analysis of this model is presented. Finally, we give a numerical example which shows the 'strange' property that the speedup Sp > p and the efficiency Ep > 1.
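The flavor of the method can be shown on a tiny two-block system F(x) = (x1 − cos(x2)/2, x2 − sin(x1)/2) = 0, where each "processor" repeatedly re-solves its own block against the current shared approximation. The sketch below updates the blocks in turn; a truly asynchronous version would let threads write the shared vector at arbitrary times:

```python
# Toy nonlinear block iteration: two "processors" each solve one block of
# F(x) = 0 against the current shared approximation. The example system is
#   f1 = x1 - cos(x2)/2 = 0,   f2 = x2 - sin(x1)/2 = 0,
# chosen to be a contraction so the iteration converges.
import math

def block_iteration(steps=50):
    x1, x2 = 0.0, 0.0              # shared approximation
    for _ in range(steps):
        x1 = 0.5 * math.cos(x2)    # processor 1 solves its block
        x2 = 0.5 * math.sin(x1)    # processor 2 solves its block
    return x1, x2
```

Because each block solve here is exact and the map is contractive, the shared approximation converges regardless of the update order, which is what the asynchronous model exploits.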
Increasing needs for the study of complex dynamical systems require computing solutions of a large number of ordinary and partial differential time-dependent equations in near real-time. Numerical integration algorithms, which are computationally expensive and inherently sequential, are typically used to compute these solutions, which presents a challenge for studying complex dynamical systems in near real-time. This paper examines the challenges of computing solutions of ordinary differential time-dependent equations using the Parareal algorithm, which belongs to the class of parallel-in-time algorithms, on various high-performance accelerator-based computing architectures and their associated programming models. The paper presents the code refactoring steps and performance analysis of the Parareal algorithm on two accelerator architectures, the Intel Xeon Phi CPU and the Graphics Processing Unit many-core architectures, with the OpenMP, OpenACC, and CUDA programming models. Speedup and scaling analyses are used to demonstrate the suitability of Parareal for computing the solutions of a single ordinary differential time-dependent equation and of a family of interdependent ordinary differential time-dependent equations. The speedup and weak and strong scaling results show that Graphics Processing Units with the CUDA programming model are the most efficient accelerator for computing such solutions with parallel-in-time algorithms. However, considering the time and effort required to refactor the code for execution on the accelerator architectures, Graphics Processing Units with the OpenACC programming model are the most efficient choice.
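The Parareal iteration itself is compact enough to sketch directly. For du/dt = −u on [0, 1], a cheap coarse propagator G (one Euler step per subinterval) is corrected by an accurate fine propagator F (many Euler substeps); all F evaluations within an iteration are independent, which is the source of the parallelism. A minimal serial sketch, not the paper's accelerator code:

```python
# Minimal Parareal sketch for du/dt = -u on [0, 1].
# Update rule: U[n+1] <- G(U_new[n]) + F(U_old[n]) - G(U_old[n]).

N = 10                                   # subintervals (one per "processor")
DT = 1.0 / N

def fine(u, substeps=100):               # F: accurate integrator (serial)
    h = DT / substeps
    for _ in range(substeps):
        u += h * (-u)
    return u

def coarse(u):                           # G: one cheap Euler step
    return u + DT * (-u)

def parareal(u0, iterations):
    U = [u0]
    for _ in range(N):                   # serial coarse initial guess
        U.append(coarse(U[-1]))
    for _ in range(iterations):
        F = [fine(U[n]) for n in range(N)]       # parallelizable in principle
        G_old = [coarse(U[n]) for n in range(N)]
        new = [u0]
        for n in range(N):
            new.append(coarse(new[-1]) + F[n] - G_old[n])
        U = new
    return U
```

After k iterations the first k subinterval endpoints agree exactly with the sequential fine solution, so Parareal reproduces the fine result in at most N iterations; the speedup comes from converging in far fewer.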
As high-speed switches scale, achieving 100% throughput for multicast scheduling requires unbounded speedup and exponential growth in the number of virtual queues, so the problem of low multicast throughput under no speedup or a fixed crosspoint buffer size is addressed. Inspired by the load-balanced two-stage Birkhoff-von Neumann architecture, which provides 100% throughput for all kinds of unicast traffic, a novel three-stage architecture is proposed, consisting of a first stage for multicast fan-out splitting, a second stage for load balancing, and a last stage for switching (FSLBS). A dedicated multicast fan-out splitting to unicast (M2U) scheduling algorithm is developed for the first stage, while the scheduling algorithms in the last two stages adopt periodic permutation matrices. FSLBS achieves 100% throughput for integrated uni- and multicast traffic without speedup by employing the dedicated M2U and periodic permutation matrix scheduling algorithms. The operation is theoretically validated using the fluid model.
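The first-stage fan-out splitting and second-stage load balancing can be sketched on a toy cell format, a (source, fan-out set, payload) tuple; the real cell format and the periodic permutation matrices of the last stage are omitted:

```python
# Sketch of the first two FSLBS stages on a toy cell format.

def split_multicast(cell):
    """Stage 1 (M2U): one multicast cell becomes one unicast cell per
    destination port in its fan-out set."""
    src, fanout, payload = cell
    return [(src, dst, payload) for dst in fanout]

def load_balance(cells, n_middle):
    """Stage 2: spread the unicast cells evenly over middle-stage ports."""
    return {m: cells[m::n_middle] for m in range(n_middle)}
```

After splitting, all traffic entering the last two stages is unicast, which is what lets the architecture reuse the load-balanced Birkhoff-von Neumann machinery without speedup.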
A class of rapid algorithms for independent component analysis (ICA) is presented. The method utilizes multi-step past information, with respect to an existing fixed-point scheme, to increase non-Gaussianity; this can be viewed as the addition of a variable-size momentum term. The use of past information comes from the idea of surrogate optimization. There is little additional cost in either software design or runtime execution when past information is included. The speed of the algorithm is evaluated on both simulated and real-world data. The real-world data include color images and electroencephalograms (EEGs), an important source of data on human-computer interaction. From these experiments, it is found that the method presented here, RapidICA, performs quickly, especially for the demixing of super-Gaussian signals.
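The momentum idea, stripped of the ICA contrast function, is a fixed-point iteration that feeds back the difference of the two most recent iterates. An illustrative scalar example on x = cos(x), used here only as a stand-in for the fixed-point ICA update, not RapidICA itself:

```python
# Fixed-point iteration with a momentum term built from past iterates,
# illustrated on the scalar fixed point x = cos(x). The momentum weight
# mu is an assumed constant; RapidICA varies the term's size.
import math

def fixed_point_momentum(mu=0.3, steps=100):
    x_prev, x = 0.0, 1.0
    for _ in range(steps):
        x_new = math.cos(x) + mu * (x - x_prev)   # past-information term
        x_prev, x = x, x_new
    return x
```

The momentum term costs one extra stored iterate and one multiply-add per step, matching the abstract's point that using past information adds little design or runtime cost.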
We present a comprehensive mathematical framework establishing the foundations of holographic quantum computing, a novel paradigm that leverages holographic phenomena to achieve superior error correction and algorithmic efficiency. We rigorously demonstrate that quantum information can be encoded and processed using holographic principles, establishing fundamental theorems characterizing the error-correcting properties of holographic codes. We develop a complete set of universal quantum gates with explicit constructions and prove exponential speedups for specific classes of computational problems. Our framework demonstrates that holographic quantum codes achieve a code rate scaling as O(1/log n), superior to traditional quantum LDPC codes, while providing inherent protection against errors via the geometric properties of the code structures. We prove a threshold theorem establishing that arbitrary quantum computations can be performed reliably when physical error rates fall below a constant threshold. Notably, our analysis suggests that certain algorithms, including those involving high-dimensional state spaces and long-range interactions, achieve exponential speedups over both classical and conventional quantum approaches. This work establishes the theoretical foundations for a new approach to quantum computation that provides natural fault tolerance and scalability, directly addressing longstanding challenges of the field.
Funding: The National Key Research and Development Program under contract No. 2019YFC1407700; the National Natural Science Foundation of China under contract Nos 51779022, 52071057 and 51809053.
Abstract: This study presents an efficient Boussinesq-type wave model accelerated by a single Graphics Processing Unit (GPU). The model uses a hybrid finite volume and finite difference method to solve the weakly dispersive and nonlinear Boussinesq equations in the horizontal plane, giving it the shock-capturing ability to handle breaking waves and moving shorelines properly. The code is written in CUDA C. To achieve better performance, the model uses the cyclic reduction technique to solve massive tridiagonal linear systems and overlapped tiling/shared memory to reduce global memory access and enhance data reuse. Four numerical tests are conducted to validate the GPU implementation. The performance of the GPU model is evaluated by running a series of numerical simulations on two GPU platforms with different hardware configurations. Compared with the CPU version, the maximum speedup ratios for single-precision and double-precision calculations are 55.56 and 32.57, respectively.
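Cyclic reduction solves a tridiagonal system by repeatedly eliminating every second unknown, which is why it maps well onto GPUs. A minimal serial sketch of the technique for systems of size n = 2^k - 1, as our illustration and not the paper's CUDA C kernel:

```python
def cyclic_reduction(a, b, c, d):
    """Serial cyclic reduction for a tridiagonal system of size
    n = 2**k - 1.  a: sub-diagonal (a[0] must be 0), b: main diagonal,
    c: super-diagonal (c[-1] must be 0), d: right-hand side."""
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    s = 1
    # Forward phase: each level eliminates every second remaining unknown.
    while 2 * s < n:
        for i in range(2 * s - 1, n, 2 * s):
            alpha = a[i] / b[i - s]
            gamma = c[i] / b[i + s]
            a[i] = -alpha * a[i - s]
            b[i] -= alpha * c[i - s] + gamma * a[i + s]
            c[i] = -gamma * c[i + s]
            d[i] -= alpha * d[i - s] + gamma * d[i + s]
        s *= 2
    # Backward phase: solve the middle equation, then fill in the rest
    # by halving the stride at each level.
    x = [0.0] * n
    while s >= 1:
        for i in range(s - 1, n, 2 * s):
            left = a[i] * x[i - s] if i - s >= 0 else 0.0
            right = c[i] * x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - left - right) / b[i]
        s //= 2
    return x
```

On a GPU each inner loop becomes one kernel launch over independent rows, which is what makes the method attractive there despite doing slightly more arithmetic than the serial Thomas algorithm.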
Abstract: Interference has been measured by the visibility in two-level systems, which, however, does not work for multi-level systems. We generalize a measure of interference based on the decoherence process, consistent with the visibility in qubit systems. Taking cluster states as examples, we show in one-way quantum computation that the gate fidelity is proportional to the interference of the measured qubit and inversely proportional to the interference of all register qubits. We also find that the interference increases with the number of computing steps. We therefore conjecture that interference may be the source of the speedup of one-way quantum computation.
基金funded by National Key R&D Program of China(Grant 2018YFC1505204)the Second Tibetan Plateau Scientific Expedition and Research Program(2019QZKK0902)+1 种基金the National Key R&D Program of China(2020YFD1100701)the National Natural Science Foundation of China(91747207)。
Abstract: Glacier-related mass flows (GMFs) in the high-mountain cryosphere have become more frequent in the last decade, e.g., the 2018 Sedongpu (SDP) GMFs in the Himalayas. Seismic forcing, thermal perturbation and heavy rainfall are common triggers of GMFs, but the exact role of seismic forcing in GMF formation is poorly known owing to the scarcity of observational data from real cases. Here the evolution of the GMFs and the detachment of the trunk glacier in SDP are reconstructed using remote sensing techniques, including feature tracking of multi-source optical satellite imagery and visual interpretation. The reconstruction demonstrates that the high frequency of GMF events in SDP after the Milin earthquake on 18 November 2017 was mainly attributable to earthquake-induced glacial stress changes and destabilisation. The post-earthquake velocity of the trunk glacier is about three times that of December 2016 and December 2017. The median glacier-surface velocity rose to 0.32 m/d between November 2017 and June 2018, 14%-77% higher than the pre-earthquake level; this was initiated by the seismic forcing and then aggravated by the additional loading of ice/rock avalanches, infiltration of liquid water, progressive crevassing of the glacier, and local compressional deformation. The ensuing surge motion of the trunk glacier resulted from high temperature and heavy precipitation between July and September 2018. We infer that the trunk glacier became more sensitive to thermal perturbation after the Milin earthquake, which is the predominant cause of the sudden surge movement. These findings reveal comprehensive mechanisms of quake-induced, low-angle glacial detachment and multisource-driven GMFs in the Himalayas.
Funding: the National Natural Science Foundation of China (Grant Nos. 49775268 and 49823002); the China National Key Development Planni
Abstract: A new version of the Institute of Atmospheric Physics (IAP) 9-Layer (9L) atmospheric general circulation model (AGCM) suitable for Massively Parallel Processors (MPP) has been developed. This paper presents the principles of the parallel code design and examines its performance on a variety of state-of-the-art parallel computers in China. A domain decomposition strategy, implemented with the Message Passing Interface (MPI), is used to achieve parallelism. Only the one-dimensional domain decomposition algorithm is shown to scale favorably as the number of processors increases.
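A one-dimensional domain decomposition amounts to handing each MPI rank one contiguous band of the grid along a single axis. A minimal index-partition sketch; the function name and the remainder policy are our assumptions, not the model's actual code:

```python
def decompose_1d(n_points, n_procs, rank):
    """Contiguous 1-D decomposition of `n_points` grid rows across
    `n_procs` MPI ranks: nearly equal slices, with any remainder
    spread one extra row at a time over the lowest-numbered ranks.
    Returns the half-open index range [start, stop) owned by `rank`."""
    base, extra = divmod(n_points, n_procs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size
```

The ranges tile the grid exactly, so each rank can compute its band independently and exchange only boundary rows with its two neighbors.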
Abstract: A new randomized parallel branch-and-bound (B&B) algorithm is presented, based on the similarity between heuristic search and statistics, and tested on a transputer network. The test results show that the algorithm has a high speedup ratio, reliability, flexibility and fault tolerance.
Abstract: The MacCormack explicit scheme and the Baldwin-Lomax algebraic turbulence model are employed to solve the axisymmetric compressible Navier-Stokes equations for the numerical simulation of a supersonic flow field interacting with transverse injection at the base of a cone. A temperature switch function must be added to the artificial viscosity model suggested by Jameson et al. to enhance the scheme's ability to eliminate oscillations for some injection cases. Typical code optimization techniques for vectorization, together with some useful concepts and terminology for multiprocessing on the YH-2 parallel supercomputer, are explained with examples. After reconstruction and optimization, the code achieves a speedup of 5.973 on the YH-1 pipeline computer, and speedups of 1.886 for 2 processors and 3.545 for 4 processors on the YH-2 parallel supercomputer using the domain decomposition method.
Abstract: Nonlinear multisplitting methods are known as parallel iterative methods for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model in which the iteration is executed asynchronously: each processor calculates the solution of an individual nonlinear system belonging to its nonlinear multisplitting and can update the global approximation residing in shared memory at any time. A local convergence analysis of this model is presented. Finally, we give a numerical example which exhibits the 'strange' property that the speedup Sp > p and the efficiency Ep > 1.
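The multisplitting idea can be illustrated on a toy two-variable system, with each "processor" Newton-solving its own equation while the other unknown is frozen at the last global approximation. This synchronous sketch is our own illustration; the paper's model is asynchronous, with processors updating the shared approximation at arbitrary times:

```python
def solve_block(g, dg, z, inner=8):
    """One-variable Newton solve used by each processor on its block."""
    for _ in range(inner):
        z -= g(z) / dg(z)
    return z

def nonlinear_multisplitting(x0, y0, sweeps=30):
    """Synchronous multisplitting sketch for the hypothetical system
    F(x, y) = (x**2 + y - 2, x + y**2 - 2), whose solution is (1, 1).
    Each processor Newton-solves its own equation with the other
    unknown frozen; the global approximation is then updated."""
    x, y = x0, y0
    for _ in range(sweeps):
        x_new = solve_block(lambda z: z * z + y - 2, lambda z: 2 * z, x)
        y_new = solve_block(lambda z: x + z * z - 2, lambda z: 2 * z, y)
        x, y = x_new, y_new           # synchronous global update
    return x, y
```

Near the solution the outer error roughly halves per sweep for this system, so a few dozen sweeps suffice; the asynchronous variant replaces the lock-step update with writes to shared memory at any time.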
Abstract: Increasing needs for the study of complex dynamical systems require computing solutions of a large number of ordinary and partial differential time-dependent equations in near real-time. Numerical integration algorithms, which are computationally expensive and inherently sequential, are typically used to compute these solutions, which presents challenges for studying complex dynamical systems in near real-time. This paper examines the challenges of computing solutions of ordinary differential time-dependent equations using the Parareal algorithm, which belongs to the class of parallel-in-time algorithms, on various high-performance accelerator-based architectures and associated programming models. The paper presents the code refactoring steps and performance analysis of the Parareal algorithm on two accelerator architectures, the Intel Xeon Phi CPU and Graphics Processing Unit (GPU) many-core architectures, with the OpenMP, OpenACC, and CUDA programming models. Speedup and scaling analyses demonstrate the suitability of Parareal for computing solutions of a single ordinary differential time-dependent equation and of a family of interdependent ordinary differential time-dependent equations. The speedup and weak and strong scaling results show that GPUs with the CUDA programming model are the most efficient accelerator for computing such solutions with parallel-in-time algorithms. Considering the time and effort required to refactor the code for the accelerator architectures, GPUs with the OpenACC programming model are the most efficient choice.
Abstract: As high-speed switches scale, achieving 100% throughput for multicast scheduling requires unbounded speedup and an exponentially growing number of virtual queues; this paper addresses the resulting low multicast throughput under no speedup or a fixed crosspoint buffer size. Inspired by the load-balanced two-stage Birkhoff-von Neumann architecture, which provides 100% throughput for all kinds of unicast traffic, a novel 3-stage architecture is proposed, consisting of a first stage for multicast fan-out splitting, a second stage for load balancing, and a last stage for switching (FSLBS). A dedicated multicast fan-out splitting to unicast (M2U) scheduling algorithm is developed for the first stage, while the scheduling algorithms of the last two stages adopt periodic permutation matrices. FSLBS achieves 100% throughput for integrated uni- and multicast traffic without speedup by employing the dedicated M2U and periodic permutation matrix scheduling algorithms. The operation is theoretically validated using the fluid model.
Abstract: A class of rapid algorithms for independent component analysis (ICA) is presented. The method utilizes multi-step past information, relative to an existing fixed-point scheme, to increase non-Gaussianity; this can be viewed as the addition of a variable-size momentum term. The use of past information comes from the idea of surrogate optimization. There is little additional cost in either software design or runtime execution when past information is included. The speed of the algorithm is evaluated on both simulated and real-world data. The real-world data include color images and electroencephalograms (EEGs), an important source of data on human-computer interaction. From these experiments, it is found that the presented method, RapidICA, performs quickly, especially for the demixing of super-Gaussian signals.
Abstract: We present a comprehensive mathematical framework establishing the foundations of holographic quantum computing, a novel paradigm that leverages holographic phenomena to achieve superior error correction and algorithmic efficiency. We rigorously demonstrate that quantum information can be encoded and processed using holographic principles, establishing fundamental theorems characterizing the error-correcting properties of holographic codes. We develop a complete set of universal quantum gates with explicit constructions and prove exponential speedups for specific classes of computational problems. Our framework demonstrates that holographic quantum codes achieve a code rate scaling as O(1/log n), superior to traditional quantum LDPC codes, while providing inherent protection against errors via geometric properties of the code structures. We prove a threshold theorem establishing that arbitrary quantum computations can be performed reliably when physical error rates fall below a constant threshold. Notably, our analysis suggests that certain algorithms, including those involving high-dimensional state spaces and long-range interactions, achieve exponential speedups over both classical and conventional quantum approaches. This work establishes the theoretical foundations for a new approach to quantum computation that provides natural fault tolerance and scalability, directly addressing long-standing challenges in the field.