Funding: This research was supported by the Natural Science Foundation of China under grant number 61872422 and the Natural Science Foundation of Zhejiang Province, China, under grant number LY19F020028.
Abstract: Because the concurrent L1-minimization (L1-min) problem often arises in real applications, we investigate how to solve it in parallel on GPUs. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication Ax and a novel self-adaptive thread implementation of the matrix-vector multiplication A^T x on the GPU. Vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, based on these kernels, the iterative shrinkage-thresholding algorithm is used to build two concurrent L1-min solvers, one from the perspective of streams and one from the perspective of thread blocks on a GPU, and their performance is optimized with newer GPU features such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver for multiple GPUs. The experimental results validate the effectiveness and good performance of the proposed methods.
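As a rough illustration of the iterative shrinkage-thresholding algorithm named in this abstract, the sketch below runs it sequentially in plain Python. The abstract does not state the exact objective, so the Lagrangian form 0.5*||Ax - b||^2 + lam*||x||_1, the fixed step size, and all function names here are assumptions for illustration; this is not the authors' GPU solver. The two matrix-vector products (Ax and A^T x) are the operations the paper implements as GPU kernels.

```python
def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t (proximal map of t * ||.||_1)."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def matvec(A, x):
    """Compute A x for a dense row-major matrix A."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def matvec_t(A, y):
    """Compute A^T y without forming the transpose."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def ista(A, b, lam, step, iters=200):
    """Assumed objective: 0.5 * ||A x - b||^2 + lam * ||x||_1.

    step should not exceed 1 / (largest eigenvalue of A^T A) for convergence.
    """
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        grad = matvec_t(A, residual)  # gradient of the smooth term
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, grad)], lam * step)
    return x
```

With A equal to the 2x2 identity, b = [3.0, 0.5], lam = 1.0, and step = 1.0, the iteration settles at [2.0, 0.0]: the large entry is shrunk by lam and the small one is set to zero, which is the sparsity-promoting behavior L1-min relies on.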
Funding: Supported by the School of Energy Resources at the University of Wyoming. The GPU hardware used in this study was purchased using NSF Grant EAR-0930040.
Abstract: We have ported an arbitrary high-order discontinuous Galerkin method for solving the three-dimensional isotropic elastic wave equation on unstructured tetrahedral meshes to multiple Graphics Processing Units (GPUs) using NVIDIA's Compute Unified Device Architecture (CUDA) and the Message Passing Interface (MPI). We obtained a speedup factor of about 28.3 for the single-precision version of our code and about 14.9 for the double-precision version. The GPU used in the comparisons is an NVIDIA Tesla C2070 (Fermi), and the CPU is an Intel Xeon W5660. To overlap inter-process communication with computation effectively, we separate the elements of each subdomain into inner and outer elements, complete the computation on the outer elements first, and fill the MPI buffer. While the MPI messages travel across the network, the GPU performs the computation on the inner elements and all other calculations that do not use information about outer elements from neighboring subdomains. A significant portion of the speedup also comes from a customized matrix-matrix multiplication kernel, which is used extensively throughout our program. Preliminary performance analysis of our parallel GPU code shows favorable strong and weak scalability.
Abstract: The Scale Invariant Feature Transform (SIFT) is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, which makes it infeasible for a single-threaded implementation to extract local feature descriptors from high-resolution images in real time. In this paper, an approach to parallelizing the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages: (a) algorithm design, covering generic design strategies that focus on data, and (b) implementation design, covering architecture-specific design strategies that focus on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches, and data blocking achieve a significant decrease in average computational time. Furthermore, observations made with the Paraver tools show that our approach to parallelization, while optimizing for maximum occupancy, allows the GPU to execute the memory-bound SIFT algorithm at optimal levels.
Abstract: Sparse matrix-vector multiplication (SpMV) is one of the most essential operations in linear algebra, and predicting its performance on GPUs has attracted increasing attention in recent years. In 2012, Guo and Wang put forward a new idea for predicting the performance of SpMV on GPUs. However, they did not fully consider the matrix structure, so the execution time predicted by their model tends to be inaccurate for general sparse matrices. To address this problem, we propose two new, similar models that take the structure of the matrices into account and make the performance prediction more accurate. In addition, we use the new models to predict the execution time of SpMV for the CSR-V, CSR-S, ELL, and JAD sparse matrix storage formats on the CUDA platform. Our experimental results show that, for most general matrices, the prediction accuracy of our models is on average 1.69 times better than that of Guo and Wang's model.
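For readers unfamiliar with the storage formats listed in this abstract, here is a minimal sequential sketch of SpMV over the compressed sparse row (CSR) layout. The function names are hypothetical, and reading CSR-S and CSR-V as the scalar (one thread per row) and vector (one warp per row) GPU kernels is an assumption based on common usage, not something the abstract states.

```python
def csr_from_dense(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A x. A scalar-style GPU kernel would give each row i its own thread."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

The inner loop length differs from row to row, which is one reason the matrix structure matters for any execution-time prediction: rows with very different nonzero counts lead to uneven work per thread.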
Funding: Supported by the National Natural Science Foundation of China (No. 61203172), the SSTP of Sichuan (Nos. 2018YYJC0994 and 2017JY0011), and the Shenzhen STPP (No. GJHZ20160301164521358).
Abstract: The parallel computation capabilities of modern graphics processing units (GPUs) have attracted increasing attention from researchers and engineers conducting studies that demand high computational throughput. However, current engineering solutions based on a single GPU often struggle to meet real-time requirements. Thus, the multi-GPU approach has become a popular and cost-effective choice for tackling these demands. In such systems, balancing the computational load over multiple GPU "nodes" is often the key factor and the bottleneck that affects the quality and performance of the real-time system. Existing load balancing approaches mainly assume that all GPU nodes in the same computer framework have equal computational performance, which is often not the case because of cluster design and other legacy issues. This paper presents a novel dynamic load balancing (DLB) model for rapid data division and allocation on heterogeneous GPU nodes based on an innovative fuzzy neural network (FNN). A 5-state parameter feedback mechanism defining the overall cluster and node performance is proposed. The corresponding FNN-based DLB model is capable of monitoring and predicting individual node performance under different workload scenarios. A real-time adaptive scheduler has been devised to reorganize the data inputs to each node when necessary to maintain their runtime computational performance. The devised model has been implemented on two-dimensional (2D) discrete wavelet transform (DWT) applications for evaluation. Experimental results show that this DLB model enables high computational throughput while meeting the real-time and precision requirements of complex computational tasks.
Abstract: Particle accelerators play an important role in a wide range of scientific discoveries and industrial applications. Self-consistent multi-particle simulation based on the particle-in-cell (PIC) method has been used to study charged particle beam dynamics inside those accelerators. However, PIC simulation is time-consuming and requires modern parallel computers for high-resolution applications. In this paper, we implement a parallel beam dynamics PIC code on multi-node hybrid architecture computers with multiple Graphics Processing Units (GPUs). We used two methods to parallelize the PIC code on multiple GPUs and observed that the replication method is a better choice for moderate problem sizes and current computer hardware, while the domain decomposition method might be a better choice for large problem sizes and more advanced computer hardware that allows direct communication among multiple GPUs. Using the multi-node hybrid architectures at the Oak Ridge Leadership Computing Facility (OLCF), the optimized GPU PIC code achieves reasonable parallel performance and scales up to 64 GPUs with 16 million particles.
Abstract: Sparse matrix-vector multiplication (SpMV) is unavoidable in almost all kinds of scientific computation, such as iterative methods for solving linear systems and eigenvalue problems. With the emergence and development of Graphics Processing Units (GPUs), highly efficient formats for SpMV need to be constructed. The performance of SpMV is mainly determined by the storage format of the sparse matrix. Based on the idea of the JAD format, this paper improves the ELLPACK-R format and reduces the waiting time between different threads in a warp, achieving a speedup of about 1.5 in our experiments. Compared with other formats, such as CSR, ELL, and BiELL, our format gives the best SpMV performance on over 70 percent of the test matrices. We also propose a parameter-based method to analyze the performance impact of different formats. In addition, a formula is constructed to count the computation and the number of iterations.
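To make the starting point of this work concrete, the following sketch builds the baseline ELLPACK-R layout (padded value and column arrays plus a per-row length array) and multiplies with it. It is a sequential illustration with hypothetical function names; the JAD-inspired improvement described in the abstract is not specified there and is not reproduced here.

```python
def ellpack_r_from_dense(dense):
    """Build ELLPACK-R arrays: padded values and columns, plus per-row lengths."""
    nonzeros = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    row_len = [len(r) for r in nonzeros]
    width = max(row_len, default=0)
    n = len(dense)
    # Slot-major layout: slot k of every row is stored contiguously, so GPU
    # threads handling neighbouring rows read neighbouring memory locations.
    values = [[nonzeros[i][k][1] if k < row_len[i] else 0.0 for i in range(n)]
              for k in range(width)]
    cols = [[nonzeros[i][k][0] if k < row_len[i] else 0 for i in range(n)]
            for k in range(width)]
    return values, cols, row_len


def ellpack_r_spmv(values, cols, row_len, x):
    """Compute y = A x, using row_len so each row skips its padding."""
    y = []
    for i in range(len(row_len)):
        acc = 0.0
        for k in range(row_len[i]):
            acc += values[k][i] * x[cols[k][i]]
        y.append(acc)
    return y
```

On a GPU, threads in the same warp finish only when the longest of their rows is done, so rows of very different lengths leave some threads idle. That idle time is the "waiting time between different threads in a warp" the abstract refers to.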
Abstract: In our previous work, a novel algorithm for robust pose estimation was presented. The pose was estimated from correspondences between points on the object and regions in the image. The laboratory experiments conducted in the previous work showed that the accuracy of the estimated pose was over 99% for position and 84% for orientation. However, for larger objects, the algorithm requires a large number of points to achieve the same accuracy. This requirement makes the algorithm computationally intensive and therefore infeasible for real-time computer vision applications. In this paper, the algorithm is parallelized to run on NVIDIA GPUs. The results indicate that even for objects with more than 2000 points, the algorithm can estimate the pose in real time for each frame of high-resolution videos.
Funding: Supported by the National Natural Science Foundation of China (No. 22333003).
Abstract: Sparse matrix-vector multiplication (SpMV) is one of the key kernels extensively employed in both industrial and scientific applications, and its computation and random memory access incur considerable overhead. To capitalize on higher compute rates and data movement efficiency, there have been efforts to use mixed-precision SpMV. However, most existing techniques focus on single-grained precision selection for all matrices. In this work, we concentrate on hierarchical precision selection strategies tailored for irregular matrices, driven by the need to achieve optimal load balancing among thread groups executing on GPUs. Based on the concept of strong connection, we first introduce a novel adaptive row-grained precision selection strategy that surpasses the existing strategy within multi-precision Jacobi methods. Second, our experiments have uncovered a range within which converting double-precision floating-point numbers to single-precision floating-point numbers incurs a loss smaller than the machine precision FLT_EPSILON. This range is used for element-grained precision selection. Subsequently, we propose a hierarchical precision selection storage method based on the compressed sparse row (CSR) format and enhance the CSR-Vector kernel, achieving higher relative speedups and better load balancing than existing methods on a benchmark suite of 41 matrices. Finally, we integrate the mixed-precision SpMV into the generalized minimal residual method (GMRES), achieving faster execution while maintaining convergence accuracy similar to that of double-precision GMRES.
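The element-grained test mentioned in this abstract can be pictured with a small check of how much a value changes when it is rounded from double to single precision. The sketch below assumes the loss is measured as an absolute difference and compared against FLT_EPSILON (2^-23 for IEEE 754 single precision); the abstract does not say whether the loss is absolute or relative, and the function names are hypothetical.

```python
import struct

FLT_EPSILON = 2.0 ** -23  # value of C's FLT_EPSILON for IEEE 754 single precision


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fits_single(x):
    """True if storing x in single precision loses less than FLT_EPSILON (absolute)."""
    try:
        return abs(x - to_float32(x)) < FLT_EPSILON
    except OverflowError:  # magnitude too large to pack as single precision
        return False


def split_by_precision(values):
    """Partition matrix entries into those kept in single and in double precision."""
    single = [v for v in values if fits_single(v)]
    double = [v for v in values if not fits_single(v)]
    return single, double
```

For example, 0.5 is exactly representable in single precision and passes the check, while 16777217.0 (2^24 + 1) rounds to a neighboring value one unit away and would stay in double precision under this assumed criterion.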
Funding: Supported by the National Key Research and Development Program (Grant No. 2022YFB4501404), the Beijing Natural Science Foundation (4232036), and the CAS Project for Youth Innovation Promotion Association.
Abstract: With the development of deep learning, hardware accelerators represented by GPUs have been used to accelerate the execution of deep learning applications. A key problem in GPU clusters is how to schedule various deep learning applications, including training applications and latency-critical inference applications, to achieve optimal system performance. In cloud datacenters, inference applications often require fewer resources, and running a single inference application exclusively on a GPU can waste a significant share of its resources. Existing work mainly focuses on co-locating multiple inference applications in datacenters using MPS (Multi-Process Service). This execution pattern has several problems: datacenters may remain in a low-workload state for long periods because of the diurnal pattern of inference applications, MPS-based data sharing can lead to interaction errors between contexts, and resource contention may cause Quality of Service (QoS) violations. To solve these problems, we propose ArkGPU, a runtime system that dynamically allocates resources. ArkGPU can improve the resource utilization of the cluster while guaranteeing the QoS of inference applications. ArkGPU comprises a performance predictor, a scheduler, a resource limiter, and an adjustment unit. We conduct extensive experiments on the NVIDIA V100 GPU to verify the effectiveness of ArkGPU. ArkGPU achieves high goodput for latency-critical applications, with an average throughput increase of 584.27% compared to MPS. When multiple applications are deployed simultaneously on ArkGPU, goodput improves by 94.98% compared to k8s-native and by 38.65% compared to MPS.
Abstract: Tucker decomposition is one of the most popular models for analyzing and compressing large-scale tensorial data. Existing Tucker decomposition algorithms usually rely on a single solver to compute the factor matrices and the intermediate tensor in a predetermined order, and they are not flexible enough to adapt to the diversity of input data and hardware. Moreover, to exploit highly efficient matrix multiplication kernels, most Tucker decomposition implementations rely on explicit matricizations, which can introduce extra data conversion costs. In this paper, we present a-Tucker, a new framework for input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. A two-level flexible Tucker decomposition algorithm is proposed to enable switching between different calculation orders and different factor solvers, and a machine-learning adaptive order-solver selector is applied to cope automatically with changes in the application scenario. To further improve performance, we implement a-Tucker in a fully matricization-free manner, without any conversion between tensors and matrices. Experiments show that a-Tucker can substantially outperform existing works while keeping similar accuracy on a variety of synthetic and real-world tensors.
Funding: Supported by the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20210607), the Postgraduate Scientific Research Innovation Project of Xiangtan University (No. XDCX2021B110), the National Science Foundation of China (No. 11971472), the Excellent Youth Foundation of SINOPEC (No. P20009), and the National Science Foundation of China (No. 11971414).
Abstract: The compositional model is often used to describe multicomponent multiphase porous media flows in the petroleum industry. The fully implicit method, with strong stability and weak constraints on time-step sizes, is commonly used in mainstream commercial reservoir simulators. In this paper, we develop an efficient multistage preconditioner for fully implicit compositional flow simulation. The method employs an adaptive setup phase to improve the parallel efficiency on GPUs. Furthermore, a multicolor Gauss-Seidel algorithm based on the adjacency matrix is applied in the algebraic multigrid method for the pressure part. Numerical results demonstrate that the proposed algorithm achieves good parallel speedup while yielding the same convergence behavior as the corresponding sequential version.
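The general idea behind a multicolor Gauss-Seidel smoother can be sketched as follows. This is an illustrative CPU version in NumPy under simplifying assumptions (a dense matrix, a plain greedy coloring, hypothetical function names); it is not the authors' GPU implementation or their coloring strategy. Unknowns that share a color have no direct coupling in the matrix, so each color can be updated in a single vectorized, and in principle parallel, step.

```python
import numpy as np

def greedy_coloring(A):
    """Color the adjacency graph of A so that coupled unknowns get different colors."""
    pattern = (A != 0) | (A.T != 0)          # symmetrized sparsity pattern
    n = A.shape[0]
    colors = -np.ones(n, dtype=int)          # -1 marks "not yet colored"
    for i in range(n):
        neighbors = np.nonzero(pattern[i])[0]
        used = {int(colors[j]) for j in neighbors if j != i and colors[j] >= 0}
        c = 0
        while c in used:                     # smallest color not used by a neighbor
            c += 1
        colors[i] = c
    return colors

def multicolor_gauss_seidel(A, b, x, colors, sweeps=1):
    """Gauss-Seidel sweeps ordered by color; x is updated in place and returned."""
    diag = np.diag(A)
    for _ in range(sweeps):
        for c in range(colors.max() + 1):
            idx = np.nonzero(colors == c)[0]
            # Rows of one color do not couple to each other, so updating them
            # together matches a sequential Gauss-Seidel pass over those rows.
            residual = b[idx] - A[idx] @ x
            x[idx] += residual / diag[idx]
    return x
```

For a tridiagonal 1D Poisson matrix the greedy pass yields the familiar red-black (two-color) ordering, so each sweep consists of two fully parallel half-steps.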
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. U23B2002, 62302238, and 62372245; the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20220388; the Natural Science Research Project of Colleges and Universities in Jiangsu Province of China under Grant No. 22KJB520004; the China Postdoctoral Science Foundation under Grant No. 2022M711689; and the CCF-Tencent Rhino-Bird Open Research Fund under Grant No. CCF-Tencent RAGR20240129.
Abstract: Hash functions are essential in cryptographic primitives such as digital signatures, key exchanges, and blockchain technology. SM3, built upon the Merkle-Damgård structure, is a crucial element in Chinese commercial cryptographic schemes. Optimizing hash function performance is crucial given the growth of Internet of Things (IoT) devices and the rapid evolution of blockchain technology. In this paper, we introduce HI-SM3, a high-performance implementation framework for accelerating the SM3 cryptographic hash function using heterogeneous GPU (graphics processing unit) parallel computing devices. HI-SM3 enhances the implementation of hash functions across four dimensions: parallelism, register utilization, memory access, and instruction efficiency, resulting in significant performance gains across various GPU platforms. On the NVIDIA RTX 4090 GPU, HI-SM3 achieves a peak performance of 454.74 GB/s, surpassing OpenSSL on a high-end server CPU (E5-2699V3) with 16 cores by over 150 times. On the Hygon DCU accelerator, a Chinese domestic graphics card, it achieves 113.77 GB/s. Furthermore, compared with the fastest known GPU-based SM3 implementation, HI-SM3 on the same GPU platform exhibits a 3.12x performance improvement. Even on embedded GPUs consuming less than 40 W, HI-SM3 attains a throughput of 5.90 GB/s, which is twice as high as that of a server-level CPU. In summary, HI-SM3 provides a significant performance advantage, positioning it as a compelling solution for accelerating hash operations.
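As background on the Merkle-Damgård structure mentioned above, the sketch below shows only the message padding and block-splitting step that precedes SM3's compression function: a single 1 bit, zero bits up to 448 mod 512, then the 64-bit big-endian message length. The function names are hypothetical, the compression function itself is omitted, and nothing here reflects how HI-SM3 organizes this work on a GPU.

```python
import struct

BLOCK_BYTES = 64  # SM3 processes the message in 512-bit blocks

def md_pad(message: bytes) -> bytes:
    """Pad a byte string to a multiple of 64 bytes, Merkle-Damgård style."""
    bit_len = len(message) * 8
    padded = message + b"\x80"                       # the single 1 bit, then 7 zero bits
    padded += b"\x00" * ((56 - len(padded) % BLOCK_BYTES) % BLOCK_BYTES)
    padded += struct.pack(">Q", bit_len)             # 64-bit big-endian length
    return padded

def split_blocks(padded: bytes) -> list:
    """Split padded data into the 64-byte blocks fed to the compression function."""
    return [padded[i:i + BLOCK_BYTES] for i in range(0, len(padded), BLOCK_BYTES)]
```

Within one message the blocks are chained through the compression function and must be processed in order, so batch throughput on parallel hardware typically comes from hashing many independent messages at once.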