Given that the concurrent L1-minimization (L1-min) problem is often required in real applications, we investigate how to solve it in parallel on GPUs in this paper. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication Ax and a novel self-adaptive thread implementation of the transposed matrix-vector multiplication A^T x on the GPU. Vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, based on these kernels, the iterative shrinkage-thresholding algorithm is used to build two concurrent L1-min solvers, one organized around streams and one around thread blocks on a GPU, and their performance is optimized using newer GPU features such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver on multiple GPUs. The experimental results validate the effectiveness and good performance of the proposed methods.
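A minimal sketch of a warp-per-row matrix-vector product with a shuffle-based reduction, illustrating two of the ingredients named above (the shuffle instruction and the read-only data cache). The paper's self-adaptive kernels and decision trees are more elaborate; this only shows the basic pattern and is not the authors' code.

```cuda
// Sketch (not the paper's code): one warp computes one row of y = A*x for a
// dense, row-major A. Lanes stride across the row, then partial sums are
// combined with shuffle instructions, avoiding shared memory.
__global__ void warp_gemv(const float* A, const float* x, float* y,
                          int rows, int cols)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x & 31;
    if (warp_id >= rows) return;

    float sum = 0.0f;
    for (int j = lane; j < cols; j += 32)
        sum += A[(size_t)warp_id * cols + j] * __ldg(&x[j]);  // read-only data cache

    // warp-level reduction with the shuffle instruction
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}
```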
We have successfully ported an arbitrary high-order discontinuous Galerkin method for solving the three-dimensional isotropic elastic wave equation on unstructured tetrahedral meshes to multiple Graphics Processing Units (GPUs) using NVIDIA's Compute Unified Device Architecture (CUDA) and the Message Passing Interface (MPI), obtaining a speedup factor of about 28.3 for the single-precision version of our codes and about 14.9 for the double-precision version. The GPU used in the comparisons is an NVIDIA Tesla C2070 (Fermi), and the CPU is an Intel Xeon W5660. To effectively overlap inter-process communication with computation, we separate the elements on each subdomain into inner and outer elements, and complete the computation on outer elements and fill the MPI buffers first. While the MPI messages travel across the network, the GPU performs the computation on inner elements and all other calculations that do not require information from neighboring subdomains. A significant portion of the speedup also comes from a customized matrix-matrix multiplication kernel, which is used extensively throughout our program. Preliminary performance analysis of our parallel GPU codes shows favorable strong and weak scalability.
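The inner/outer-element overlap can be summarized in a short host routine. This is only a sketch of the pattern under stated assumptions (kernel and buffer names are illustrative and the packing of the send buffer is omitted); it is not the authors' code.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder element kernel standing in for the DG volume/flux computation.
__global__ void update_elements(float* q, const int* elem, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[elem[i]] += 1.0f;   // dummy update
}

// Overlap pattern: outer elements first -> start MPI exchange -> inner elements
// meanwhile. d_send is assumed to already hold the packed outer-element data.
void timestep(float* d_q, const int* d_outer, int n_outer,
              const int* d_inner, int n_inner,
              float* h_send, float* h_recv, float* d_send, float* d_recv,
              int halo_count, int neighbor, cudaStream_t s, MPI_Comm comm)
{
    update_elements<<<(n_outer + 255) / 256, 256, 0, s>>>(d_q, d_outer, n_outer);
    cudaMemcpyAsync(h_send, d_send, halo_count * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);                      // halo data ready on the host

    MPI_Request reqs[2];
    MPI_Irecv(h_recv, halo_count, MPI_FLOAT, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(h_send, halo_count, MPI_FLOAT, neighbor, 0, comm, &reqs[1]);

    // Inner elements need no remote data, so they overlap with the messages.
    update_elements<<<(n_inner + 255) / 256, 256, 0, s>>>(d_q, d_inner, n_inner);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpyAsync(d_recv, h_recv, halo_count * sizeof(float),
                    cudaMemcpyHostToDevice, s);    // apply the halo afterwards
}
```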
The Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for a single-threaded implementation to extract local feature descriptors from high-resolution images in real time. In this paper, an approach to parallelization of the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages: a) algorithm design, with generic strategies that focus on the data, and b) implementation design, with architecture-specific strategies that focus on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches and applying data blocking achieve a significant decrease in average computational time. Furthermore, it is observed via the Paraver tools that our approach to parallelization, while optimizing for maximum occupancy, allows the GPU to execute the memory-bound SIFT algorithm at optimal levels.
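As an illustration of the data-blocking strategy, a shared-memory tile with a halo can back the horizontal pass of the separable Gaussian blur used when building the SIFT scale space. This is a generic sketch, not the paper's kernel; RADIUS, TILE, and the kernel name are placeholders.

```cuda
// Illustrative data-blocking kernel: horizontal pass of a separable Gaussian
// blur. Each block stages a tile of the row plus a halo in shared memory so
// global loads are reused. Launch with blockDim.x == TILE, gridDim.y == height.
#define RADIUS 4
#define TILE   128

__global__ void gauss_row(const float* in, float* out, const float* k,
                          int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS];
    int row = blockIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    // Load the tile plus the left/right halo, clamping at the image border.
    int load = min(max(col - RADIUS, 0), width - 1);
    tile[threadIdx.x] = in[row * width + load];
    if (threadIdx.x < 2 * RADIUS) {
        int load2 = min(max(col - RADIUS + TILE, 0), width - 1);
        tile[threadIdx.x + TILE] = in[row * width + load2];
    }
    __syncthreads();

    if (col < width) {
        float acc = 0.0f;
        for (int t = -RADIUS; t <= RADIUS; ++t)   // branch-free inner loop
            acc += k[t + RADIUS] * tile[threadIdx.x + RADIUS + t];
        out[row * width + col] = acc;
    }
}
```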
As one of the most essential and important operations in linear algebra, sparse matrix-vector multiplication (SpMV) has seen its performance prediction on GPUs attract more and more attention in recent years. In 2012, Guo and Wang put forward a new idea for predicting the performance of SpMV on GPUs. However, they did not consider the matrix structure completely, so the execution time predicted by their model tends to be inaccurate for general sparse matrices. To address this problem, we propose two new, related models that take the structure of the matrices into account and make the performance prediction more accurate. In addition, we predict the execution time of SpMV for the CSR-V, CSR-S, ELL and JAD sparse matrix storage formats with the new models on the CUDA platform. Our experimental results show that the prediction accuracy of our models is 1.69 times better than Guo and Wang's model on average for most general matrices.
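A hypothetical shape of such a structure-aware timing model is sketched below; the coefficients, the structure term, and the function itself are illustrative assumptions, not the authors' (or Guo and Wang's) actual model.

```cuda
// Hypothetical per-format timing model (coefficients would be fit offline on
// benchmark matrices): predicted time grows with rows, nonzeros, and a matrix
// structure term such as the variation of the row lengths.
struct SpmvModel { float t0, t_row, t_nnz, t_imbalance; };

float predict_time_ms(const SpmvModel& m, int rows, int nnz, float row_len_stddev)
{
    float avg_row = (float)nnz / rows;
    return m.t0 + m.t_row * rows + m.t_nnz * nnz
         + m.t_imbalance * row_len_stddev / (avg_row + 1.0f);
}
```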
The parallel computation capabilities of modern graphics processing units (GPUs) have attracted increasing attention from researchers and engineers conducting high-computational-throughput studies. However, current single-GPU engineering solutions often struggle to fulfill real-time requirements, so the multi-GPU approach has become a popular and cost-effective choice for tackling these demands. In such cases, computational load balancing over multiple GPU "nodes" is often the key bottleneck that affects the quality and performance of the real-time system. Existing load balancing approaches are mainly based on the assumption that all GPU nodes in the same computing framework have equal computational performance, which is often not the case due to cluster design and other legacy issues. This paper presents a novel dynamic load balancing (DLB) model for rapid data division and allocation on heterogeneous GPU nodes based on an innovative fuzzy neural network (FNN). In this research, a 5-state parameter feedback mechanism describing the overall cluster and node performance is proposed. The corresponding FNN-based DLB model is capable of monitoring and predicting individual node performance under different workload scenarios. A real-time adaptive scheduler has been devised to reorganize the data inputs to each node when necessary to maintain their runtime computational performance. The model has been implemented on two-dimensional (2D) discrete wavelet transform (DWT) applications for evaluation. Experimental results show that this DLB model enables a high computational throughput while ensuring the real-time and precision requirements of complex computational tasks.
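A minimal proportional scheduler conveys the basic idea of redistributing work across heterogeneous nodes. The paper's FNN predicts node performance from a 5-state feedback vector; this sketch simply splits the next frame's rows in proportion to recently measured throughput, and all names here are illustrative.

```cuda
#include <vector>

// Split total_rows across GPU nodes in proportion to their measured throughput
// (rows/ms). Assumes the sum of throughputs is positive; the remainder from
// integer truncation is assigned to the last node.
std::vector<int> split_rows(int total_rows, const std::vector<float>& throughput)
{
    float sum = 0.0f;
    for (float t : throughput) sum += t;

    std::vector<int> share(throughput.size());
    int assigned = 0;
    for (size_t i = 0; i < throughput.size(); ++i) {
        share[i] = (int)(total_rows * throughput[i] / sum);
        assigned += share[i];
    }
    share.back() += total_rows - assigned;   // give the remainder to the last node
    return share;
}
```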
Particle accelerators play an important role in a wide range of scientific discoveries and industrial applications. Self-consistent multi-particle simulation based on the particle-in-cell (PIC) method has been used to study charged particle beam dynamics inside those accelerators. However, PIC simulation is time-consuming and needs modern parallel computers for high-resolution applications. In this paper, we implement a parallel beam dynamics PIC code on multi-node hybrid-architecture computers with multiple Graphics Processing Units (GPUs). We used two methods to parallelize the PIC code on multiple GPUs and observed that the replication method is a better choice for moderate problem sizes and current computer hardware, while the domain decomposition method might be a better choice for large problem sizes and more advanced computer hardware that allows direct communication among multiple GPUs. Using the multi-node hybrid architectures at the Oak Ridge Leadership Computing Facility (OLCF), the optimized GPU PIC code achieves reasonable parallel performance and scales up to 64 GPUs with 16 million particles.
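The scatter (charge deposition) step is the part of a PIC code that is hardest to parallelize on GPUs, because many particles may write to the same grid cell. A minimal 1D nearest-grid-point sketch using atomics is shown below; it is illustrative only and not the authors' implementation.

```cuda
// Each particle deposits its charge onto the nearest grid cell with an atomic
// update (1D, nearest-grid-point weighting for brevity; real codes use
// higher-order shape functions and 3D grids).
__global__ void deposit_charge(const float* x, float q, float dx,
                               float* rho, int n_particles, int n_cells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_particles) return;

    int cell = (int)(x[i] / dx);
    if (cell >= 0 && cell < n_cells)
        atomicAdd(&rho[cell], q);   // atomics resolve particles writing to one cell
}
```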
Sparse matrix-vector multiplication (SpMV) is inevitable in almost all kinds of scientific computation, such as iterative methods for solving linear systems and eigenvalue problems. With the emergence and development of Graphics Processing Units (GPUs), highly efficient formats for SpMV need to be constructed. The performance of SpMV is mainly determined by the storage format of the sparse matrix. Based on the idea of the JAD format, this paper improves the ELLPACK-R format and reduces the waiting time between different threads in a warp, achieving a speedup of about 1.5 in our experiments. Compared with other formats, such as CSR, ELL and BiELL, our format gives the best SpMV performance on over 70 percent of the test matrices. We also propose a parameter-based method to analyze the performance impact of the different formats. In addition, a formula is constructed to estimate the amount of computation and the number of iterations.
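For reference, the baseline ELLPACK-R kernel that such improvements start from assigns one thread per row, stores values column-major for coalesced loads, and uses a per-row length array to skip padding. This is the standard formulation, not the improved format proposed in the paper.

```cuda
// Baseline ELLPACK-R SpMV (one thread per row): the per-row length array rl[]
// lets each thread stop early instead of iterating over padded zeros, and the
// column-major layout keeps loads by consecutive threads coalesced.
__global__ void spmv_ellr(const float* val, const int* col, const int* rl,
                          const float* x, float* y, int rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float sum = 0.0f;
    for (int k = 0; k < rl[row]; ++k) {
        int idx = k * rows + row;          // column-major (coalesced) layout
        sum += val[idx] * x[col[idx]];
    }
    y[row] = sum;
}
```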
In our previous work, a novel algorithm to perform robust pose estimation was presented. The pose was estimated using correspondences between points on the object and regions in the image. The laboratory experiments conducted in the previous work showed that the accuracy of the estimated pose was over 99% for position and 84% for orientation. However, for larger objects the algorithm requires a high number of points to achieve the same accuracy. This requirement makes the algorithm computationally intensive and thus infeasible for real-time computer vision applications. In this paper, the algorithm is parallelized to run on NVIDIA GPUs. The results indicate that even for objects having more than 2000 points, the algorithm can estimate the pose in real time for each frame of high-resolution videos.
In distributed training, increasing the batch size can improve parallelism, but it can also bring many difficulties to the training process and cause training errors. In this work, we investigate the occurrence of training errors in theory and train ResNet-50 on CIFAR-10 using Stochastic Gradient Descent (SGD) and Adaptive moment estimation (Adam), while keeping the total batch size in the parameter server constant and lowering the batch size on each Graphics Processing Unit (GPU). A new method that considers momentum to eliminate training errors in distributed training is proposed. We define a Momentum-like Factor (MF) to represent the influence of former gradients on parameter updates in each iteration. Then, we modify the MF values and conduct experiments to explore how different MF values influence training performance based on SGD, Adam, and Nesterov accelerated gradient. Experimental results reveal that increasing MFs is a reliable method for reducing training errors in distributed training. An analysis of convergence conditions in distributed training, taking into account a large batch size and multiple GPUs, is also presented.
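The role of such a factor can be seen in a plain momentum-SGD update, where the momentum coefficient controls how strongly former gradients influence the current step. The kernel below is illustrative and does not reproduce the authors' exact MF definition.

```cuda
// Plain momentum-SGD parameter update: mu is the knob that weights the history
// of former gradients against the current one -- the kind of influence the
// Momentum-like Factor above is meant to quantify.
__global__ void sgd_momentum_step(float* w, float* v, const float* grad,
                                  float lr, float mu, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    v[i] = mu * v[i] + grad[i];    // accumulate the history of gradients
    w[i] -= lr * v[i];             // apply the smoothed step
}
```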
In this paper, we focus on the graphics processing unit (GPU) and discuss how its architecture affects the choice of algorithm and implementation for fully-implicit petroleum reservoir simulation. To obtain satisfactory performance on new many-core architectures such as GPUs, simulator developers must know a great deal about the specific hardware and spend a lot of time fine-tuning the code. Porting a large petroleum reservoir simulator to emerging hardware architectures is therefore expensive and risky. We analyze the major components of an in-house reservoir simulator and investigate how to port them to GPUs in a cost-effective way. Preliminary numerical experiments show that our GPU-based simulator is robust and effective. More importantly, these numerical results clearly identify the main bottlenecks to obtaining ideal speedup on GPUs and possibly other many-core architectures.
Graphs, which model real-world entities as vertices and the relationships among entities as edges, have proven to be a powerful tool for describing real-world problems in applications. In most real-world scenarios, entities and their relationships are subject to constant change. Graphs that record such changes are called dynamic graphs. In recent years, the widespread application scenarios of dynamic graphs have stimulated extensive research on dynamic graph processing systems that continuously ingest graph updates and produce up-to-date graph analytics results. As the scale of dynamic graphs becomes larger, higher performance requirements are placed on dynamic graph processing systems. With their massive parallel processing power and high memory bandwidth, GPUs have become mainstream vehicles for accelerating dynamic graph processing tasks. GPU-based dynamic graph processing systems mainly address two challenges: maintaining the graph data when updates occur (i.e., graph updating) and producing analytics results in time (i.e., graph computing). In this paper, we survey GPU-based dynamic graph processing systems and review their methods for addressing both graph updating and graph computing. To comprehensively discuss existing dynamic graph processing systems on GPUs, we first introduce the terminology of dynamic graph processing and then develop a taxonomy to describe the methods employed for graph updating and graph computing. In addition, we discuss the challenges and future research directions of dynamic graph processing on GPUs.
We report a high-performance multi-GPU (graphics processing unit) implementation of the Kohn–Sham time-dependent density functional theory (TDDFT) within the Tamm–Dancoff approximation. Our algorithm, which uses multiple parallel models in tandem on massively parallel computing systems, scales optimally with material size, considerably reducing the computational wall time. A benchmark TDDFT study was performed on a green fluorescent protein complex composed of 4353 atoms with 40,518 atomic orbitals represented by Gaussian-type functions, demonstrating the effect of distant protein residues on the excitation. For the largest molecule attempted to date, to the best of our knowledge, the proposed strategy demonstrated reasonably high efficiencies on up to 256 GPUs on a custom-built, state-of-the-art GPU computing system with NVIDIA A100 GPUs. We believe that our GPU-oriented algorithms, which enable first-principles simulation of very large-scale applications, may lead to a deeper understanding of the molecular basis of material behaviors, eventually revealing new possibilities for breakthrough designs of new material systems.
Compute Unified Device Architecture (CUDA) was used to design and implement molecular dynamics (MD) simulations on graphics processing units (GPUs). With an NVIDIA Tesla C870, a 20–60 fold speedup over one core of an Intel Xeon 5430 CPU was achieved, reaching up to 150 Gflops. MD simulations of cavity flow and particle-bubble interaction in liquid were implemented on multiple GPUs using the Message Passing Interface (MPI). Up to 200 GPUs were tested on a special network topology, achieving good scalability. The capability of GPU clusters for large-scale molecular dynamics simulation of meso-scale flow behavior was therefore demonstrated.
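As an illustration of the per-particle force loop such MD codes accelerate, a brute-force Lennard-Jones kernel is sketched below. Production codes use neighbor or cell lists; all parameters are placeholders and this is not the authors' code.

```cuda
// O(N^2) Lennard-Jones force kernel: one thread accumulates the force on one
// particle from all others within the cutoff.
__global__ void lj_forces(const float3* pos, float3* force, int n,
                          float eps, float sigma2, float cutoff2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 f = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 > cutoff2) continue;
        float s2 = sigma2 / r2;                    // (sigma/r)^2
        float s6 = s2 * s2 * s2;                   // (sigma/r)^6
        float coef = 24.0f * eps * s6 * (2.0f * s6 - 1.0f) / r2;  // |F|/r
        f.x += coef * dx;  f.y += coef * dy;  f.z += coef * dz;
    }
    force[i] = f;
}
```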
Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high-performance computing architectures, such as the GPU, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high-performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves performance comparable to the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two sizes. For 3D non-power-of-two FFTs, our library runs 1.5x to 28x faster than FFTW with 4 threads and achieves a 20.01x average speedup over CUFFT 4.0 on a Tesla C2050.
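For orientation only, the CUFFT baseline referred to above can be driven with a few calls; the snippet below shows a minimal in-place single-precision complex transform and is unrelated to the MPFFT implementation itself.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Minimal CUFFT usage: one 1D complex-to-complex transform of length n,
// executed in place on device data.
void forward_fft(cufftComplex* d_data, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);               // batch of 1
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();
    cufftDestroy(plan);
}
```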
A real-time algorithm for constructing contour maps from grid DEM data is presented. It runs completely within the programmable 3D visualization pipeline. The interpolation is parallelized by the rasterizer units of the graphics card, and contour line extraction is parallelized in the pixel shader. During each frame of the rendering, we first generate an elevation gradient map from the original terrain vertex data, then extract the final contour lines with image-space processing, and directly blend the results onto the original scene with alpha-blending to obtain the final scene with a contour map. We implemented this method in our global 3D digital-earth system with the Direct3D 9.0c API and tested it on consumer-level PC platforms. For an arbitrary scene at a given LOD level, the process takes less than 10 ms, giving topologically correct, anti-aliased contour lines.
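The paper performs the per-pixel contour test in a Direct3D 9 pixel shader; the same logic is sketched here as a CUDA kernel purely for readability. A pixel is marked as lying on a contour when the contour band index changes between it and its right or bottom neighbor (the interval and names are illustrative).

```cuda
// Mark contour pixels on an elevation grid: a level line crosses a pixel when
// the integer "band" index floor(h / interval) differs from a neighbor's.
__global__ void mark_contours(const float* elev, unsigned char* mask,
                              int width, int height, float interval)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width - 1 || y >= height - 1) return;

    float h  = elev[y * width + x];
    float hr = elev[y * width + x + 1];        // right neighbor
    float hb = elev[(y + 1) * width + x];      // bottom neighbor

    int b  = (int)floorf(h  / interval);
    int br = (int)floorf(hr / interval);
    int bb = (int)floorf(hb / interval);
    mask[y * width + x] = (b != br || b != bb) ? 255 : 0;
}
```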
In view of the frequent occurrence of floods due to climate change, and the fact that a large calculation domain with complex land types is required for flood simulations, this paper proposes an optimized non-uniform grid model, combined with a high-resolution model based on graphics processing unit (GPU) acceleration, to simulate the surface water flow process. For the grid division, the topographic gradient change is taken as the control variable and different optimization criteria are designed according to different land types. In the numerical model, a Godunov-type method is adopted for the spatial discretization, the TVD-MUSCL and Runge-Kutta methods are used to improve the model's spatial and temporal accuracy, and the simulation time is reduced by leveraging GPU acceleration. The model is applied to ideal and actual case studies. The results show that the numerical model based on a non-uniform grid shows good stability. In the urban inundation simulation, taking approximately 40%–50% of the average urban topographic gradient change as the threshold for the non-uniform grid division optimizes the calculation efficiency and accuracy. In this case, the calculation efficiency of the non-uniform grid based on the optimized parameters is 2–3 times that of the uniform grid, and the approach can be adopted for actual flood simulation in large-scale areas.
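Two illustrative building blocks of such a Godunov/MUSCL-type scheme are sketched below: a minmod slope limiter and an explicit update of a cell average from its face fluxes on a non-uniform grid. These are generic textbook pieces, not the authors' model code.

```cuda
// Minmod slope limiter used in MUSCL-type reconstructions.
__device__ float minmod(float a, float b)
{
    if (a * b <= 0.0f) return 0.0f;
    return (fabsf(a) < fabsf(b)) ? a : b;
}

// Explicit finite-volume update of a cell average: flux[i] and flux[i+1] are
// the numerical fluxes at the left/right faces, dx[i] is the local cell width
// of the non-uniform grid.
__global__ void update_cells(const float* flux, float* h, const float* dx,
                             float dt, int n_cells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_cells) return;
    h[i] -= dt / dx[i] * (flux[i + 1] - flux[i]);
}
```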
Low-Density Parity-Check (LDPC) codes are powerful error-correcting codes adopted by recent communication standards. LDPC decoders are based on belief propagation algorithms, which make use of a Tanner graph and very intensive message-passing computation, and usually require hardware-based dedicated solutions. With the exponential increase in the computational power of commodity graphics processing units (GPUs), new opportunities have arisen for general-purpose processing on GPUs. This paper proposes the use of GPUs for implementing flexible and programmable LDPC decoders. A new stream-based approach is proposed, based on compact data structures to represent the Tanner graph. It is shown that such a challenging application for stream-based computing, with its irregular memory access patterns, memory bandwidth demands and recursive flow control constraints, can be efficiently implemented on GPUs. The proposal was experimentally evaluated by programming LDPC decoders on GPUs using the Caravela platform, a generic interface tool for managing kernel execution regardless of the GPU manufacturer and operating system. Moreover, to put the obtained results in perspective, we have also implemented LDPC decoders on general-purpose processors with Streaming Single Instruction Multiple Data (SIMD) Extensions. Experimental results show that the proposed solution efficiently decodes several codewords simultaneously, reducing the processing time by one order of magnitude.
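A min-sum check-node update over a compact, CSR-like edge list conveys the kind of irregular, message-passing computation being mapped to the GPU; the structure and names below are illustrative and do not reproduce the paper's data layout.

```cuda
// Min-sum check-node update: each check node sends to every neighbor the
// product of the other neighbors' signs times the minimum of their magnitudes.
// edge_ptr[c]..edge_ptr[c+1] indexes the messages on the edges of check node c.
__global__ void check_node_update(const int* edge_ptr, const float* in_msg,
                                  float* out_msg, int n_checks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n_checks) return;

    int begin = edge_ptr[c], end = edge_ptr[c + 1];
    for (int e = begin; e < end; ++e) {
        float sign = 1.0f, min_mag = 1e30f;
        for (int k = begin; k < end; ++k) {
            if (k == e) continue;                  // exclude the target edge
            float m = in_msg[k];
            sign   *= (m < 0.0f) ? -1.0f : 1.0f;
            min_mag = fminf(min_mag, fabsf(m));
        }
        out_msg[e] = sign * min_mag;
    }
}
```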