Abstract: Sparse matrix-vector multiplication (SpMV) is the computational core and bottleneck of sparse linear systems; its efficiency affects the overall performance of iterative solvers, and its optimization has long been a research focus in scientific computing and engineering applications. Discretizing partial differential equations produces sparse diagonal matrices whose diverse nonzero distributions mean that no single method achieves the best time performance on all matrices. To address this problem, an adaptive SpMV optimization method for sparse diagonal matrices on graphics processing units (GPUs), AST (Adaptive SpMV Tuning), is proposed. The method designs a feature space and builds a feature extractor to capture fine-grained structural features of a matrix; by analyzing the correlation between features and SpMV methods in depth, it establishes an extensible set of candidate methods, forms a mapping from features to the optimal method, and builds a performance prediction tool that efficiently predicts the optimal method for a given matrix. Experimental results show that AST achieves 85.8% prediction accuracy with an average time-performance loss of 0.09. Compared with DIA (Diagonal), HDIA (Hacked DIA), HDC (Hybrid of DIA and Compressed Sparse Row), DIA-Adaptive, and DRM (Divide-Rearrange and Merge), it attains average kernel-runtime speedups of 20.19x, 1.86x, 3.06x, 3.72x, and 1.53x, and floating-point performance speedups of 1.05x, 1.28x, 12.45x, 1.94x, and 0.97x, respectively.
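For readers unfamiliar with the DIA baseline that AST and its competitors build on, the format stores each populated diagonal as a dense row together with that diagonal's offset from the main diagonal. The minimal NumPy sketch below is illustrative only: the paper's methods are GPU kernels, and the layout here follows the scipy.sparse `dia_matrix` convention (diagonals aligned by column index), not any code from the paper.

```python
import numpy as np

def dia_spmv(data, offsets, x):
    """y = A @ x for an n x n matrix stored in DIA (diagonal) format.

    data[k, :] holds diagonal k, aligned by column index, so that
    A[i, i + offsets[k]] == data[k, i + offsets[k]];
    offsets[k] is that diagonal's offset (0 = main, >0 = super, <0 = sub).
    """
    n = x.shape[0]
    y = np.zeros(n)
    for k, off in enumerate(offsets):
        # rows i where column j = i + off stays inside the matrix
        i = np.arange(max(0, -off), min(n, n - off))
        y[i] += data[k, i + off] * x[i + off]
    return y
```

Because every diagonal is a contiguous strided read, this layout maps naturally to coalesced GPU memory access, which is why DIA-family formats are attractive for the stencil-like matrices that PDE discretizations produce.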
Abstract: Clustering is a key technique for analyzing large-scale, high-dimensional vector data. In recent years, the density-based clustering algorithm DBSCAN (density-based spatial clustering of applications with noise) has been widely used in data analysis because it requires no preset number of clusters, can discover complex cluster structures, and effectively identifies noise points. However, existing density-based clustering algorithms incur extremely high time costs on high-dimensional vector data and suffer from the curse of dimensionality, making them difficult to deploy in practice. Moreover, as information technology develops, high-dimensional vector data are growing rapidly in scale, and CPU-based high-dimensional vector clustering faces even greater challenges in time cost and scalability. To this end, a GPU-accelerated high-dimensional vector clustering algorithm is proposed, which accelerates DBSCAN by introducing a K-nearest-neighbor (KNN) graph index. First, a GPU-accelerated parallel KNN-graph construction algorithm is designed, significantly reducing the cost of building the KNN-graph index. Second, an inter-level-parallel K-means tree partitioning algorithm and a parallel clustering algorithm based on breadth-first search and the core-neighbor graph are proposed, restructuring the computation flow of DBSCAN and enabling highly concurrent vector clustering. Finally, extensive experiments on real vector datasets compare the proposed method with existing methods. The results show that, while preserving clustering accuracy, the proposed method improves the efficiency of large-scale vector clustering by 5.7x to 2822.5x.
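The core idea of driving DBSCAN from a KNN-graph index can be sketched on the CPU: a point counts as a core point if enough of its K nearest neighbors fall within eps, and clusters are grown by breadth-first search over core points' neighbor lists. The sketch below is an illustrative approximation under those assumptions, in pure Python with no GPU parallelism; all names are hypothetical, and it does not reproduce the paper's K-means tree partitioning or its core-neighbor graph construction.

```python
from collections import deque

def knn_graph_dbscan(knn, dists, eps, min_pts):
    """Approximate DBSCAN driven by a precomputed K-nearest-neighbor graph.

    knn[i]   : indices of point i's K nearest neighbors
    dists[i] : the matching distances
    A point is a core point if at least min_pts of its K neighbors lie
    within eps; clusters grow by breadth-first search over core points.
    Returns one label per point; -1 marks noise.
    """
    n = len(knn)
    core = [sum(d <= eps for d in dists[i]) >= min_pts for i in range(n)]
    labels = [-1] * n
    cid = 0
    for s in range(n):
        if labels[s] != -1 or not core[s]:
            continue
        labels[s] = cid
        q = deque([s])
        while q:
            u = q.popleft()
            for v, d in zip(knn[u], dists[u]):
                if d > eps or labels[v] != -1:
                    continue
                labels[v] = cid          # border or core point joins the cluster
                if core[v]:
                    q.append(v)          # only core points keep expanding
        cid += 1
    return labels
```

Restricting density tests to the K precomputed neighbors is what removes the expensive range queries of classical DBSCAN; the trade-off is that density is approximated, which is acceptable when K is chosen comfortably above min_pts.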
Funding: Jun Li was supported by the National Natural Science Foundation of China (No. U2441215); Lisheng Liu and Xin Lai were supported by the National Natural Science Foundation of China (No. 52494933).
Abstract: Peridynamics (PD) demonstrates unique advantages in addressing fracture problems; however, its nonlocality and meshfree discretization result in high computational and storage costs. Moreover, in its engineering applications, the computational scale of classical GPU parallel schemes is often limited by the finite graphics memory of GPU devices. In the present study, we develop an efficient particle information management strategy based on the cell-linked list method and, on this basis, propose a subdomain-based GPU parallel scheme, which exhibits outstanding acceleration performance in specific compute kernels while significantly reducing graphics memory usage. Compared to the classical parallel scheme, the cell-linked list method facilitates efficient management of particle information within subdomains, enabling the proposed parallel scheme to effectively reduce graphics memory usage by optimizing the size and number of subdomains while significantly improving the speed of neighbor search. As demonstrated in PD examples, the proposed parallel scheme enhances neighbor search efficiency dramatically and achieves a significant speedup relative to serial programs. For instance, without considering the time of data transmission, the proposed scheme achieves a remarkable speedup of nearly 1076.8x in one test case, owing to its excellent computational efficiency in the neighbor search. Additionally, for 2D and 3D PD models with tens of millions of particles, graphics memory usage can be reduced by up to 83.6% and 85.9%, respectively. Therefore, this subdomain-based GPU parallel scheme effectively avoids graphics memory shortages while significantly improving computational efficiency, providing new insights into studying more complex large-scale problems.
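The cell-linked list idea underlying the particle-management strategy can be illustrated compactly: particles are hashed into cells whose edge length equals the PD horizon, so each particle's neighbors are found by scanning only the 9 surrounding cells in 2D (27 in 3D) instead of all N^2 pairs. Below is a minimal serial 2D Python sketch; the paper's scheme is a GPU implementation with subdomain decomposition, which is not reproduced here.

```python
from collections import defaultdict
from math import floor, hypot

def build_cell_list(points, cell):
    """Hash each particle into a square cell of side `cell` (the PD horizon)."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(floor(x / cell), floor(y / cell))].append(idx)
    return cells

def neighbors_within(points, horizon):
    """Neighbor lists within `horizon`, checking only the 3x3 block of
    cells around each particle instead of all N^2 pairs."""
    cells = build_cell_list(points, horizon)
    result = []
    for i, (x, y) in enumerate(points):
        cx, cy = floor(x / horizon), floor(y / horizon)
        near = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), ()):
                    if j != i and hypot(x - points[j][0], y - points[j][1]) <= horizon:
                        near.append(j)
        result.append(sorted(near))
    return result
```

Because the cell size equals the horizon, any neighbor is guaranteed to sit in an adjacent cell, so the scan is exact; on a GPU the per-cell particle lists also bound how much particle data each subdomain must keep resident, which is the memory-saving lever the abstract describes.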
Funding: Supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT) (No. RS-2022-00143178) and the Ministry of Education (MOE) (Nos. 2022R1A6A3A13053896 and 2022R1F1A1074616), Republic of Korea.
Abstract: Beam-tracking simulations have been extensively utilized in the study of collective beam instabilities in circular accelerators. Traditionally, many simulation codes have relied on central processing unit (CPU)-based methods, tracking on a single CPU core or parallelizing the computation across multiple cores via the message passing interface (MPI). Although these approaches work well for single-bunch tracking, scaling them to multiple bunches significantly increases the computational load, which often necessitates a dedicated multi-CPU cluster. To address this challenge, alternative methods leveraging general-purpose computing on graphics processing units (GPGPU) have been proposed, enabling tracking studies on a standalone desktop personal computer (PC). However, frequent CPU-GPU interactions, including data transfers and synchronization operations during tracking, can introduce communication overheads, potentially reducing the overall effectiveness of GPU-based computation. In this study, we propose a novel approach that eliminates this overhead by performing the entire tracking simulation exclusively on the GPU, thereby enabling the simultaneous processing of all bunches and their macro-particles. Specifically, we introduce MBTRACK2-CUDA, a Compute Unified Device Architecture (CUDA) port of MBTRACK2, which facilitates efficient tracking of single- and multi-bunch collective effects by leveraging fully GPU-resident computation.
Abstract: Optical coherence tomography (OCT), particularly swept-source OCT, is widely employed in medical diagnostics and industrial inspection owing to its high-resolution imaging capabilities. However, swept-source OCT 3D imaging often suffers from stripe artifacts caused by unstable light sources, system noise, and environmental interference, posing challenges to real-time processing of large-scale datasets. To address this issue, this study introduces a real-time reconstruction system that integrates stripe-artifact suppression and parallel computing on a graphics processing unit (GPU). The approach employs a frequency-domain filtering algorithm with adaptive suppression parameters, dynamically adjusted through an image quality evaluation function and optimized using a convolutional neural network for complex frequency-domain feature learning. Additionally, a GPU-integrated 3D reconstruction framework is developed, enhancing data-processing throughput and real-time performance via a dual-queue decoupling mechanism. Experimental results demonstrate significant improvements in structural similarity (0.92), peak signal-to-noise ratio (31.62 dB), and stripe suppression ratio (15.73 dB) compared with existing methods. On an RTX 4090 platform, the proposed system achieved an end-to-end delay of 94.36 ms, a frame rate of 10.3 frames per second, and a throughput of 121.5 million voxels per second, effectively suppressing artifacts while preserving image details and enhancing real-time 3D reconstruction performance.
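The frequency-domain principle behind stripe suppression is that stripes which are nearly constant along one image axis concentrate their spectral energy on a single line through the origin of the 2D spectrum, where they can be notched out. The NumPy sketch below demonstrates this with a fixed Gaussian notch; the adaptive, CNN-optimized parameter selection described above is not reproduced, and all names and default values are illustrative.

```python
import numpy as np

def suppress_stripes(img, width=1.0, strength=0.95):
    """Damp the kx = 0 spectral column (except the DC term), which carries
    stripes that are constant along x, using a Gaussian notch filter."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    ky = np.arange(h) - h // 2
    kx = np.arange(w) - w // 2
    KX, KY = np.meshgrid(kx, ky)
    # notch: suppress kx ~ 0; narrow `width` spares genuine image content
    notch = 1.0 - strength * np.exp(-(KX / width) ** 2)
    notch[h // 2, w // 2] = 1.0            # preserve mean brightness (DC)
    return np.fft.ifft2(np.fft.ifftshift(F * notch)).real
```

In the adaptive scheme the abstract describes, `width` and `strength` would be tuned per volume rather than fixed; the FFT-multiply-IFFT pipeline itself maps directly onto GPU batch FFT libraries, which is what makes real-time throughput plausible.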
Funding: Supported by the National Natural Science Foundation of China (NSFC) Basic Science Center Program for "Multiscale Problems in Nonlinear Mechanics" (Grant No. 11988102) and an NSFC project (Grant No. 11972038).
Abstract: A computational fluid dynamics (CFD) solver for a GPU/CPU heterogeneous parallel computing platform is developed to simulate incompressible flows on billion-level grid points. To solve the Poisson equation, the conjugate gradient method is used as the basic solver, and a Chebyshev method combined with a Jacobi sub-preconditioner is used as the preconditioner. The developed CFD solver shows good parallel efficiency, exceeding 90% in the weak-scalability test when the number of grid points allocated to each GPU card is greater than 208^3. In the acceleration test, a simulation with 1040^3 grid points on 125 GPU cards runs 203.6x faster than on the same number of CPU cores. The solver is then tested on a two-dimensional lid-driven cavity flow and the three-dimensional Taylor-Green vortex flow. The results are consistent with previous results in the literature.
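The building block of the Poisson solve, a preconditioned conjugate gradient iteration, can be sketched in a few lines of NumPy. For simplicity the sketch applies a plain Jacobi (diagonal) preconditioner in the step where the solver described above substitutes a Chebyshev iteration with a Jacobi sub-preconditioner; the function is illustrative, not code from the paper.

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-10, maxiter=500):
    """Preconditioned conjugate gradient for a symmetric positive-definite A.

    The preconditioner solve z = M^{-1} r is a Jacobi (diagonal) sweep here;
    a Chebyshev iteration could replace this line without changing the rest.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv_diag * r               # preconditioner application
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

A polynomial (Chebyshev) preconditioner is attractive on GPUs precisely because, unlike triangular solves, it consists only of matrix-vector products and vector updates, which parallelize cleanly across many cards.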
Funding: Financially supported by the National Natural Science Foundation of China (No. 41174085).
Abstract: Organic reefs, the targets of deep-water petroleum exploration, are widely developed in the Xisha area. However, there are concealed igneous rocks under the sea whose wave impedance is nearly equal to that of the organic reefs, so the igneous rocks interfere with exploration by producing similar seismic reflection characteristics. Yet the density and magnetism of organic reefs differ markedly from those of igneous rocks, which gives gravity and magnetic data clear advantages for distinguishing the two. First, frequency decomposition was applied to the free-air gravity anomaly in the Xisha area to obtain a 2D subdivision of the gravity and magnetic anomalies in the vertical direction; the horizontal distribution of igneous rocks can then be derived from the high-frequency field, the low-frequency field, and the physical properties of the rocks. Next, 3D forward modeling of the gravitational field was carried out to establish a density model of the area, with rock physical properties taken from previous research. Furthermore, 3D inversion of the gravity anomaly using a genetic algorithm with graphic processing unit (GPU) parallel processing was applied in the Xisha target area, and the 3D density structure of the area was obtained. In this way, the igneous rocks can be confined to certain depths according to their density. Frequency decomposition and GPU-parallel genetic-algorithm 3D inversion of the gravity anomaly proved to be useful methods for locating igneous rocks in their 3D geological position. Organic reefs and igneous rocks can thus be distinguished, providing valuable prior information for further exploration.
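The frequency decomposition step can be illustrated with a radial wavenumber cutoff: the 2D anomaly grid is transformed to the wavenumber domain, and the low-frequency (regional, deep-source) and high-frequency (local, shallow-source) fields are separated by masking. The NumPy sketch below uses a hard cutoff for clarity; practical regional-residual separation often uses smoother tapers, and the function is illustrative, not the paper's code.

```python
import numpy as np

def frequency_decompose(grid, k_cut):
    """Split a 2D anomaly grid into low- and high-frequency fields by a
    radial wavenumber cutoff k_cut (in FFT bin units)."""
    F = np.fft.fftshift(np.fft.fft2(grid))
    h, w = grid.shape
    ky = np.arange(h) - h // 2
    kx = np.arange(w) - w // 2
    KX, KY = np.meshgrid(kx, ky)
    low_mask = (KX ** 2 + KY ** 2) <= k_cut ** 2   # keep long wavelengths
    low = np.fft.ifft2(np.fft.ifftshift(F * low_mask)).real
    return low, grid - low                          # regional, residual
```

Because deeper sources contribute longer-wavelength anomalies, thresholding in the wavenumber domain is a standard first-pass way to attribute parts of the observed field to different depth ranges before formal 3D inversion.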
Funding: Supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China (No. 2013ZX06002001-007), the National Key Scientific Instrument and Equipment Development Projects, China (No. 2012YQ180118), and the National Natural Science Foundation of China (Nos. 11275110, 11075091 and 11105081).
Funding: Supported by a Grant-in-Aid for Scientific Research on Innovative Areas "Molecular Robotics" (No. 24104004) from the Ministry of Education, Culture, Sports, Science, and Technology, Japan.
Abstract: A microtubule gliding assay is a biological experiment that observes the dynamics of microtubules driven by motor proteins fixed on a glass surface. When appropriate microtubule interactions are set up in gliding assay experiments, microtubules often organize and create higher-level dynamics such as ring and bundle structures. In order to reproduce such higher-level dynamics on computers, we have been focusing on building a real-time 3D microtubule simulation. This real-time 3D microtubule simulation enables us to gain more knowledge of microtubule dynamics and their swarm movements by adjusting simulation parameters in a real-time fashion. One of the technical challenges in creating a real-time 3D simulation is balancing the 3D rendering and the computing performance. Graphics processing unit (GPU) programming plays an essential role in balancing the millions of tasks and makes this real-time 3D simulation possible. By the use of general-purpose computing on graphics processing units (GPGPU), we are able to run the simulation in a massively parallel fashion, even when dealing with more complex interactions between microtubules such as overriding and snuggling. Because performance is an important factor, a performance model has also been constructed from analysis of the microtubule simulation; it is consistent with performance measurements on different GPGPU architectures with regard to the number of cores and clock cycles.
Funding: Supported by a project of the National Natural Science Foundation of China (No. 41874134).
Abstract: Processing large-scale 3-D gravity data is an important topic in geophysics. Many existing inversion methods lack the capacity to process massive data and to be applied in practice. This study applies GPU parallel processing technology to the focusing inversion method, aiming to improve inversion accuracy while speeding up calculation and reducing memory consumption, thus obtaining fast and reliable inversion results for large, complex models. In this paper, equivalent storage of the geometric trellis is used to calculate the sensitivity matrix, and the inversion is based on GPU parallel computing technology. The parallel computing program, optimized by reducing data transfers, access restrictions, and instruction restrictions, as well as by latency hiding, greatly reduces memory usage, speeds up calculation, and makes fast inversion of large models possible. By comparing the computing speed of the traditional single-thread CPU method with CUDA-based GPU parallel technology, the excellent acceleration performance of GPU parallel computing is verified, which opens a path to practical application for theoretical inversion methods otherwise restricted by computing speed and computer memory. Model tests verify that the focusing inversion method can overcome the problems of severe skin effect and ambiguity of geological body boundaries. Moreover, increasing the number of model cells and inversion data more clearly depicts the boundary position of the anomalous body and delineates its specific shape.
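Focusing inversion is commonly formulated with a minimum-support regularizer solved by iteratively reweighted least squares, which drives small model cells toward zero and sharpens body boundaries; this is the property that counters skin effect and boundary ambiguity. Below is a small dense NumPy sketch under those assumptions: the forward operator is a synthetic random matrix standing in for a gravity sensitivity matrix, all names are hypothetical, and the paper's equivalent-storage scheme and GPU parallelization are not reproduced.

```python
import numpy as np

def focusing_inversion(G, d, lam=1e-3, eps=1e-3, n_outer=25):
    """Minimum-support ('focusing') inversion via iteratively
    reweighted least squares:

        minimize ||G m - d||^2 + lam * sum_i m_i^2 / (m_i^2 + eps^2)

    Each outer pass solves a weighted Tikhonov problem with
    W_ii = 1 / sqrt(m_i^2 + eps^2); cells near zero are penalized
    hard, concentrating the recovered anomaly into compact bodies.
    """
    m = np.zeros(G.shape[1])
    for _ in range(n_outer):
        w = 1.0 / np.sqrt(m ** 2 + eps ** 2)      # focusing weights
        A = G.T @ G + lam * np.diag(w ** 2)       # reweighted normal equations
        m = np.linalg.solve(A, G.T @ d)
    return m
```

In a realistic gravity problem the normal-equation solve would itself be an iterative method over a huge sensitivity matrix, which is where the GPU acceleration described above pays off.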