聚类是大规模高维向量数据分析的关键技术之一.近年来,基于密度的聚类算法DBSCAN(density-based spatial clustering of applications with noise)因其无须预先指定聚类数量、能够发现复杂聚类结构并有效识别噪声点的特性,在数据分析领...聚类是大规模高维向量数据分析的关键技术之一.近年来,基于密度的聚类算法DBSCAN(density-based spatial clustering of applications with noise)因其无须预先指定聚类数量、能够发现复杂聚类结构并有效识别噪声点的特性,在数据分析领域得到了广泛应用.然而,现有的基于密度的聚类算法在处理高维向量数据时将产生极高的时间代价且面临维度灾难等问题,难以在实际场景中部署应用.此外,随着信息技术的发展,高维向量数据规模急剧增加,使用CPU进行高维向量聚类在时间代价和可扩展性等方面将面临更大的挑战.为此,提出一种GPU加速的高维向量聚类算法,通过引入K近邻(K-nearest neighbor,KNN)图索引加速DBSCAN的计算.首先,设计了GPU加速的并行K近邻图构建算法,显著降低了K近邻图索引的构建开销.其次,提出了基于层间并行的K-means树分区算法及基于广度优先搜索和核心近邻图的并行聚类算法,改进了DBSCAN算法的计算流程,实现了高并发向量聚类.最后,在真实向量数据集上进行了大量实验,并将所提出的方法与现有方法进行了性能对比.实验结果表明,所提方法在保证聚类精度的前提下,将大规模向量聚类的效率提高了5.7–2822.5倍.展开更多
稀疏矩阵向量乘(SpMV)是稀疏线性系统的计算核心和瓶颈,其运算效率会影响迭代求解器的整体性能,其优化研究一直是科学计算和工程应用领域中的研究热点之一。偏微分方程的离散化会产生稀疏对角矩阵,由于其多样的非零元分布,导致没有一种...稀疏矩阵向量乘(SpMV)是稀疏线性系统的计算核心和瓶颈,其运算效率会影响迭代求解器的整体性能,其优化研究一直是科学计算和工程应用领域中的研究热点之一。偏微分方程的离散化会产生稀疏对角矩阵,由于其多样的非零元分布,导致没有一种方法能够在所有矩阵中取得最优时间性能。针对上述问题,提出一种面向图形处理单元(GPU)的稀疏对角矩阵自适应SpMV优化方法AST(Adaptive SpMV Tuning)。该方法通过设计特征空间,构建特征提取器,提取矩阵结构精细特征,通过深入分析特征和SpMV方法的相关性,建立可扩展的候选方法集合,形成特征和最优方法的映射关系,构建性能预测工具,实现矩阵最优方法的高效预测。实验结果表明,AST能够取得85.8%的预测准确率,平均时间性能损失为0.09,相比于DIA(Diagonal)、HDIA(Hacked DIA)、HDC(Hybrid of DIA and Compressed Sparse Row)、DIA-Adaptive和DRM(Divide-Rearrange and Merge),能够获得平均20.19、1.86、3.06、3.72和1.53倍的内核运行时间加速和1.05、1.28、12.45、1.94和0.97倍的浮点运算性能加速。展开更多
Peridynamics(PD)demonstrates unique advantages in addressing fracture problems,however,its nonlocality and meshfree discretization result in high computational and storage costs.Moreover,in its engineering application...Peridynamics(PD)demonstrates unique advantages in addressing fracture problems,however,its nonlocality and meshfree discretization result in high computational and storage costs.Moreover,in its engineering applications,the computational scale of classical GPU parallel schemes is often limited by the finite graphics memory of GPU devices.In the present study,we develop an efficient particle information management strategy based on the cell-linked list method and on this basis propose a subdomain-based GPU parallel scheme,which exhibits outstanding acceleration performance in specific compute kernels while significantly reducing graphics memory usage.Compared to the classical parallel scheme,the cell-linked list method facilitates efficient management of particle information within subdomains,enabling the proposed parallel scheme to effectively reduce graphics memory usage by optimizing the size and number of subdomains while significantly improving the speed of neighbor search.As demonstrated in PD examples,the proposed parallel scheme enhances the neighbor search efficiency dramatically and achieves a significant speedup relative to serial programs.For instance,without considering the time of data transmission,the proposed scheme achieves a remarkable speedup of nearly 1076.8×in one test case,due to its excellent computational efficiency in the neighbor search.Additionally,for 2D and 3D PD models with tens of millions of particles,the graphics memory usage can be reduced up to 83.6%and 85.9%,respectively.Therefore,this subdomain-based GPU parallel scheme effectively avoids graphics memory shortages while significantly improving the computational efficiency,providing new insights into studying more complex large-scale problems.展开更多
Magnetic Resonance Imaging(MRI)has a pivotal role in medical image analysis,for its ability in supporting disease detection and diagnosis.Fuzzy C-Means(FCM)clustering is widely used for MRI segmentation due to its abi...Magnetic Resonance Imaging(MRI)has a pivotal role in medical image analysis,for its ability in supporting disease detection and diagnosis.Fuzzy C-Means(FCM)clustering is widely used for MRI segmentation due to its ability to handle image uncertainty.However,the latter still has countless limitations,including sensitivity to initialization,susceptibility to local optima,and high computational cost.To address these limitations,this study integrates Grey Wolf Optimization(GWO)with FCM to enhance cluster center selection,improving segmentation accuracy and robustness.Moreover,to further refine optimization,Fuzzy Entropy Clustering was utilized for its distinctive features from other traditional objective functions.Fuzzy entropy effectively quantifies uncertainty,leading to more well-defined clusters,improved noise robustness,and better preservation of anatomical structures in MRI images.Despite these advantages,the iterative nature of GWO and FCM introduces significant computational overhead,which restricts their applicability to high-resolution medical images.To overcome this bottleneck,we propose a Parallelized-GWO-based FCM(P-GWO-FCM)approach using GPU acceleration,where both GWO optimization and FCM updates(centroid computation and membership matrix updates)are parallelized.By concurrently executing these processes,our approach efficiently distributes the computational workload,significantly reducing execution time while maintaining high segmentation accuracy.The proposed parallel method,P-GWO-FCM,was evaluated on both simulated and clinical brain MR images,focusing on segmenting white matter,gray matter,and cerebrospinal fluid regions.The results indicate significant improvements in segmentation accuracy,achieving a Jaccard Similarity(JS)of 0.92,a Partition Coefficient Index(PCI)of 0.91,a Partition Entropy Index(PEI)of 0.25,and a Davies-Bouldin Index(DBI)of 0.30.Experimental comparisons demonstrate that P-GWO-FCM outperforms existing methods in both segmentation accuracy and computational efficiency,making it a promising solution for real-time medical image segmentation.展开更多
Computational phantoms play an essential role in radiation dosimetry and health physics.Although mesh-type phantoms offer a high resolution and adjustability,their use in dose calculations is limited by their slow com...Computational phantoms play an essential role in radiation dosimetry and health physics.Although mesh-type phantoms offer a high resolution and adjustability,their use in dose calculations is limited by their slow computational speed.Progress in heterogeneous computing has allowed for substantial acceleration in the computation of mesh-type phantoms by utilizing hardware accelerators.In this study,a GPU-accelerated Monte Carlo method was developed to expedite the dose calculation for mesh-type computational phantoms.This involved designing and implementing the entire procedural flow of a GPUaccelerated Monte Carlo program.We employed acceleration structures to process the mesh-type phantom,optimized the traversal methodology,and achieved a flattened structure to overcome the limitations of GPU stack depths.Particle transport methods were realized within the mesh-type phantom,encompassing particle location and intersection techniques.In response to typical external irradiation scenarios,we utilized Geant4 along with the GPU program and its CPU serial code for dose calculations,assessing both computational accuracy and efficiency.In comparison with the benchmark simulated using Geant4 on the CPU using one thread,the relative differences in the organ dose calculated by the GPU program predominantly lay within a margin of 5%,whereas the computational time was reduced by a factor ranging from 120 to 2700.To the best of our knowledge,this study achieved a GPU-accelerated dose calculation method for mesh-type phantoms for the first time,reducing the computational time from hours to seconds per simulation of ten million particles and offering a swift and precise Monte Carlo method for dose calculation in mesh-type computational phantoms.展开更多
Beam-tracking simulations have been extensively utilized in the study of collective beam instabilities in circular accelerators.Traditionally,many simulation codes have relied on central processing unit(CPU)-based met...Beam-tracking simulations have been extensively utilized in the study of collective beam instabilities in circular accelerators.Traditionally,many simulation codes have relied on central processing unit(CPU)-based methods,tracking on a single CPU core,or parallelizing the computation across multiple cores via the message passing interface(MPI).Although these approaches work well for single-bunch tracking,scaling them to multiple bunches significantly increases the computational load,which often necessitates the use of a dedicated multi-CPU cluster.To address this challenge,alternative methods leveraging General-Purpose computing on Graphics Processing Units(GPGPU)have been proposed,enabling tracking studies on a standalone desktop personal computer(PC).However,frequent CPU-GPU interactions,including data transfers and synchronization operations during tracking,can introduce communication overheads,potentially reducing the overall effectiveness of GPU-based computations.In this study,we propose a novel approach that eliminates this overhead by performing the entire tracking simulation process exclusively on the GPU,thereby enabling the simultaneous processing of all bunches and their macro-particles.Specifically,we introduce MBTRACK2-CUDA,a Compute Unified Device Architecture(CUDA)ported version of MBTRACK2,which facilitates efficient tracking of single-and multi-bunch collective effects by leveraging the full GPU-resident computation.展开更多
文摘聚类是大规模高维向量数据分析的关键技术之一.近年来,基于密度的聚类算法DBSCAN(density-based spatial clustering of applications with noise)因其无须预先指定聚类数量、能够发现复杂聚类结构并有效识别噪声点的特性,在数据分析领域得到了广泛应用.然而,现有的基于密度的聚类算法在处理高维向量数据时将产生极高的时间代价且面临维度灾难等问题,难以在实际场景中部署应用.此外,随着信息技术的发展,高维向量数据规模急剧增加,使用CPU进行高维向量聚类在时间代价和可扩展性等方面将面临更大的挑战.为此,提出一种GPU加速的高维向量聚类算法,通过引入K近邻(K-nearest neighbor,KNN)图索引加速DBSCAN的计算.首先,设计了GPU加速的并行K近邻图构建算法,显著降低了K近邻图索引的构建开销.其次,提出了基于层间并行的K-means树分区算法及基于广度优先搜索和核心近邻图的并行聚类算法,改进了DBSCAN算法的计算流程,实现了高并发向量聚类.最后,在真实向量数据集上进行了大量实验,并将所提出的方法与现有方法进行了性能对比.实验结果表明,所提方法在保证聚类精度的前提下,将大规模向量聚类的效率提高了5.7–2822.5倍.
文摘稀疏矩阵向量乘(SpMV)是稀疏线性系统的计算核心和瓶颈,其运算效率会影响迭代求解器的整体性能,其优化研究一直是科学计算和工程应用领域中的研究热点之一。偏微分方程的离散化会产生稀疏对角矩阵,由于其多样的非零元分布,导致没有一种方法能够在所有矩阵中取得最优时间性能。针对上述问题,提出一种面向图形处理单元(GPU)的稀疏对角矩阵自适应SpMV优化方法AST(Adaptive SpMV Tuning)。该方法通过设计特征空间,构建特征提取器,提取矩阵结构精细特征,通过深入分析特征和SpMV方法的相关性,建立可扩展的候选方法集合,形成特征和最优方法的映射关系,构建性能预测工具,实现矩阵最优方法的高效预测。实验结果表明,AST能够取得85.8%的预测准确率,平均时间性能损失为0.09,相比于DIA(Diagonal)、HDIA(Hacked DIA)、HDC(Hybrid of DIA and Compressed Sparse Row)、DIA-Adaptive和DRM(Divide-Rearrange and Merge),能够获得平均20.19、1.86、3.06、3.72和1.53倍的内核运行时间加速和1.05、1.28、12.45、1.94和0.97倍的浮点运算性能加速。
基金Jun Li was supported by National Natural Science Foundation of China(No.:U2441215)Lisheng Liu and Xin Lai were supported by National Natural Science Foundation of China(No.:52494933).
文摘Peridynamics(PD)demonstrates unique advantages in addressing fracture problems,however,its nonlocality and meshfree discretization result in high computational and storage costs.Moreover,in its engineering applications,the computational scale of classical GPU parallel schemes is often limited by the finite graphics memory of GPU devices.In the present study,we develop an efficient particle information management strategy based on the cell-linked list method and on this basis propose a subdomain-based GPU parallel scheme,which exhibits outstanding acceleration performance in specific compute kernels while significantly reducing graphics memory usage.Compared to the classical parallel scheme,the cell-linked list method facilitates efficient management of particle information within subdomains,enabling the proposed parallel scheme to effectively reduce graphics memory usage by optimizing the size and number of subdomains while significantly improving the speed of neighbor search.As demonstrated in PD examples,the proposed parallel scheme enhances the neighbor search efficiency dramatically and achieves a significant speedup relative to serial programs.For instance,without considering the time of data transmission,the proposed scheme achieves a remarkable speedup of nearly 1076.8×in one test case,due to its excellent computational efficiency in the neighbor search.Additionally,for 2D and 3D PD models with tens of millions of particles,the graphics memory usage can be reduced up to 83.6%and 85.9%,respectively.Therefore,this subdomain-based GPU parallel scheme effectively avoids graphics memory shortages while significantly improving the computational efficiency,providing new insights into studying more complex large-scale problems.
文摘Magnetic Resonance Imaging(MRI)has a pivotal role in medical image analysis,for its ability in supporting disease detection and diagnosis.Fuzzy C-Means(FCM)clustering is widely used for MRI segmentation due to its ability to handle image uncertainty.However,the latter still has countless limitations,including sensitivity to initialization,susceptibility to local optima,and high computational cost.To address these limitations,this study integrates Grey Wolf Optimization(GWO)with FCM to enhance cluster center selection,improving segmentation accuracy and robustness.Moreover,to further refine optimization,Fuzzy Entropy Clustering was utilized for its distinctive features from other traditional objective functions.Fuzzy entropy effectively quantifies uncertainty,leading to more well-defined clusters,improved noise robustness,and better preservation of anatomical structures in MRI images.Despite these advantages,the iterative nature of GWO and FCM introduces significant computational overhead,which restricts their applicability to high-resolution medical images.To overcome this bottleneck,we propose a Parallelized-GWO-based FCM(P-GWO-FCM)approach using GPU acceleration,where both GWO optimization and FCM updates(centroid computation and membership matrix updates)are parallelized.By concurrently executing these processes,our approach efficiently distributes the computational workload,significantly reducing execution time while maintaining high segmentation accuracy.The proposed parallel method,P-GWO-FCM,was evaluated on both simulated and clinical brain MR images,focusing on segmenting white matter,gray matter,and cerebrospinal fluid regions.The results indicate significant improvements in segmentation accuracy,achieving a Jaccard Similarity(JS)of 0.92,a Partition Coefficient Index(PCI)of 0.91,a Partition Entropy Index(PEI)of 0.25,and a Davies-Bouldin Index(DBI)of 0.30.Experimental comparisons demonstrate that P-GWO-FCM outperforms existing methods in both segmentation accuracy and computational efficiency,making it a promising solution for real-time medical image segmentation.
基金supported by the National Natural Science Foundation of China(Nos.U2167209 and 12375312)Open-end Fund Projects of China Institute for Radiation Protection Scientific Research Platform(CIRP-HYYFZH-2023ZD001).
文摘Computational phantoms play an essential role in radiation dosimetry and health physics.Although mesh-type phantoms offer a high resolution and adjustability,their use in dose calculations is limited by their slow computational speed.Progress in heterogeneous computing has allowed for substantial acceleration in the computation of mesh-type phantoms by utilizing hardware accelerators.In this study,a GPU-accelerated Monte Carlo method was developed to expedite the dose calculation for mesh-type computational phantoms.This involved designing and implementing the entire procedural flow of a GPUaccelerated Monte Carlo program.We employed acceleration structures to process the mesh-type phantom,optimized the traversal methodology,and achieved a flattened structure to overcome the limitations of GPU stack depths.Particle transport methods were realized within the mesh-type phantom,encompassing particle location and intersection techniques.In response to typical external irradiation scenarios,we utilized Geant4 along with the GPU program and its CPU serial code for dose calculations,assessing both computational accuracy and efficiency.In comparison with the benchmark simulated using Geant4 on the CPU using one thread,the relative differences in the organ dose calculated by the GPU program predominantly lay within a margin of 5%,whereas the computational time was reduced by a factor ranging from 120 to 2700.To the best of our knowledge,this study achieved a GPU-accelerated dose calculation method for mesh-type phantoms for the first time,reducing the computational time from hours to seconds per simulation of ten million particles and offering a swift and precise Monte Carlo method for dose calculation in mesh-type computational phantoms.
基金supported by the National Research Foundation of Korea(NRF)funded by the Ministry of Science and ICT(MSIT)(No.RS-2022-00143178)the Ministry of Education(MOE)(Nos.2022R1A6A3A13053896 and 2022R1F1A1074616),Republic of Korea.
文摘Beam-tracking simulations have been extensively utilized in the study of collective beam instabilities in circular accelerators.Traditionally,many simulation codes have relied on central processing unit(CPU)-based methods,tracking on a single CPU core,or parallelizing the computation across multiple cores via the message passing interface(MPI).Although these approaches work well for single-bunch tracking,scaling them to multiple bunches significantly increases the computational load,which often necessitates the use of a dedicated multi-CPU cluster.To address this challenge,alternative methods leveraging General-Purpose computing on Graphics Processing Units(GPGPU)have been proposed,enabling tracking studies on a standalone desktop personal computer(PC).However,frequent CPU-GPU interactions,including data transfers and synchronization operations during tracking,can introduce communication overheads,potentially reducing the overall effectiveness of GPU-based computations.In this study,we propose a novel approach that eliminates this overhead by performing the entire tracking simulation process exclusively on the GPU,thereby enabling the simultaneous processing of all bunches and their macro-particles.Specifically,we introduce MBTRACK2-CUDA,a Compute Unified Device Architecture(CUDA)ported version of MBTRACK2,which facilitates efficient tracking of single-and multi-bunch collective effects by leveraging the full GPU-resident computation.