Journal Articles
385 articles found
Compute Unified Device Architecture Implementation of Euler/Navier-Stokes Solver on Graphics Processing Unit Desktop Platform for 2-D Compressible Flows
Authors: Zhang Jiale, Chen Hongquan. Transactions of Nanjing University of Aeronautics and Astronautics (EI, CSCD), 2016(5): 536-545 (10 pages)
A personal desktop platform with teraflops peak performance across thousands of cores can be realized at the price of a conventional workstation using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The techniques of implementing CUDA kernels, a double-layered thread hierarchy, and a varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated on a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved on a single-GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
Keywords: graphics processing unit (GPU), GPU parallel computing, compute unified device architecture (CUDA) Fortran, finite volume method (FVM), acceleration
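The "double-layered thread hierarchy" this abstract refers to is CUDA's grid-of-blocks / block-of-threads decomposition. A minimal sketch, not the paper's code, of how global 2-D cell indices are derived from block and thread indices (the 16x16 tile size is an assumed value):

```python
# Illustrative sketch: covering an nx x ny flow field with CUDA's
# double-layered hierarchy of thread blocks and threads per block.
import math

def global_indices(nx, ny, tile_x=16, tile_y=16):
    """Yield (i, j) cell indices exactly as a CUDA kernel would compute
    them from blockIdx/threadIdx, skipping out-of-range ghost threads."""
    grid_x = math.ceil(nx / tile_x)   # number of blocks along x
    grid_y = math.ceil(ny / tile_y)   # number of blocks along y
    for bx in range(grid_x):
        for by in range(grid_y):
            for tx in range(tile_x):
                for ty in range(tile_y):
                    i = bx * tile_x + tx   # i = blockIdx.x*blockDim.x + threadIdx.x
                    j = by * tile_y + ty
                    if i < nx and j < ny:  # boundary guard, as in a real kernel
                        yield i, j

cells = set(global_indices(100, 60))
print(len(cells))  # every cell covered exactly once
```

The boundary guard matters because the grid of blocks overshoots any domain whose extent is not a multiple of the tile size.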
Multi-relaxation-time lattice Boltzmann simulations of lid driven flows using graphics processing unit
Authors: Chenggong Li, J.P.Y. Maa. Applied Mathematics and Mechanics (English Edition) (SCIE, EI, CSCD), 2017(5): 707-722 (16 pages)
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine-velocity-component (D2Q9) lattice Boltzmann equation (LBE) with multiple relaxation times (MRT) to simulate incompressible turbulent cavity flows at Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the LBM for numerical simulations of turbulent flows, the massively parallel computing power of a graphics processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with results from other studies, with a 76-fold increase in computational efficiency. It appears that the higher the Reynolds number, the smaller the Smagorinsky constant should be if the lattice number is fixed. Also, for a selected high Reynolds number and a properly selected Smagorinsky constant, there is a minimum requirement on the lattice number so that the Smagorinsky eddy viscosity does not become excessively large.
Keywords: large eddy simulation (LES), multi-relaxation-time (MRT), lattice Boltzmann equation (LBE), two-dimensional nine velocity components (D2Q9), Smagorinsky model, graphics processing unit (GPU), compute unified device architecture (CUDA)
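The Smagorinsky closure mentioned above computes a local eddy viscosity from the resolved strain rate, which is then folded into the LBE relaxation time. A minimal sketch of that relation (the constants below are illustrative, not the paper's calibrated values):

```python
# Sketch of the Smagorinsky subgrid eddy viscosity used in LES-LBM
# couplings: nu_t = (Cs * delta)^2 * |S|, where |S| is the magnitude
# of the resolved strain-rate tensor. All values are illustrative.
def smagorinsky_viscosity(strain_rate_magnitude, cs=0.1, delta=1.0):
    """Eddy viscosity from the Smagorinsky model (lattice units)."""
    return (cs * delta) ** 2 * strain_rate_magnitude

def total_relaxation_viscosity(nu_molecular, strain_rate_magnitude, cs=0.1):
    """In MRT-LBE-LES the effective viscosity entering the relaxation
    time is the molecular plus the eddy contribution."""
    return nu_molecular + smagorinsky_viscosity(strain_rate_magnitude, cs)

print(total_relaxation_viscosity(1e-6, 2.0))
```

The abstract's observation follows directly from this formula: at fixed lattice spacing `delta`, a larger `cs` or a coarser grid inflates `nu_t`, which is why high Reynolds numbers call for a smaller Smagorinsky constant or a finer lattice.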
Graphic Processing Unit Based Phase Retrieval and CT Reconstruction for Differential X-Ray Phase Contrast Imaging
Authors: Chen Xiaoqing, Wang Yujie, Sun Jianqi. Journal of Shanghai Jiaotong University (Science) (EI), 2014(5): 550-554 (5 pages)
Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among the modalities of phase-contrast imaging, grating-based phase contrast imaging has been widely accepted owing to its wide range of sample selections and its exemption from a coherent source. However, the downside is the substantially larger amount of data generated by the phase-stepping method, which slows down the reconstruction process. The graphics processing unit (GPU) has the advantage of allowing parallel computing, which is very useful for processing large quantities of data. In this paper, a compute unified device architecture (CUDA) C program based on the GPU is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program shows varying amounts of speedup over the standard C program on the same Visual Studio 2010 platform, and the speedup ratio increases as the data size increases.
Keywords: grating-based phase contrast imaging, parallel computing, graphics processing unit (GPU), compute unified device architecture (CUDA), filtered back projection (FBP)
Graphic Processing Unit-Accelerated Neural Network Model for Biological Species Recognition
Authors: Wen Chenglu, Pan Wei, Chen Xiaoxi, Zhu Qingyuan. Journal of Donghua University (English Edition) (EI, CAS), 2012(1): 5-8 (4 pages)
A graphics processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted here overcomes the small-input limitation of traditional neural networks: the whole image is taken as the network's input, so maximal features can be kept for recognition. To speed up the recognition process, a fast implementation of the partially connected neural network was developed on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were used to test the GPU implementation against its serial CPU counterpart. Experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, achieving a 343-fold speedup over the CPU implementation. Compared with a feature-based recognition method on the same task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
Keywords: graphics processing unit (GPU), compute unified device architecture (CUDA), neural network, species recognition
Developing Extensible Lattice-Boltzmann Simulators for General-Purpose Graphics-Processing Units
Authors: Stuart D.C. Walsh, Martin O. Saar. Communications in Computational Physics (SCIE), 2013(3): 867-879 (13 pages)
Lattice-Boltzmann methods are versatile numerical modeling techniques capable of reproducing a wide variety of fluid-mechanical behavior. These methods are well suited to parallel implementation, particularly in the single-instruction multiple-data (SIMD) parallel processing environments found in computer graphics processing units (GPUs). Although recent programming tools dramatically improve the ease with which GPU-based applications can be written, the programming environment still lacks the flexibility available to more traditional CPU programs. In particular, it may be difficult to develop modular and extensible programs that require variable on-device functionality with current GPU architectures. This paper describes a process of automatic code generation that overcomes these difficulties for lattice-Boltzmann simulations. It details the development of GPU-based modules for an extensible lattice-Boltzmann simulation package, LBHydra. The performance of the automatically generated code is compared to that of equivalent purpose-written codes for single-phase, multiphase, and multicomponent flows. The flexibility of the new method is demonstrated by simulating a rising, dissolving droplet moving through a porous medium with user-generated lattice-Boltzmann models and subroutines.
Keywords: lattice-Boltzmann methods, graphics processing units, computational fluid dynamics
Parallel Image Processing: Taking Grayscale Conversion Using OpenMP as an Example (cited: 1)
Authors: Bayan AlHumaidan, Shahad Alghofaily, Maitha Al Qhahtani, Sara Oudah, Naya Nagy. Journal of Computer and Communications, 2024(2): 1-10 (10 pages)
In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, has ushered in a new era of efficiency and speed. This trend has been particularly noteworthy in image processing, which has witnessed significant advancements. This parallel computing project explored parallel image processing, with a focus on the grayscale conversion of color images. Our approach integrated OpenMP into our framework to parallelize this critical image processing task, strategically enhancing the performance of the conversion by distributing the workload across multiple threads. The primary objectives were to optimize computation time and improve overall efficiency: utilizing OpenMP for concurrent processing across multiple cores significantly reduced execution times through effective task distribution. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency as the number of cores increases, underscoring the importance of a carefully optimized parallelization strategy that considers factors such as load balancing and minimizing communication overhead. Despite these challenges, the overall scalability and efficiency achieved with parallel image processing underscore OpenMP's effectiveness in accelerating image manipulation tasks.
Keywords: parallel computing, image processing, OpenMP, parallel programming, high performance computing, graphics processing unit (GPU)
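The row-parallel structure this abstract describes can be sketched without OpenMP itself; the Python version below stands in for `#pragma omp parallel for` over the row loop. The luminosity weights are the common ITU-R BT.601 values, an assumption, since the paper's exact formula is not quoted here:

```python
# Sketch of row-parallel grayscale conversion in the spirit of the
# paper's OpenMP approach: each row is an independent task, so rows
# can be distributed across threads with no communication.
from concurrent.futures import ThreadPoolExecutor

def gray_row(row):
    """Convert one row of (R, G, B) pixels to grayscale intensities."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]

def to_grayscale(image, workers=4):
    """Distribute rows across a thread pool, mirroring an OpenMP
    'parallel for' over the outer row loop."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gray_row, image))

img = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
print(to_grayscale(img))  # [[76, 150], [29, 255]]
```

Because each output row depends only on its input row, the work divides cleanly, which is exactly why the paper sees near-linear speedup on large images before scheduling overhead sets in.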
Real-time 3D Microtubule Gliding Simulation Accelerated by GPU Computing
Authors: Gregory Gutmann, Daisuke Inoue, Akira Kakugo, Akihiko Konagaya. International Journal of Automation and Computing (EI, CSCD), 2016(2): 108-116 (9 pages)
A microtubule gliding assay is a biological experiment observing the dynamics of microtubules driven by motor proteins fixed on a glass surface. When appropriate microtubule interactions are set up in gliding assay experiments, microtubules often organize and create higher-level dynamics such as ring and bundle structures. In order to reproduce such higher-level dynamics on computers, we have been focusing on building a real-time 3D microtubule simulation. This simulation enables us to gain more knowledge of microtubule dynamics and their swarm movements by adjusting simulation parameters in a real-time fashion. One of the technical challenges in creating a real-time 3D simulation is balancing 3D rendering against computing performance. Graphics processing unit (GPU) programming plays an essential role in balancing the millions of tasks and makes this real-time 3D simulation possible. Through general-purpose computing on graphics processing units (GPGPU) programming, we are able to run the simulation in a massively parallel fashion, even when dealing with more complex interactions between microtubules such as overriding and snuggling. Because performance is an important factor, a performance model has also been constructed from an analysis of the microtubule simulation; it is consistent with performance measurements on different GPGPU architectures with regard to the number of cores and clock cycles.
Keywords: microtubule gliding assay, 3D computer graphics and simulation, parallel computing, performance analysis, general-purpose computing on graphics processing units (GPGPU), compute unified device architecture (CUDA), DirectX
Parallelizing maximum likelihood classification on computer cluster and graphics processing unit for supervised image classification
Authors: Xuan Shi, Bowei Xue. International Journal of Digital Earth (SCIE, EI), 2017(7): 737-748 (12 pages)
Supervised image classification has been widely utilized in a variety of remote sensing applications. As large volumes of satellite imagery and aerial photos become increasingly available, high-performance image processing solutions are required to handle large amounts of data. This paper introduces how the maximum likelihood classification approach is parallelized for implementation on a computer cluster and a graphics processing unit to achieve high performance when processing big imagery data. The solution is scalable and satisfies the need for change detection, object identification, and exploratory analysis on large-scale high-resolution imagery data in remote sensing applications.
Keywords: maximum likelihood classification, supervised classification, parallel computing, graphics processing unit
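Maximum likelihood classification assigns each pixel to the class whose Gaussian model gives it the highest likelihood; because every pixel is scored independently, the loop parallelizes trivially across cluster nodes or GPU threads, which is the structure the paper exploits. A hedged sketch with diagonal covariances for brevity (class statistics below are made-up numbers):

```python
# Per-pixel maximum likelihood classification with class-wise Gaussian
# models (diagonal covariance for brevity). Each pixel is independent,
# so the outer loop over pixels can be split across nodes or threads.
import math

def discriminant(x, mean, var):
    """Log-likelihood of pixel vector x under a diagonal Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))

def classify(pixel, classes):
    """Assign the class whose Gaussian gives the highest likelihood."""
    return max(classes, key=lambda c: discriminant(pixel, *classes[c]))

# Hypothetical two-band class statistics: (mean per band, variance per band)
classes = {"water": ([30.0, 40.0], [25.0, 25.0]),
           "soil":  ([120.0, 90.0], [100.0, 100.0])}
print(classify([35.0, 45.0], classes))  # water
```

Full-covariance models add a Mahalanobis term and a log-determinant but keep the same per-pixel independence.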
Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit (cited: 2)
Authors: Ke-shi Ge, Hua-you Su, Dong-sheng Li, Xi-cheng Lu. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2017(7): 915-927 (13 pages)
The density peak (DP) algorithm has been widely used in scientific research due to its novel and effective peak-density-based clustering approach. However, the DP algorithm uses each pair of data points several times when determining cluster centers, yielding high computational complexity. In this paper, we focus on accelerating the time-consuming DP algorithm with a graphics processing unit (GPU). We analyze the principle of the algorithm to locate its computational bottlenecks and evaluate its potential for parallelism. In light of this analysis, we propose an efficient parallel DP algorithm targeting the GPU architecture and implement it with the compute unified device architecture (CUDA), called the 'CUDA-DP platform'. Specifically, we use shared memory to improve data locality, which reduces the amount of global memory access. To exploit the coalesced memory access mechanism of the GPU, we convert the data structure of the CUDA-DP program from array of structures to structure of arrays. In addition, we introduce a binary search-and-sampling method to avoid sorting a large array. The experimental results show that CUDA-DP can achieve a 45-fold acceleration compared with the central processing unit (CPU)-based density peaks implementation.
Keywords: density peak, graphics processing unit, parallel computing, clustering
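The array-of-structures to structure-of-arrays conversion mentioned in the abstract can be shown in a few lines. With SoA, consecutive GPU threads reading the same field touch consecutive memory addresses, which is what enables coalesced global memory access; this sketch only illustrates the data-layout transform, not the CUDA-DP code itself:

```python
# Array-of-structures -> structure-of-arrays: with SoA, thread i
# reading field x touches address base_x + i, so a warp's loads
# coalesce into few memory transactions.
def aos_to_soa(points):
    """points: list of (x, y, density) records (array of structures).
    Returns three parallel lists (structure of arrays)."""
    if not points:
        return [], [], []
    xs, ys, densities = zip(*points)
    return list(xs), list(ys), list(densities)

aos = [(0.0, 1.0, 5), (2.0, 3.0, 7), (4.0, 5.0, 2)]
xs, ys, rho = aos_to_soa(aos)
print(xs)  # all x-coordinates now contiguous: [0.0, 2.0, 4.0]
```

In the AoS layout the x-coordinates are strided by the record size, so the same warp-wide read would span three times as many cache lines.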
GPU-based numerical simulation of core shooting process (cited: 1)
Authors: Yi-zhong Zhang, Gao-chun Lu, Chang-jiang Ni, Tao Jing, Lin-long Yang, Qin-fang Wu. China Foundry (SCIE), 2017(5): 392-397 (6 pages)
The core shooting process is the most widely used technique for making sand cores and plays an important role in their quality. Although numerical simulation can hopefully optimize the core shooting process, research on its numerical simulation is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and achieves good agreement with in-situ experiments. To meet the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve calculation efficiency: a parallel algorithm based on the Compute Unified Device Architecture (CUDA) platform significantly decreases computing time via the multi-threaded GPU. The design and optimization of the parallel algorithm are discussed, and the accuracy of the calculations was ensured by comparison with in-situ experiments using a transparent core-box, a high-speed camera, and a pressure measuring system. The simulation of a sand core test-piece indicates the improvement in calculation efficiency brought by the GPU: the computing time of the parallel program was reduced by nearly 95% while the simulation results remained quite consistent with the experimental data. The GPU parallelization method successfully solves the low computational efficiency of the 3D sand shooting simulation program, making the developed program appropriate for engineering applications.
Keywords: graphics processing unit (GPU), compute unified device architecture (CUDA), parallelization, core shooting process
Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units (cited: 6)
Authors: Xiong QinGang, Li Bo, Xu Ji, Fang XiaoJian, Wang XiaoWei, Wang LiMin, He XianFeng, Ge Wei. Chinese Science Bulletin (SCIE, EI, CAS), 2012(7): 707-715 (9 pages)
Many-core processors, such as graphics processing units (GPUs), are promising platforms for intrinsically parallel algorithms such as the lattice Boltzmann method (LBM). Although tremendous speedup has been obtained on a single GPU compared with mainstream CPUs, the performance of the LBM on multiple GPUs has not been studied extensively and systematically. In this article, we carry out LBM simulation on a GPU cluster with many nodes, each having multiple Fermi GPUs. Asynchronous execution with CUDA stream functions, OpenMP, and non-blocking MPI communication are incorporated to improve efficiency. The algorithm is tested for two-dimensional Couette flow, and the results are in good agreement with the analytical solution. For both one- and two-dimensional decompositions of space, the algorithm performs well, as most of the communication time is hidden. Direct numerical simulation of a two-dimensional gas-solid suspension containing more than one million solid particles and one billion gas lattice cells demonstrates the potential of this algorithm for large-scale engineering applications. The algorithm can be directly extended to three-dimensional decompositions of space and to other modeling methods, including explicit grid-based methods.
Keywords: lattice Boltzmann method, graphics processing unit, parallel algorithm, cluster, Couette flow, LBM simulation, OpenMP, direct numerical simulation
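The decomposition-with-hidden-communication pattern this abstract relies on can be sketched in miniature. Here plain list copying stands in for the non-blocking MPI exchange of ghost layers; in the real code these copies overlap with interior lattice updates via CUDA streams:

```python
# Sketch of 1-D domain decomposition with halo (ghost-layer) exchange,
# the structure a multi-GPU LBM code overlaps with computation.
def decompose(n_cells, n_ranks):
    """Split n_cells lattice columns as evenly as possible."""
    base, extra = divmod(n_cells, n_ranks)
    return [base + (1 if r < extra else 0) for r in range(n_ranks)]

def exchange_halos(subdomains):
    """Each subdomain receives copies of its neighbours' edge columns
    (None at physical boundaries). Stands in for non-blocking MPI."""
    halos = []
    for r in range(len(subdomains)):
        left = subdomains[r - 1][-1] if r > 0 else None
        right = subdomains[r + 1][0] if r < len(subdomains) - 1 else None
        halos.append((left, right))
    return halos

parts = decompose(10, 3)                      # [4, 3, 3]
subs = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(exchange_halos(subs))                   # [(None, 5), (4, 8), (7, None)]
```

Because only the one-cell-deep edges cross rank boundaries, the communication volume scales with the surface of each subdomain while the computation scales with its volume, which is why most of the transfer time can be hidden.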
Fast modeling of gravity gradients from topographic surface data using GPU parallel algorithm (cited: 1)
Authors: Xuli Tan, Qingbin Wang, Jinkai Feng, Yan Huang, Ziyan Huang. Geodesy and Geodynamics (CSCD), 2021(4): 288-297 (10 pages)
The gravity gradient is a second derivative of the gravity potential, containing more high-frequency information about Earth's gravity field. Gravity gradient observations require deducting their prior and intrinsic parts to obtain more variational information. A model generated from a topographic surface database is more appropriate for representing gradiometric effects derived from near-surface mass, as other kinds of data can hardly reach the spatial resolution requirement. The rectangular prism method, namely an analytic integration of Newtonian potential integrals, is a reliable and commonly used approach to modeling the gravity gradient, but its computational efficiency is extremely low. A modified rectangular prism method and a graphics processing unit (GPU) parallel algorithm are proposed to speed up the modeling process. The modified method avoids massive redundant computations by reshaping the formulas according to the symmetries of the prisms' integration regions, and the proposed algorithm parallelizes this method's computing process. The parallel algorithm was compared with a conventional serial algorithm using 100 elevation datasets in two topographic areas (rough and moderate terrain). Modeling differences between the two algorithms were less than 0.1 E, attributable to the precision difference between single- and double-precision floating-point numbers. The parallel algorithm showed computational efficiency approximately 200 times higher than the serial algorithm in the experiments, demonstrating its effective acceleration of the modeling process. Further analysis indicates that both the modified method and the computational parallelism of the GPU contributed to the proposed algorithm's performance.
Keywords: gravity gradient, topographic surface data, rectangular prism method, parallel computation, graphics processing unit (GPU)
Optimizing photoacoustic image reconstruction using cross-platform parallel computation
Authors: Tri Vu, Yuehang Wang, Jun Xia. Visual Computing for Industry, Biomedicine, and Art, 2018(1): 12-17 (6 pages)
Three-dimensional (3D) image reconstruction involves computations over an extensive amount of data, which leads to tremendous processing time; therefore, optimization is crucially needed to improve performance and efficiency. With the widespread use of graphics processing units (GPUs), parallel computing is transforming this arduous reconstruction process for numerous imaging modalities, and photoacoustic computed tomography (PACT) is no exception. Existing works have investigated GPU-based optimization of photoacoustic microscopy (PAM) and PACT reconstruction using the compute unified device architecture (CUDA) on either C++ or MATLAB only. Our study, however, is the first to use cross-platform GPU computation: it maintains the simplicity of MATLAB while improving speed through CUDA/C++-based MATLAB converted functions called MEXCUDA. Compared with a purely MATLAB-with-GPU approach, our cross-platform method improves speed fivefold. Because MATLAB is widely used in PAM and PACT, this study will open up new avenues for photoacoustic image reconstruction and relevant real-time imaging applications.
Keywords: photoacoustic computed tomography, graphics processing units, parallel computation, focal-line backprojection algorithm, MATLAB, optical imaging
A Computational Comparison of Basis Updating Schemes for the Simplex Algorithm on a CPU-GPU System
Authors: Nikolaos Ploskas, Nikolaos Samaras. American Journal of Operations Research, 2013(6): 497-505 (9 pages)
The computation of the basis inverse is the most time-consuming step in simplex-type algorithms. This inverse does not have to be computed from scratch at every iteration; updating schemes can be applied to accelerate the calculation. In this paper, we perform a computational comparison in which the basis inverse is computed with five different updating schemes. We then propose a parallel implementation of two updating schemes on a CPU-GPU system using the MATLAB and CUDA environment. Finally, a computational study on randomly generated fully dense linear programs is presented to establish the practical value of the GPU-based implementation.
Keywords: simplex algorithm, basis inverse, graphics processing unit, MATLAB, compute unified device architecture
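One classic member of the family of updating schemes such papers compare is the Product Form of the Inverse (PFI): when an entering column replaces basis column q, the new inverse is an eta matrix times the old inverse, avoiding a from-scratch inversion. A hedged sketch (not necessarily one of the paper's five schemes) on a tiny example:

```python
# Product Form of the Inverse: replacing basis column q by column a
# updates B_inv as E @ B_inv, where the eta matrix E differs from the
# identity only in column q and is built from u = B_inv @ a.
def pfi_update(B_inv, a, q):
    n = len(B_inv)
    u = [sum(B_inv[i][k] * a[k] for k in range(n)) for i in range(n)]
    E = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):                      # fill eta column q
        E[i][q] = (1.0 / u[q]) if i == q else (-u[i] / u[q])
    return [[sum(E[i][k] * B_inv[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

B_inv = [[1.0, 0.0], [0.0, 1.0]]           # current basis = identity
new_inv = pfi_update(B_inv, a=[2.0, 1.0], q=0)
print(new_inv)  # inverse of [[2, 0], [1, 1]]
```

The update costs a matrix-vector product plus a rank-1-structured multiply, which is why it beats refactorization per iteration; its GPU appeal is that the dense `E @ B_inv` product maps directly onto parallel hardware.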
Numerical Simulation of the N-S Equations for Supersonic Flow Fields Based on CPU-GPU
Authors: Lu Zhiwei, Zhang Haoru, Liu Xiyao, Wang Yadong, Zhang Zhuokai, Zhang Jun'an. China Mechanical Engineering (Peking University Core), 2025(9): 1942-1950 (9 pages)
To analyze the characteristics of supersonic flow fields in depth and improve numerical efficiency, an efficient acceleration algorithm was designed. The algorithm fully exploits the heterogeneous central processing unit-graphics processing unit (CPU-GPU) parallel mode and implements data transfer and processing through asynchronous streams, significantly accelerating the numerical simulation of supersonic flow fields. The results show that GPU parallel computation is markedly faster than serial CPU computation, and the speedup grows notably as the flow-field grid size increases. GPU parallel computing can effectively raise the computation speed for supersonic flow fields, providing a powerful parallel computing method for the design, optimization, performance evaluation, and development of supersonic aircraft.
Keywords: supersonic flow field, central processing unit-graphics processing unit (CPU-GPU), heterogeneous computing, finite difference
Design and Implementation of a Compute-Network State-Aware Multi-Cluster GPU Computing Resource Scheduling Platform
Authors: Hu Yahui, Zhang Chenkang, Wang Yuelin, Hong Yuchen, Fan Pengfei, Song Junping, Zhou Xu. Journal on Communications (Peking University Core), 2025(10): 175-190 (16 pages)
To address the coarse resource granularity, the lack of a unified vGPU view, and the insufficient cross-cluster network awareness in multi-cluster GPU scheduling for large-scale deep learning tasks, a compute-network state-aware multi-cluster GPU computing-power scheduling platform is designed. The platform adopts a centralized architecture and achieves fine-grained global resource orchestration by sensing cross-cluster computing resources and network state in real time and scheduling them cooperatively. It first builds a multi-dimensional metric system covering the device, cluster, vGPU, and network layers, collecting key data such as core utilization, GPU memory, and bandwidth in real time; it then provides a node-level vGPU orchestration and deployment module that breaks through the "job-to-cluster" limitation to achieve precise "job-to-node" scheduling, improving GPU sharing efficiency and resource utilization. Experiments show that the platform enables real-time collection and visualization of multi-cluster vGPU and network information and, as validated with DDPG reinforcement learning and the BestFit algorithm, provides efficient resource management.
Keywords: multi-cluster, graphics processing unit, computing resources, compute-network state awareness, orchestration and scheduling
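The BestFit placement idea validated in the abstract can be sketched briefly: each vGPU request goes to the node whose free memory leaves the smallest remainder, tightening packing across clusters. The node names and capacities below are hypothetical:

```python
# Best-fit placement of vGPU memory requests onto cluster nodes:
# choose the node whose remaining free memory after placement is
# smallest, so large contiguous capacity is preserved for big jobs.
def best_fit(requests, nodes):
    """nodes: dict node_name -> free GPU memory (GiB); mutated in place.
    Returns a list of (request, chosen node or None)."""
    placements = []
    for req in requests:
        candidates = [n for n, free in nodes.items() if free >= req]
        if not candidates:
            placements.append((req, None))   # no node can host this request
            continue
        chosen = min(candidates, key=lambda n: nodes[n] - req)
        nodes[chosen] -= req
        placements.append((req, chosen))
    return placements

nodes = {"a100-node1": 40, "a100-node2": 24, "v100-node3": 16}
placements = best_fit([16, 24, 30], nodes)
print(placements)
```

A production scheduler such as the platform described would weigh network state and utilization alongside free memory; best-fit is just the packing heuristic at the core.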
Parallel Implementation of the Three-Dimensional Lattice Boltzmann Method on Multi-GPU Platforms
Authors: Xiang Xing, Sun Peijie, Zhang Huahai, Wang Limin. Frontiers of Data and Computing, 2025(5): 16-27 (12 pages)
[Objective] For large-scale scientific computing problems, the shift in computing paradigms has driven the development of general-purpose graphics processing units; in computational fluid dynamics, the emerging lattice Boltzmann method offers significant inherent advantages in computational efficiency and parallel scalability when coupled with advanced physical models. [Methods] Based on the standard D3Q19 lattice model, this study designs and optimizes a parallel algorithm for the three-dimensional lattice Boltzmann method, considering three-dimensional domain decomposition and distributed data communication. [Results] On a domestic heterogeneous accelerated computing platform, three-dimensional benchmark flow cases were numerically verified and accuracy-tested at different grid scales, achieving high-fidelity transient simulation and capturing the unsteady evolution of three-dimensional vortex structures at different times. In single-card performance tests at various grid scales, building on the correctness verification, the impact of the data communication part on parallel performance was discussed and the speedup of a single card over a single core was reported. In the strong/weak scalability tests, two control experiments, one card per node and four cards per node, were set up to study the difference between inter-node and intra-node communication. The one-card-per-node group reached a maximum grid size of about 2.15 billion cells using 128 accelerator cards on 128 nodes, with a runtime of 262.119 s, a parallel performance of 81.927 GLUPS (billion lattice-point updates per second; 1 GLUPS = 10^3 MLUPS), and a parallel efficiency of 94.76%; the four-cards-per-node group reached about 8.59 billion cells with 512 cards on 128 nodes, a parallel performance of 241.185 GLUPS, and a parallel efficiency of 69.71%. [Conclusions] The proposed parallel implementation exhibits linear speedup and good parallel scalability, demonstrating its potential for efficient simulation on exascale supercomputing systems.
Keywords: graphics processing unit, lattice Boltzmann method, scalability tests, large-scale parallel computing, three-dimensional Taylor-Green vortex flow
Current Status of GPU-Accelerated Satellite DSSS Telemetry Signal Demodulation Technology
Authors: Chen Qimin, Jiao Yiwen, Wu Tao, Li Xuejian, Feng Hao. Journal of Space Engineering University, 2025(5): 66-73 (8 pages)
To address the low processing efficiency in high-dynamic scenarios and the limited flexibility of traditional field programmable gate array (FPGA) platforms faced by direct sequence spread spectrum (DSSS) signal demodulation in satellite tracking, telemetry, and control, this paper analyzes and reviews the development status of graphics processing unit (GPU)-accelerated DSSS telemetry signal demodulation technology. Combining the GPU heterogeneous computing architecture with the Compute Unified Device Architecture (CUDA) programming model, it discusses the current status, shortcomings, and directions for improvement.
Keywords: direct sequence spread spectrum telemetry signal, acquisition and tracking, graphics processing unit, parallel computing, real-time demodulation
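At the heart of DSSS demodulation is code-phase acquisition: sliding a local PN-code replica over the received chips and picking the lag with the largest correlation. The sketch below is illustrative (a toy 7-chip code, not a real telemetry waveform); in GPU implementations each candidate lag is typically evaluated by its own thread, which is what makes the search parallel:

```python
# Sliding-correlation code-phase acquisition for a DSSS signal.
# On a GPU, the loop over candidate lags is the parallel dimension.
def acquire_code_phase(received, pn_code):
    """Return (lag, correlation) of the best circular alignment."""
    n = len(pn_code)
    best_lag, best_corr = 0, float("-inf")
    for lag in range(n):                       # one thread per lag on a GPU
        corr = sum(received[(lag + i) % n] * pn_code[i] for i in range(n))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr

pn = [1, -1, 1, 1, -1, -1, 1]                  # toy 7-chip +/-1 code
rx = pn[-3:] + pn[:-3]                         # received, delayed by 3 chips
lag, peak = acquire_code_phase(rx, pn)
print(lag, peak)                               # recovers the 3-chip delay
```

A real receiver would correlate noisy complex samples and follow acquisition with tracking loops, but the lag-search structure is the same.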
A Kirchhoff Approximation Integral Method for Target Acoustic Scattering Based on GPU Parallel Computing
Authors: Yang Chenxuan, An Junying, Sun Yang, Zhang Yi. Technical Acoustics (Peking University Core), 2025(4): 499-505 (7 pages)
To improve the computational efficiency of mid- and high-frequency acoustic scattering from underwater targets, this paper establishes a Kirchhoff approximation integral model for target acoustic scattering based on graphics processing unit (GPU) parallel computing. First, for both the constant-element model and the exact facet-integration model of the Kirchhoff approximation, a parallelization scheme based on GPU thread allocation is established to form parallel algorithm models. Then, taking a rigid sphere of radius 1 m as the target, the GPU parallel model is used to compute its scattering target strength, and the accuracy of the algorithm is verified by comparison with the analytical solution. Finally, taking the Benchmark model as the target, target strengths are simulated under different conditions and the speedup of the GPU parallel model is analyzed. The results show that the GPU parallel implementation of the constant-element model is 4-5 times faster than traditional serial computation, and that of the exact facet-integration model is 8-11 times faster. The GPU-based parallelization scheme clearly accelerates the Kirchhoff approximation integral computation of target scattering, and the GPU advantage becomes more pronounced as the number of facets increases.
Keywords: Kirchhoff approximation integral, graphics processing unit (GPU), parallel computing, target scattering
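The analytic check mentioned in the abstract has a well-known closed form in the high-frequency limit: for a rigid sphere with ka >> 1, the backscattering target strength is TS = 20 log10(a/2), giving about -6 dB for the 1 m sphere. A minimal sketch of that reference value (the formula is the standard geometric-scattering limit, not the paper's exact derivation):

```python
# High-frequency (geometric-scattering) target strength of a rigid
# sphere, TS = 20*log10(a/2) dB re 1 m, the usual analytic benchmark
# for validating Kirchhoff-approximation scattering codes.
import math

def sphere_target_strength(radius_m):
    """Backscattering target strength (dB) of a rigid sphere, ka >> 1."""
    return 20 * math.log10(radius_m / 2)

ts = sphere_target_strength(1.0)
print(round(ts, 2))  # about -6.02 dB for a 1 m sphere
```

At lower ka the exact modal (Rayleigh series) solution oscillates around this value, so agreement with -6 dB is only expected in the mid-to-high-frequency regime the paper targets.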
High-Performance Cholesky Factorization on New Tensor-Core-Based GPU Architectures
Authors: Shi Lu, Zou Gaoyuan, Wu Siqi, Zhang Shaoshuai. Computer Engineering & Science (Peking University Core), 2025(7): 1170-1180 (11 pages)
Dense matrix multiplications (GEMMs) can be highly optimized on Tensor Cores. However, existing Cholesky factorization implementations cannot reach most of the Tensor Cores' peak performance because of their limited parallelism. This study uses a recursive Cholesky factorization algorithm that, by recursively subdividing the diagonal blocks, converts the original symmetric rank-k update (SYRK) and triangular solve (TRSM) operations into a large number of general matrix multiplications (GEMMs), thereby exploiting the Tensor Cores' peak performance more fully. Experimental results show that the proposed recursive Cholesky algorithm achieves speedups of 1.72x in FP32 and 1.62x in FP16 over the MAGMA/cuSOLVER implementation.
Keywords: Cholesky factorization, high-performance computing, numerical linear algebra, general-purpose graphics processing unit (GPGPU)
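The recursive structure this abstract describes can be sketched in pure Python (nested lists; a didactic sketch, not the paper's GPU code): the diagonal block recurses, the off-diagonal block is a TRSM, and the Schur-complement update is the SYRK that, once blocks are subdivided, becomes the dense products Tensor Cores execute efficiently:

```python
# Recursive blocked Cholesky: A = L @ L^T with L lower-triangular.
# The TRSM and SYRK steps below are the operations that the paper
# reshapes into GEMMs by recursively splitting the diagonal block.
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def forward_solve(L, b):
    """Solve L @ x = b for lower-triangular L."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i])
    return x

def cholesky(A):
    n = len(A)
    if n == 1:
        return [[math.sqrt(A[0][0])]]
    k = n // 2
    A11 = [row[:k] for row in A[:k]]
    A21 = [row[:k] for row in A[k:]]
    A22 = [row[k:] for row in A[k:]]
    L11 = cholesky(A11)                           # recurse on diagonal block
    L21 = [forward_solve(L11, b) for b in A21]    # TRSM: L21 @ L11^T = A21
    L21T = [list(c) for c in zip(*L21)]
    P = matmul(L21, L21T)                         # SYRK, realized as a GEMM
    S = [[A22[i][j] - P[i][j] for j in range(n - k)] for i in range(n - k)]
    L22 = cholesky(S)                             # recurse on Schur complement
    top = [row + [0.0] * (n - k) for row in L11]
    bottom = [L21[i] + L22[i] for i in range(n - k)]
    return top + bottom

L = cholesky([[4.0, 2.0], [2.0, 5.0]])
print(L)  # [[2.0, 0.0], [1.0, 2.0]]
```

Production codes stop the recursion at a block size tuned to the Tensor Core tile shape and hand the large off-diagonal updates to batched GEMM kernels.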