1,015 articles found
GAPS: GPU-accelerated processing service for SM9
1
Authors: Wenhan Xu, Hui Ma, Rui Zhang 《Cybersecurity》 2025, Issue 4, pp. 270-287 (18 pages)
SM9 was established in 2016 as a Chinese official identity-based cryptographic (IBC) standard and became an ISO standard in 2021. It is well known that IBC is suitable for Internet of Things (IoT) applications, since centralized processing of client data (e.g. IoT cloud) is often done by gateways. However, due to limited computation resources inside IoT devices, the performance of SM9 becomes a bottleneck in practical usage. The existing SM9 implementations are often CPU-based, with relatively low latency and low throughput. Consequently, a pivotal challenge for SM9 in large-scale applications is how to reduce the latency while maximizing throughput for numerous concurrent inputs. After a systematic analysis of the SM9 algorithms, we apply optimization techniques including precomputation, resource caching and parallelization to reduce the overhead of SM9. In this work, we introduce the first practical implementation of SM9 and its underlying SM9_P256 curve on GPU. Our GPU implementation combines multiple algorithms and low-level optimizations tailored for the GPU's single instruction, multiple threads architecture in order to achieve high throughput for SM9. Based on these, we propose GAPS, a high-performance Cryptography as a Service (CaaS) for SM9. GAPS adopts a heterogeneous computing architecture that flexibly schedules the inputs across two implementation platforms: a CPU for the low-latency processing of sporadic inputs, and a GPU for the high-throughput processing of batch inputs. According to our benchmark, GAPS only takes a few milliseconds to process a single SM9 request in idle mode. Moreover, when operating in its batch processing mode, GAPS can generate 2,038,071 private keys, 248,239 signatures or 238,001 ciphertexts per second. The results show that GAPS scales seamlessly across inputs of different sizes, preliminarily demonstrating the efficacy of our solution.
Keywords: Identity-based cryptography; SM9; Cryptography as a service; graphics processing units
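Numerical Simulation of the N-S Equations for Supersonic Flow Fields Based on CPU-GPU
2
Authors: 卢志伟, 张皓茹, 刘锡尧, 王亚东, 张卓凯, 张君安 《中国机械工程》 PKU Core 2025, Issue 9, pp. 1942-1950 (9 pages)
To analyze the characteristics of supersonic flow fields in depth and improve the efficiency of numerical computation, an efficient acceleration algorithm was designed. The algorithm makes full use of the central processing unit-graphics processing unit (CPU-GPU) heterogeneous parallel mode and uses asynchronous streams for data transfer and processing, which significantly accelerates the numerical simulation of supersonic flow fields. The results show that GPU parallel computation is markedly faster than CPU serial computation, and that the speedup increases markedly as the flow-field grid size grows. GPU parallel computing can effectively raise the computation speed for supersonic flow fields and provides a powerful parallel computing method for the design, optimization, performance evaluation, and development of supersonic vehicles.
Keywords: supersonic flow field; central processing unit-graphics processing unit; heterogeneous computing; finite difference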
A graphics processing unit-based robust numerical model for solute transport driven by torrential flow condition (Cited by: 1)
3
Authors: Jing-ming HOU, Bao-shan SHI, Qiu-hua LIANG, Yu TONG, Yong-de KANG, Zhao-an ZHANG, Gang-gang BAI, Xu-jun GAO, Xiao YANG 《Journal of Zhejiang University-Science A (Applied Physics & Engineering)》 SCIE EI CAS CSCD 2021, Issue 10, pp. 835-850 (16 pages)
Solute transport simulations are important in water pollution events. This paper introduces a finite volume Godunov-type model for solving a 4×4 matrix form of the hyperbolic conservation laws consisting of 2D shallow water equations and transport equations. The model adopts the Harten-Lax-van Leer-contact (HLLC) approximate Riemann solution to calculate the cell interface fluxes. It can deal well with the changes in the dry and wet interfaces in an actual complex terrain, and it has a strong shock-wave capturing ability. Using monotonic upstream-centred scheme for conservation laws (MUSCL) linear reconstruction with finite slope and the Runge-Kutta time integration method can achieve second-order accuracy. At the same time, the introduction of graphics processing unit (GPU)-accelerated computing technology greatly increases the computing speed. The model is validated against multiple benchmarks, and the results are in good agreement with analytical solutions and other published numerical predictions. The third test case uses the GPU and central processing unit (CPU) calculation models, which take 3.865 s and 13.865 s, respectively, indicating that the GPU calculation model can increase the calculation speed by 3.6 times. In the fourth test case, comparing the numerical model calculated by GPU with the traditional numerical model calculated by CPU, the calculation efficiencies of the GPU model under different grid resolutions are 9.8 to 44.6 times higher than those of the CPU model. Therefore, it has better potential than previous models for large-scale simulation of solute transport in water pollution incidents. It can provide a reliable theoretical basis and strong data support in the rapid assessment and early warning of water pollution accidents.
Keywords: Solute transport; Shallow water equations; Godunov-type scheme; Harten-Lax-van Leer-contact (HLLC) Riemann solver; graphics processing unit (GPU) acceleration technology; Torrential flow
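CPWS: A Checkpoint-Based Multi-Level Warp Scheduler for GPGPUs
4
Authors: 姜泽坤, 原博, 崔剑峰, 黄立波, 常俊胜, 刘胜 《计算机工程与科学》 PKU Core 2025, Issue 9, pp. 1563-1570 (8 pages)
General-purpose graphics processing units (GPGPUs) use the single instruction, multiple threads (SIMT) model, which lets a large number of threads execute the same instruction simultaneously and thereby significantly improves computational efficiency. In the SIMT model, a GPGPU organizes a group of threads into a logical execution unit called a warp. Because the hardware must time-multiplex among multiple warps, warp scheduling is key to efficient parallel computing. By adding a new checkpoint instruction, a checkpoint-based multi-level warp scheduler, CPWS, was designed and implemented. CPWS tracks the execution progress of each warp and dynamically adjusts its scheduling policy according to that progress, with low overall hardware overhead. Experiments show that CPWS improves performance by 11% over the greedy scheduler (GTO), by 16.7% over loose round-robin (LRR) scheduling, and by 10.6% over two-level round-robin scheduling. In addition, FPGA synthesis results show that CPWS adds only 0.8% logic-unit overhead compared with GTO.
Keywords: general-purpose graphics processing unit; checkpoint; warp scheduler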
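The two baseline policies named in the abstract above can be illustrated with a toy model. This is only a sketch of how GTO (usually expanded as greedy-then-oldest) and LRR are commonly described, not the CPWS design, whose checkpoint logic the abstract does not detail. The function names, the `ready` set, and the `age_order` list are hypothetical.

```python
def pick_lrr(ready, last_issued, num_warps):
    """Loose round-robin: first ready warp after the one issued last."""
    for offset in range(1, num_warps + 1):
        warp = (last_issued + offset) % num_warps
        if warp in ready:
            return warp
    return None  # no warp can issue this cycle


def pick_gto(ready, last_issued, age_order):
    """Greedy-then-oldest: keep the current warp while it is ready,
    otherwise fall back to the oldest ready warp."""
    if last_issued in ready:
        return last_issued
    for warp in age_order:  # warp ids listed from oldest to youngest
        if warp in ready:
            return warp
    return None


# Hypothetical cycle: warps 1 and 3 are ready, warp 1 issued last.
ready = {1, 3}
lrr_choice = pick_lrr(ready, last_issued=1, num_warps=4)
gto_choice = pick_gto(ready, last_issued=1, age_order=[0, 1, 2, 3])
```

In this toy cycle LRR moves on to the next ready warp while GTO stays with the warp it issued last, which is the behavioural difference a progress-aware scheduler would have to arbitrate between.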
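GPU-Based Parallel Implementation of the M-ary Despreading Algorithm for OMCSS Underwater Acoustic Communication
5
Authors: 彭海源, 王巍, 李德瑞, 刘彦君, 李宇, 迟骋, 田亚男 《系统工程与电子技术》 PKU Core 2025, Issue 3, pp. 978-986 (9 pages)
To meet the need for fast processing of received signals in orthogonal multi-carrier spread spectrum (OMCSS) underwater acoustic communication systems, a parallel implementation of the M-ary despreading algorithm based on the graphic processing unit (GPU) is proposed. First, the feasibility of implementing the M-ary despreading algorithm on a GPU platform is analyzed, and the basic arithmetic units inside the algorithm are optimized for parallel execution. Then, to further increase GPU parallel speed, an M-ary parallel despreading computing architecture based on concurrent kernel execution is designed. The algorithm's performance is tested on a central processing unit (CPU) + GPU heterogeneous platform. The test results show that the designed M-ary parallel despreading algorithm improves running speed by up to 90.47% over the M-ary serial despreading algorithm, with a maximum speedup of 10.5.
Keywords: orthogonal multi-carrier spread spectrum; underwater acoustic communication; M-ary despreading; graphic processing unit; parallel implementation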
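The two figures quoted at the end of the abstract above appear to describe the same measurement. The sketch below assumes that the 90.47% improvement is a fractional reduction in run time, which the abstract does not state explicitly, and shows how such a reduction maps to a speedup. The function names are illustrative.

```python
def speedup_from_time_reduction(reduction):
    """T_serial / T_parallel for a fractional cut in run time."""
    return 1.0 / (1.0 - reduction)


def time_reduction_from_speedup(speedup):
    """Fractional cut in run time implied by a speedup factor."""
    return 1.0 - 1.0 / speedup


implied_speedup = speedup_from_time_reduction(0.9047)   # about 10.49
implied_reduction = time_reduction_from_speedup(10.5)   # about 0.9048
```

Under that reading the two numbers agree to within rounding.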
TIME-DOMAIN INTERPOLATION ON GRAPHICS PROCESSING UNIT (Cited by: 1)
6
Authors: XIQI LI, GUOHUA SHI, YUDONG ZHANG 《Journal of Innovative Optical Health Sciences》 SCIE EI CAS 2011, Issue 1, pp. 89-95 (7 pages)
The signal processing speed of spectral domain optical coherence tomography (SD-OCT) has become a bottleneck in many medical applications. Recently, a time-domain interpolation method was proposed. Compared with the commonly used zero-padding interpolation method, it achieves a better signal-to-noise ratio (SNR) and a much shorter signal processing time in SD-OCT data processing. Additionally, the resampled data can be obtained from a few data points and coefficients in the cutoff window, so many interpolations can be performed simultaneously and the method is well suited to parallel computing. By using a graphics processing unit (GPU) and the compute unified device architecture (CUDA) programming model, time-domain interpolation can be accelerated significantly. The computing capability reaches more than 250,000, 200,000, and 160,000 A-lines per second for 2,048-pixel OCT when the cutoff length is L=11, L=21, and L=31, respectively. A frame of SD-OCT data (400 A-lines × 2,048 pixels per line) is acquired and processed on the GPU in real time. The results show that signal processing of the SD-OCT data can be finished in 6.223 ms when the cutoff length is L=21, which is much faster than on a central processing unit (CPU). Real-time signal processing of acquired data can be realized.
Keywords: Optical coherence tomography; real-time signal processing; graphics processing unit (GPU); CUDA
The inversion of density structure by graphic processing unit (GPU) and identification of igneous rocks in Xisha area (Cited by: 1)
7
Authors: Lei Yu, Jian Zhang, Wei Lin, Rongqiang Wei, Shiguo Wu 《Earthquake Science》 2014, Issue 1, pp. 117-125 (9 pages)
Organic reefs, the targets of deep-water petroleum exploration, developed widely in the Xisha area. However, there are concealed igneous rocks under the sea, to which organic rocks have nearly equal wave impedance. The igneous rocks therefore interfere with future exploration because they have similar seismic reflection characteristics. Yet the density and magnetism of organic reefs are very different from those of igneous rocks, so gravity and magnetic data have obvious advantages for distinguishing organic reefs from igneous rocks. First, frequency decomposition was applied to the free-air gravity anomaly in the Xisha area to obtain the 2D subdivision of the gravity anomaly and magnetic anomaly in the vertical direction. Thus, the distribution of igneous rocks in the horizontal direction can be acquired according to the high-frequency field, the low-frequency field, and their physical properties. Then, 3D forward modeling of the gravitational field was carried out to establish the density model of this area by reference to the physical properties of rocks from former research. Furthermore, 3D inversion of the gravity anomaly by a genetic algorithm with graphic processing unit (GPU) parallel processing was applied in the Xisha target area, and the 3D density structure of this area was obtained. In this way, the igneous rocks can be confined to a certain depth according to their density. The frequency decomposition and the GPU-parallel genetic-algorithm 3D inversion of the gravity anomaly proved to be a useful method for locating igneous rocks in their 3D geological position. Organic reefs and igneous rocks can therefore be identified, which provides advance information for further exploration.
Keywords: Xisha area; Organic reefs and igneous rocks; Frequency decomposition of potential field; 3D inversion of the graphic processing unit (GPU) parallel processing
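Design and Implementation of a Multi-Cluster GPU Computing Resource Scheduling Platform Based on Computing-Network State Awareness
8
Authors: 胡亚辉, 张宸康, 王越嶙, 洪雨琛, 范鹏飞, 宋俊平, 周旭 《通信学报》 PKU Core 2025, Issue 10, pp. 175-190 (16 pages)
To address coarse resource granularity, the lack of a unified vGPU view, and insufficient cross-cluster network awareness in multi-cluster GPU scheduling for large-scale deep learning tasks, a multi-cluster GPU computing-power scheduling platform with computing-network state awareness is designed. The platform adopts a centralized architecture; by sensing cross-cluster computing resources and network state in real time and scheduling them jointly, it achieves fine-grained global resource orchestration and scheduling. The platform first builds a multi-dimensional metric system covering the device, cluster, vGPU, and network layers and collects key data such as core utilization, GPU memory, and bandwidth in real time. It then provides a node-level vGPU orchestration and deployment module that goes beyond "job-to-cluster" placement to achieve precise "job-to-node" scheduling, improving GPU sharing efficiency and resource utilization. Experiments show that the platform collects and visualizes multi-cluster vGPU and network information in real time and, as verified with DDPG reinforcement learning and the BestFit algorithm, manages resources efficiently.
Keywords: multi-cluster; graphics processing unit; computing resources; computing-network state awareness; orchestration and scheduling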
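The BestFit policy mentioned in the abstract above can be sketched in a few lines. This is the generic best-fit heuristic under a simplifying assumption that free vGPU memory is the only resource considered. The platform's actual resource model and metric weighting are not given in the abstract, and the node names and sizes below are made up for illustration.

```python
def best_fit(free_memory_mib, request_mib):
    """Return the node that fits the request with the least memory left over,
    or None when no node has enough free vGPU memory."""
    candidates = [
        (free - request_mib, name)
        for name, free in free_memory_mib.items()
        if free >= request_mib
    ]
    if not candidates:
        return None
    return min(candidates)[1]  # smallest leftover wins; ties break by name


# Hypothetical free vGPU memory per node, in MiB.
free_memory = {"node-a": 16000, "node-b": 6000, "node-c": 10000}
chosen = best_fit(free_memory, 5000)
```

Placing the job on the tightest-fitting node keeps the larger nodes free for larger requests, which is the usual argument for best fit in "job-to-node" placement.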
Complex hexagonal close-packed dendritic growth during alloy solidification by graphics processing unit-accelerated three-dimensional phase-field simulations: demo for Mg–Gd alloy
9
Authors: Sheng-Lan Yang, Jing Zhong, Kai Wang, Xun Kang, Jian-Bao Gao, Jiong Wang, Qian Li, Li-Jun Zhang 《Rare Metals》 SCIE EI CAS CSCD 2023, Issue 10, pp. 3468-3484 (17 pages)
In this study, insights into the effect of interfacial anisotropy on complex hexagonal close-packed (hcp) dendritic growth during alloy solidification were gained by graphics processing unit (GPU)-accelerated three-dimensional (3D) phase-field simulations, as demonstrated for a Mg-Gd alloy. An anisotropic phase-field model with finite interface dissipation was developed by incorporating the contribution of the anisotropy of interfacial energy into the total free energy functional. The modified spherical harmonic anisotropy function was then chosen for the hcp crystal. The GPU parallel computing algorithm was implemented in the present phase-field model, and a corresponding code was developed on the compute unified device architecture parallel computing platform. Benchmark tests indicated that the calculation efficiency of a single TESLA V100 GPU could be ~80 times that of open multi-processing (OpenMP) with eight central processing unit cores. By coupling the phase-field model with reliable thermodynamic and interfacial energy descriptions, the 3D phase-field simulation of α-Mg dendritic growth in the Mg-6Gd (in wt%) alloy during solidification was performed. Various two-dimensional dendrite morphologies were revealed by cutting the simulated 3D dendrite along different crystallographic planes. Typical sixfold equiaxed and butterflied microstructures observed in experiments were well reproduced.
Keywords: Interfacial anisotropy; Dendrite solidification; Phase-field model; graphics processing unit (GPU); Mg–Gd
Compute Unified Device Architecture Implementation of Euler/Navier-Stokes Solver on Graphics Processing Unit Desktop Platform for 2-D Compressible Flows
10
Authors: Zhang Jiale, Chen Hongquan 《Transactions of Nanjing University of Aeronautics and Astronautics》 EI CSCD 2016, Issue 5, pp. 536-545 (10 pages)
A personal desktop platform with teraflops peak performance and thousands of cores can be realized at the price of a conventional workstation using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows by using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The techniques of implementing CUDA kernels, a double-layered thread hierarchy and a varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated by a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
Keywords: graphics processing unit (GPU); GPU parallel computing; compute unified device architecture (CUDA) Fortran; finite volume method (FVM); acceleration
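CUDA Design and Implementation of Sub-Aperture Imaging with the Fast BP Algorithm for High-Orbit SAR on NVIDIA GPUs
11
Authors: 雷苏力, 苏翔, 杨娟娟, 高阳, 向天舜, 党红杏 《空间电子技术》 2025, Issue 3, pp. 54-59 (6 pages)
The back-projection (BP) imaging algorithm is a classic time-domain imaging algorithm for synthetic aperture radar (SAR); it can accommodate spaceborne SAR imaging with long synthetic aperture times, wide swaths, curved trajectories, and very large data volumes. The improved fast BP algorithm (FFBP) applies the BP algorithm to sub-aperture imaging of SAR echoes, which effectively reduces the computational load. Even so, the huge computational load of FFBP still makes it hard to meet timeliness requirements in engineering practice. This article uses the graphics processing unit (GPU) as a coprocessor to the CPU and proposes a CUDA implementation of sub-aperture imaging based on the FFBP algorithm. Streams are used to hide the latency of block-wise echo data transfer while avoiding frequent process switching, and ultra-fine-grained threads are designed to achieve massive GPU concurrency for sub-aperture FFBP imaging. Verification shows that, when this CUDA solution is used for FFBP sub-aperture imaging of a high-orbit SAR satellite, device execution efficiency exceeds 90% and the speedup over a 32-thread concurrent CPU program is 120 times.
Keywords: high-orbit SAR; fast back-projection (FFBP) imaging algorithm; graphics processing unit (GPU)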
Bypass-Enabled Thread Compaction for Divergent Control Flow in Graphics Processing Units
12
Authors: LI Bingchao, WEI Jizeng, GUO Wei, SUN Jizhou 《Journal of Shanghai Jiaotong University (Science)》 EI 2021, Issue 2, pp. 245-256 (12 pages)
Graphics processing units (GPUs) employ single instruction multiple data (SIMD) hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow. Threads running concurrently within a warp may jump to different paths after conditional branches. Such divergent control flow makes some lanes idle and hence reduces the SIMD utilization of GPUs. To alleviate the waste of SIMD lanes, threads from multiple warps can be collected together to improve SIMD lane utilization by compacting threads into idle lanes. However, this mechanism induces extra barrier synchronizations, since warps have to be stalled to wait for other warps for compactions, with the result that no warps are scheduled in some cases. In this paper, we propose an approach to reduce the overhead of barrier synchronizations induced by compactions. In our approach, a compaction is bypassed by warps whose threads all jump to the same path after branches. Moreover, warps waiting for a compaction can also bypass this compaction when no warps are ready for issuing. In addition, a compaction is canceled if idle lanes cannot be reduced via this compaction. The experimental results demonstrate that our approach provides an average improvement of 21% over the baseline GPU for applications with massive divergent branches, while recovering the performance loss induced by compactions by 13% on average for applications with many non-divergent control flows.
Keywords: graphics processing unit (GPU); single instruction multiple data (SIMD); thread; warps; bypass
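Current Status of GPU-Accelerated Demodulation Technology for Satellite DSSS Telemetry Signals
13
Authors: 陈其敏, 焦义文, 吴涛, 李雪健, 冯浩 《航天工程大学学报》 2025, Issue 5, pp. 66-73 (8 pages)
To address problems in demodulating direct sequence spread spectrum (DSSS) signals in satellite telemetry, tracking, and command, such as low processing efficiency in high-dynamic scenarios and the limited flexibility and adaptability of traditional field programmable gate array (FPGA) platforms, the development status of graphics processing unit (GPU)-accelerated DSSS telemetry signal demodulation is analyzed and studied. Drawing on the GPU heterogeneous computing architecture and the compute unified device architecture (CUDA) programming model, the paper discusses the current state of the technology, its shortcomings, and directions for improvement.
Keywords: direct sequence spread spectrum telemetry signal; acquisition and tracking; graphics processing unit; parallel computing; real-time demodulation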
Simulation of fluid-structure interaction in a microchannel using the lattice Boltzmann method and size-dependent beam element on a graphics processing unit
14
Authors: Vahid Esfahanian, Esmaeil Dehdashti, Amir Mehdi Dehrouye-Semnani 《Chinese Physics B》 SCIE EI CAS CSCD 2014, Issue 8, pp. 389-395 (7 pages)
Fluid-structure interaction (FSI) problems in microchannels play a prominent role in many engineering applications. The present study is an effort toward the simulation of flow in a microchannel considering FSI. The bottom boundary of the microchannel is simulated by size-dependent beam elements for the finite element method (FEM) based on a modified couple stress theory. The lattice Boltzmann method (LBM) using the D2Q13 LB model is coupled to the FEM in order to solve the fluid part of the FSI problem. Because the LBM generally needs only nearest-neighbor information, the algorithm is an ideal candidate for parallel computing. The simulations are carried out on graphics processing units (GPUs) using the compute unified device architecture (CUDA). In the present study, the governing equations are non-dimensionalized and the set of dimensionless groups is exhibited to show their effects on micro-beam displacement. The numerical results show that the displacements of the micro-beam predicted by the size-dependent beam element are smaller than those predicted by the classical beam element.
Keywords: fluid-structure interaction; graphics processing unit; lattice Boltzmann method; size-dependent beam element
Multi-relaxation-time lattice Boltzmann simulations of lid driven flows using graphics processing unit
15
Authors: Chenggong LI, J.P.Y. MAA 《Applied Mathematics and Mechanics (English Edition)》 SCIE EI CSCD 2017, Issue 5, pp. 707-722 (16 pages)
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine velocity components (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computation efficiency of the LBM in numerical simulations of turbulent flows, the massively parallel computing power of a graphic processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with the results from others, with an increase of 76 times in computation efficiency. It appears that the higher the Reynolds number is, the smaller the Smagorinsky constant should be, if the lattice number is fixed. Also, for a selected high Reynolds number and a selected proper Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
Keywords: large eddy simulation (LES); multi-relaxation-time (MRT); lattice Boltzmann equation (LBE); two-dimensional nine velocity components (D2Q9); Smagorinsky model; graphic processing unit (GPU); compute unified device architecture (CUDA)
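For readers unfamiliar with the closure named in the abstract above, the standard Smagorinsky eddy viscosity and the usual lattice-unit link between viscosity and relaxation time can be written as follows. This is the textbook form, assuming lattice units ($\delta x = \delta t = 1$, lattice sound speed squared equal to $1/3$), filter width $\Delta$, Smagorinsky constant $C_s$, and molecular viscosity $\nu_0$. The abstract does not say which MRT relaxation rates the authors tie to $\tau$, so that mapping is left out.

```latex
\nu_t = (C_s \Delta)^2 \,\lvert \bar{S} \rvert, \qquad
\lvert \bar{S} \rvert = \sqrt{2\,\bar{S}_{ij}\bar{S}_{ij}}, \qquad
\nu_{\mathrm{total}} = \nu_0 + \nu_t, \qquad
\tau = 3\,\nu_{\mathrm{total}} + \tfrac{1}{2}
```

Since $\nu_t$ scales with $C_s^2$, the choice of Smagorinsky constant directly sets how much eddy viscosity is added on a given lattice.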
Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program
16
Authors: 赵夏, 马胜, 陈微, 王志英 《Journal of Shanghai Jiaotong University (Science)》 EI 2016, Issue 3, pp. 280-288 (9 pages)
Simulation is an important means of evaluating the performance of computer architectures. Nowadays, the serial simulation of general purpose graphics processing unit (GPGPU) architecture is the main bottleneck for simulation speed. To address this issue, we propose intra-kernel parallelization on a multicore processor and inter-kernel parallelization on a multiple-machine platform. We apply these two methods to the GPGPU-sim simulator. The intra-kernel parallelization method first parallelizes the serial simulation of multiple compute units in one cycle. Then it parallelizes the timing and functional simulation to reduce the performance loss caused by the synchronization between different compute units. The inter-kernel parallelization method divides multiple kernels of a CUDA program into several groups and distributes these groups across multiple simulation hosts to perform the simulation. Experimental results show that the intra-kernel parallelization method achieves a speed-up of up to 12 with a maximum error rate of 0.0094% on a 32-core machine, and the inter-kernel parallelization method can accelerate the simulation by a factor of up to 3.9 with a maximum error rate of 0.11% on four simulation hosts. The orthogonality between these two methods allows us to combine them on multiple multi-core hosts to get further performance improvements.
Keywords: general purpose graphics processing unit (GPGPU); multicore; intra-kernel; inter-kernel; parallel
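A GPU Implementation of a Sub-Aperture CS Stripmap SAR Imaging Algorithm
17
Authors: 雷迪, 张晓滨, 黄安陈 《计算机与数字工程》 2025, Issue 7, pp. 1823-1828 (6 pages)
Synthetic aperture radar (SAR) imaging is widely used in remote sensing observation, navigation and positioning, and other fields because it is unaffected by environmental interference and performs stably, but large data volumes and long run times of the imaging pipeline remain persistent problems in SAR imaging algorithms. This paper proposes a graphics processing unit (GPU)-accelerated sub-aperture chirp scaling (CS) stripmap SAR imaging algorithm. The scheme divides the full aperture into multiple sub-apertures, performs CS imaging on the radar data of each sub-aperture separately, and stitches and fuses the sub-aperture imaging results on the GPU. Because the large number of serial floating-point operations in the CS algorithm, such as the fast Fourier transform (FFT) and phase-factor multiplication, are executed in parallel on the GPU, the algorithm can effectively reduce the GPU computational load and shorten the run time of the imaging algorithm. Experiments show that, by parallelizing the floating-point computation on dense data on the GPU, the core steps of the sub-aperture CS algorithm run tens of times faster than when processed on the CPU.
Keywords: sub-aperture; graphics processing unit; chirp scaling algorithm; fast Fourier transform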
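A Kirchhoff Approximation Integral Method for Target Acoustic Scattering Based on GPU Parallel Computing
18
Authors: 杨晨轩, 安俊英, 孙阳, 张毅 《声学技术》 PKU Core 2025, Issue 4, pp. 499-505 (7 pages)
To improve the computational efficiency of mid- and high-frequency acoustic scattering from underwater targets, this article establishes a Kirchhoff approximation integral model of target acoustic scattering based on graphics processing unit (GPU) parallel computing. First, for the constant-element model and the exact facet-integration model of the Kirchhoff approximation integral method, a parallelization scheme based on GPU thread allocation is established, yielding algorithm models that can be computed in parallel. Then, taking a rigid sphere with a radius of 1 m as the target, the GPU parallel model is used to compute its acoustic scattering target strength, and the accuracy of the algorithm is verified by comparison with the analytical solution. Finally, taking the Benchmark model as the target, the acoustic scattering target strength is computed by simulation under different conditions, and the speedup of the GPU parallel model is compared and analyzed. The results show that GPU parallel computation improves efficiency by a factor of 4 to 5 over traditional serial computation for the constant-element model and by a factor of 8 to 11 for the exact facet-integration model. The GPU-based parallelization clearly accelerates the Kirchhoff approximation integral method for target acoustic scattering, and the GPU advantage becomes more pronounced as the number of facets increases.
Keywords: Kirchhoff approximation integral; graphics processing unit (GPU); parallel computing; target scattering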
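Parallel Implementation of the Three-Dimensional Lattice Boltzmann Method on Multi-GPU Platforms
19
Authors: 向星, 孙培杰, 张华海, 王利民 《数据与计算发展前沿(中英文)》 2025, Issue 5, pp. 16-27 (12 pages)
[Objective] For large-scale scientific computing problems, shifts in the computing paradigm have driven the development of general-purpose graphics processing units, and the lattice Boltzmann method, an emerging approach in computational fluid dynamics, has significant inherent advantages in computational efficiency and parallel scalability when coupled with advanced physical models. [Methods] Based on the standard D3Q19 lattice model, and considering three-dimensional domain decomposition and distributed data communication, this study designs and optimizes a parallel algorithm for the three-dimensional lattice Boltzmann method. [Results] On a domestic heterogeneous accelerated computing platform, numerical validation and accuracy tests were performed on a three-dimensional benchmark flow case at different grid sizes, achieving high-fidelity transient simulation and capturing the unsteady evolution of three-dimensional vortex structures at different times. In single-card performance tests at different grid sizes, after correctness was verified, the effect of data communication on parallel performance was discussed and the speedup of a single card over a single core was given. In the strong and weak scalability tests, two controlled groups of numerical experiments, one card per node and four cards per node, were set up to study the difference between inter-node and intra-node data communication. The one-card-per-node group reached a maximum grid size of about 2.15 billion using 128 accelerator cards on 128 nodes, with a run time of 262.119 s, a parallel performance of 81.927 GLUPS (giga lattice updates per second, 1 GLUPS = 10^3 MLUPS), and a parallel efficiency of 94.76%. The four-cards-per-node group reached a maximum grid size of about 8.59 billion using 512 accelerator cards on 128 nodes, with a parallel performance of 241.185 GLUPS and a parallel efficiency of 69.71%. [Conclusions] The proposed parallel implementation shows linear speedup and good parallel scalability, demonstrating its potential for efficient simulation on exascale supercomputing systems.
Keywords: graphics processing unit; lattice Boltzmann method; scalability test; large-scale parallel computing; three-dimensional Taylor-Green vortex flow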
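The performance figures in the abstract above can be cross-checked with a short calculation. The sketch assumes that parallel efficiency is defined as total throughput divided by (number of cards × single-card throughput); the abstract does not state the baseline, so the per-card rate below is inferred, not reported. Function names are illustrative.

```python
def glups(lattice_nodes, time_steps, seconds):
    """Giga lattice updates per second for a run of the given size."""
    return lattice_nodes * time_steps / seconds / 1e9


def implied_per_card_glups(total_glups, cards, efficiency):
    """Single-card rate implied by efficiency = total / (cards * baseline)."""
    return total_glups / (cards * efficiency)


# Figures quoted in the abstract.
base_from_128 = implied_per_card_glups(81.927, 128, 0.9476)
base_from_512 = implied_per_card_glups(241.185, 512, 0.6971)

# Made-up example of the metric itself: 1e9 nodes, 100 steps, 50 s.
example_rate = glups(1e9, 100, 50)
```

Both configurations imply nearly the same single-card rate of roughly 0.675 GLUPS, which suggests the two efficiencies were measured against a common single-card baseline.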
NGP-ERGAS: Revisit Instant Neural Graphics Primitives with the Relative Dimensionless Global Error in Synthesis
20
Authors: Dongheng Ye, Heping Li, Ning An, Jian Cheng, Liang Wang 《Computers, Materials & Continua》 2025, Issue 8, pp. 3731-3747 (17 pages)
The newly emerging neural radiance fields (NeRF) methods can implicitly fulfill three-dimensional (3D) reconstruction by training a neural network to render novel-view images of a given scene from given posed images. The Instant Neural Graphics Primitives (Instant-NGP) method further improves the position encoding of NeRF and obtains state-of-the-art efficiency. However, only a local pixel-wise loss is considered when training Instant-NGP, overlooking the nonlocal structural information between pixels. Despite a good quantitative result, this leads to a poor visual effect, especially in completeness. Inspired by the stochastic structural similarity (S3IM) method, which exploits nonlocal structural information of groups of pixels, this paper proposes a new method to improve the completeness of fast novel view synthesis. The proposed method first extends the thread-wise processing of Instant-NGP to processing in a custom thread block (i.e., a group of threads). Then, the relative dimensionless global error in synthesis, i.e., Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), of a group of pixels corresponding to a group of threads is computed and incorporated into the loss function. Extensive experiments validate the proposed method. It can obtain better quantitative results than the original Instant-NGP with fewer iteration steps. PSNR is increased by 1%. Strong qualitative results are obtained, especially for delicate structures and details such as lines and continuous structures. With the dramatic improvements in visual effects, our method can boost the practicability of implicit 3D reconstruction in applications such as self-driving and augmented reality.
Keywords: Neural radiance fields; novel view synthesis; 3D reconstruction; graphic processing unit
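The ERGAS metric named in the abstract above has a standard definition from the image-fusion literature: 100 × (resolution ratio) × the square root of the mean over channels of (RMSE / channel mean)². The sketch below computes it for one group of pixels in plain Python. It assumes a resolution ratio of 1, since rendered and reference pixels share a resolution; how the paper samples pixel groups or weights the term in its loss is not stated in the abstract. The data layout is hypothetical.

```python
import math


def ergas(reference, rendered, ratio=1.0):
    """ERGAS for one pixel group.

    reference, rendered: lists of channels, each a flat list of pixel values.
    ratio: resolution ratio between the two images (assumed 1.0 here).
    """
    total = 0.0
    for ref_channel, out_channel in zip(reference, rendered):
        n = len(ref_channel)
        mse = sum((r - o) ** 2 for r, o in zip(ref_channel, out_channel)) / n
        mean = sum(ref_channel) / n
        total += mse / mean ** 2  # squared relative RMSE of this channel
    return 100.0 * ratio * math.sqrt(total / len(reference))


# Tiny single-channel example with a 10% error on each pixel.
score = ergas([[1.0, 1.0]], [[0.9, 1.1]])
```

Because each channel's error is normalized by that channel's mean over the whole group, the value depends on the group as a unit and not on any single pixel, which matches the nonlocal motivation given in the abstract.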