Implementation and Optimization of a High-Performance FFT Algorithm Library Based on GPU

Abstract

[Objective] This study aims to address the computational and memory-access bottlenecks in GPU-based FFT implementations, to optimize the performance of FFT algorithm libraries on GPUs, and to fill the gap in high-performance FFT algorithm libraries for domestic GPUs.

[Methods] Three optimization strategies are adopted. First, leveraging the GPU's parallel computing advantages and the mathematical properties of the DFT, we develop a block-processing and hierarchical computation scheme that optimizes the computational flow of the FFT butterfly operations. Second, targeting the GPU's hierarchical memory architecture and memory access patterns, we propose a novel bit-reversal-free butterfly network structure that combines breadth-first and depth-first approaches. By optimizing memory access patterns and improving data scheduling, this strategy reduces memory access conflicts and makes better use of the GPU cache structures and shared memory while minimizing global memory accesses. Third, for multi-batch data, we apply shared memory and block-processing strategies that let the GPU run multiple FFT tasks in parallel within the same computational cycle.

[Results] Compared with the CPU-based FFTW library, the speedup exceeds 2 on large-scale data. Compared with the industry-leading open-source clFFT library at batch sizes of 128 and 256, respectively, PerfFFT achieves average speedups of 1.47 and 1.58 for power-of-two input sizes and 3.58 and 4.07 for non-power-of-two input sizes on small-scale data. On large-scale data, the average speedups are 2.04 and 2.38 for power-of-two input sizes and 5.39 and 5.28 for non-power-of-two input sizes.

[Conclusion] The experimental results show that GPUs substantially outperform CPUs on large-scale data and that the PerfFFT library outperforms the clFFT library across input sizes, which verifies the effectiveness of the proposed optimization strategies. These findings provide a reference for achieving high-performance FFT implementations on GPUs.
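To make the idea of a bit-reversal-free butterfly network concrete, the following is a minimal sketch of the classic Stockham autosort radix-2 FFT, a well-known textbook scheme that avoids the bit-reversal permutation by ping-ponging between two buffers. It is not the hybrid breadth-first/depth-first network proposed in the paper, whose details the abstract does not give, and it runs on the CPU in plain Python purely for illustration. The function name `stockham_fft` is hypothetical.

```python
import cmath


def stockham_fft(x):
    """Forward DFT of a power-of-two-length sequence, Stockham autosort form.

    Illustrative assumption: this is the textbook Stockham scheme, not the
    butterfly network used by PerfFFT. Output is in natural order, so no
    bit-reversal pass is needed.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")

    src = [complex(v) for v in x]  # buffer read in the current stage
    dst = [0j] * n                 # buffer written in the current stage
    half = n // 2                  # half the current sub-transform length
    stride = 1                     # number of interleaved sub-transforms

    while half >= 1:
        for p in range(half):
            # Twiddle factor for the current sub-transform length 2 * half.
            w = cmath.exp(-2j * cmath.pi * p / (2 * half))
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                # Results are written to reordered positions, which is what
                # removes the need for a separate bit-reversal step.
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src        # ping-pong the two buffers
        half //= 2
        stride *= 2

    return src


if __name__ == "__main__":
    print(stockham_fft([0, 1, 2, 3]))
```

Every stage reads one buffer and writes the other with a regular stride. That property is why autosort-style formulations are often considered for GPU kernels, where the two buffers could live in shared memory. Whether PerfFFT does this is not stated in the abstract.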

Authors: DU Zhenpeng; XU Jianliang; ZHANG Xianyi; HUANG Qiang (Ocean University of China, Qingdao, Shandong 266000, China; PerfXLab Technologies Co., Ltd., Beijing 100080, China)

Affiliations: Ocean University of China; PerfXLab Technologies Co., Ltd.

Source: Frontiers of Data & Computing (数据与计算发展前沿(中英文)), 2025, No. 6, pp. 124-135 (12 pages)

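Funding: Research on Parallel Software Performance Optimization Methods Based on a High-Performance Computing Knowledge Graph (CARCHA202113).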

Keywords: fast Fourier transform (FFT); graphics processing unit (GPU); Open Computing Language (OpenCL); parallel computing

Classification: TP332 [Automation and Computer Technology: Computer System Architecture]
