Parallelism and Research on Functions with Continuously Independent Data and Intensive Memory Access Using OpenCL

Original title: 基于OpenCL的连续数据无关访存密集型函数并行与优化研究 (cited by: 1)


Abstract: "Continuously data-independent" means that, when consecutive elements of the destination matrix are computed, the source-matrix elements they use are also consecutive and have no dependencies among them. A memory-access-intensive function is one that performs little computation but a large amount of data transfer. Taking the bitwise function as an example, this paper studies and implements the parallelization and optimization of continuously data-independent, memory-access-intensive functions on GPU platforms under the OpenCL framework. After examining how several optimizations, including vectorization, thread organization, and instruction selection, affect performance on different GPU hardware platforms, the paper achieves performance portability of the function across platforms. Experimental results show that, excluding data transfer time, the optimized function achieves an average speedup over the CPU version in the OpenCV library of 40x on the AMD Radeon HD 5850, 90x on the AMD Radeon HD 7970, and 60x on the NVIDIA Tesla C2050. Compared with the CUDA implementation in the OpenCV library, it also achieves a 1.5x speedup on the NVIDIA Tesla C2050.
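The data-independence property described in the abstract is what makes vectorization safe: each output element depends only on the input elements at the same position, so contiguous elements can be grouped and processed together. The following is a minimal CPU-side sketch in plain Python, not the paper's OpenCL kernel. The function names and the `width` parameter are hypothetical, chosen for illustration. It compares a one-element-per-step bitwise AND with a version that packs `width` contiguous bytes into one integer and combines them with a single AND, loosely analogous to a work-item handling a `uchar4` instead of a `uchar`.

```python
def bitwise_and_scalar(src1: bytes, src2: bytes) -> bytes:
    """One element per step: dst[i] = src1[i] & src2[i]."""
    if len(src1) != len(src2):
        raise ValueError("inputs must have the same length")
    return bytes(a & b for a, b in zip(src1, src2))


def bitwise_and_packed(src1: bytes, src2: bytes, width: int = 4) -> bytes:
    """`width` contiguous elements per step, combined with a single AND."""
    if len(src1) != len(src2):
        raise ValueError("inputs must have the same length")
    out = bytearray()
    for start in range(0, len(src1), width):
        chunk1 = src1[start:start + width]
        chunk2 = src2[start:start + width]
        # Pack the contiguous bytes into one integer, AND once, then unpack.
        # The last chunk may be shorter than `width`; its length is preserved.
        word = int.from_bytes(chunk1, "little") & int.from_bytes(chunk2, "little")
        out += word.to_bytes(len(chunk1), "little")
    return bytes(out)
```

Because no output element depends on any other, both functions return identical results for any `width`. On a GPU the packed form mainly reduces the number of memory transactions per element, which matters for a function whose cost is dominated by memory access rather than arithmetic.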

Authors: Jiang Liyuan (蒋丽媛), Zhang Yunquan (张云泉), Long Guoping (龙国平), Jia Haipeng (贾海鹏)

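Affiliations: Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences; Graduate University of Chinese Academy of Sciences; State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; College of Information Science and Engineering, Ocean University of China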

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 3, pp. 111-115 (5 pages)

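Funding: National Natural Science Foundation of China (60303020, 60533020); National Natural Science Foundation of China, Key Program (60503020); National Natural Science Foundation of China, Young Scientists Fund (61100072); National "863" Program of China (2012AA010902); ISCAS-AMD Joint Fusion Software Center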

Keywords: GPU, OpenCL, vectorization, ROI

Classification: TP302 [Automation and Computer Technology: Computer System Architecture]

