A hardware offloading structure and method for ultra-long vector reduction operations based on direct memory access and a dynamic shared buffer

Abstract: MPI (Message Passing Interface) collective communication improves system performance by organizing multiple processes across multiple compute nodes to jointly complete a series of communication operations. Among these operations, reductions over ultra-long operand vectors are widely used in high-performance computing and AI (artificial intelligence) workloads. This paper proposes a hardware offloading structure and method for ultra-long vector reduction based on DMA (direct memory access) and a dynamic shared buffer. A dedicated hardware communication-sequence trigger mechanism controls the offloaded collective communication flow. A DMA transfer protocol improves the efficiency of moving reduction operands between software and hardware. An on-chip dynamic shared buffer provides flexible, efficient caching for large numbers of operands. An on-chip ALU (arithmetic logic unit) array performs the computation directly in the network chip. Experimental results show clear speedups over both non-offloaded MPI and the original Tianhe offloading method, and the speedup grows markedly as the reduction vector gets longer.
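As a rough illustration of what a reduction over long operand vectors computes, the sketch below combines equal-length vectors element by element, working through them one bounded chunk at a time. It is a software analogy only, not the hardware design described in the paper: the function name `chunked_reduce`, the `chunk_len` parameter, and the staging list `buf` are all hypothetical, chosen here to mimic the idea of a limited on-chip buffer feeding an arithmetic unit.

```python
import operator
from typing import Callable, List, Sequence


def chunked_reduce(
    vectors: Sequence[Sequence[float]],
    op: Callable[[float, float], float] = operator.add,
    chunk_len: int = 4,
) -> List[float]:
    """Element-wise reduce equal-length vectors, one chunk at a time.

    Hypothetical helper: chunk_len stands in for a bounded buffer size.
    """
    if not vectors:
        raise ValueError("need at least one operand vector")
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all operand vectors must have the same length")

    result: List[float] = []
    for start in range(0, n, chunk_len):
        end = min(start + chunk_len, n)
        # Stage the first operand's chunk in a bounded buffer.
        buf = list(vectors[0][start:end])
        # Fold the matching chunk of every other operand into the buffer.
        for v in vectors[1:]:
            for i, x in enumerate(v[start:end]):
                buf[i] = op(buf[i], x)
        # The chunk is fully reduced; emit it and reuse the buffer space.
        result.extend(buf)
    return result
```

For example, `chunked_reduce([[1, 2, 3], [10, 20, 30]], chunk_len=2)` sums the two vectors in two passes, and passing `op=max` yields an element-wise maximum instead.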

Authors: XU Jinbo; DAI Yi; JIAN Jie (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

Affiliation: College of Computer Science and Technology, National University of Defense Technology

Source: Computer Engineering & Science (PKU Core Journal), 2025, No. 4, pp. 571-581 (11 pages)

Funding: National Defense Science and Technology Key Laboratory Fund (2022-KJWPDL-11); Independent Innovation Science Fund (22-ZZCX-002).

Keywords: collective communication; reduce; direct memory access; dynamic shared buffer; hardware offloading

Classification: TP302.2 [Automation and Computer Technology: Computer System Architecture]
