
A hardware offloading structure and method for ultra-long vector reduction operations based on direct memory access and a dynamic shared buffer
Abstract: MPI (Message Passing Interface) collective communication improves system performance by organizing multiple processes across multiple compute nodes to cooperatively complete a series of communication operations. Among these, reduction operations on ultra-long operand vectors are widely used in high-performance computing and AI (Artificial Intelligence) workloads. This paper proposes a hardware offloading structure and method for ultra-long vector reduction based on DMA (Direct Memory Access) and a dynamic shared buffer. A dedicated hardware communication-sequence trigger mechanism controls the offloading flow of the collective communication; a DMA transfer protocol improves the efficiency of moving reduction operands between software and hardware; an on-chip dynamic shared buffer storage structure provides flexible and efficient buffering of large numbers of operands; and an on-chip ALU (Arithmetic Logic Unit) array performs the computation directly in the network chip. Experimental results show clear speedups over both non-offloaded MPI and the original Tianhe offloading method, and the speedup grows markedly as the reduction vector gets longer.
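The abstract gives no implementation details, so the following is only an illustrative software model of the general idea, not the paper's hardware design. It assumes an element-wise sum as the reduction operator and reduces long operand vectors one block-sized chunk at a time, borrowing the accumulator for each chunk from a shared pool of fixed-size blocks. The names `SharedBufferPool` and `reduce_sum_chunked`, the block size, and the pool policy are all hypothetical.

```python
from typing import List, Sequence


class SharedBufferPool:
    """Hypothetical pool of fixed-size blocks shared by all chunks (an assumption)."""

    def __init__(self, num_blocks: int, block_len: int) -> None:
        self.block_len = block_len
        self._free = [[0.0] * block_len for _ in range(num_blocks)]
        self._in_use = 0
        self.peak_in_use = 0  # highest number of blocks held at once

    def acquire(self) -> List[float]:
        if not self._free:
            raise RuntimeError("no free buffer block")
        self._in_use += 1
        self.peak_in_use = max(self.peak_in_use, self._in_use)
        return self._free.pop()

    def release(self, block: List[float]) -> None:
        self._in_use -= 1
        self._free.append(block)


def reduce_sum_chunked(
    vectors: Sequence[Sequence[float]], pool: SharedBufferPool
) -> List[float]:
    """Element-wise sum of equal-length vectors, one block-sized chunk at a time."""
    if not vectors:
        return []
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all operand vectors must have the same length")

    result: List[float] = []
    for start in range(0, n, pool.block_len):
        width = min(pool.block_len, n - start)
        acc = pool.acquire()  # block that holds the running partial result
        for i in range(width):
            acc[i] = 0.0  # clear stale data left by a previous chunk
        for v in vectors:  # each pass stands in for one operand transfer
            for i in range(width):
                acc[i] += v[start + i]
        result.extend(acc[:width])
        pool.release(acc)  # block becomes available to the next chunk
    return result


pool = SharedBufferPool(num_blocks=2, block_len=4)
operands = [[1.0] * 10, [2.0] * 10, [3.0] * 10]
total = reduce_sum_chunked(operands, pool)
```

In this model the vector length is bounded only by host memory, while the buffer footprint stays at one block per in-flight chunk; a real offload engine would overlap transfers and arithmetic across several chunks, which this sequential sketch does not attempt.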
Authors: XU Jinbo; DAI Yi; JIAN Jie (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
Source: Computer Engineering & Science, 2025, No. 4, pp. 571-581 (11 pages); listed in the Peking University Core Journals index
Funding: National Defense Science and Technology Key Laboratory Fund (2022-KJWPDL-11); Independent Innovation Science Fund (22-ZZCX-002)
Keywords: collective communication; reduce; direct memory access; dynamic shared buffer; hardware offloading
