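Tensor transposition is a basic tensor primitive widely used in signal processing, scientific computing, and deep learning, and it plays an important role in tensor-data-intensive applications and high-performance computing. As energy efficiency becomes increasingly important in high-performance computing systems, accelerators based on digital signal processors (DSPs) have been integrated into general-purpose computing systems. However, conventional tensor transposition libraries designed for multi-core CPUs and GPUs cannot be adapted well to DSPs because of architectural differences. On the one hand, the vectorization potential of the DSP architecture has not been fully exploited; on the other hand, its complex on-chip memory system and multi-level shared memory structure pose significant challenges for parallel tensor programming. Targeting the architectural features of a domestically developed multi-core DSP, this work proposes the ftmTT algorithm and designs and implements a general-purpose tensor transposition library for multi-core DSP architectures. ftmTT exploits the parallelism and vectorization potential of the DSP through efficient memory access patterns adapted to its architecture. Its core innovations are: 1) a blocking strategy that converts high-dimensional tensor transposition into the matrix-transpose kernel operations provided by the multi-core DSP platform; 2) a scheme based on DMA point-to-point transfers that coalesces memory accesses to tensor data blocks in order to reduce data movement overhead; and 3) a double-buffering design that asynchronously overlaps transposition computation with DMA transfers so that communication is hidden behind computation, ultimately achieving high-performance parallel tensor transposition on multi-core DSPs. Experiments on the domestically developed multi-core DSP platform FT-M7032 show that ftmTT achieves up to 75.96% of the theoretical bandwidth and reaches 99.23% of the STREAM bandwidth of the FT-M7032 platform.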
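The blocking idea in innovation 1) can be illustrated with a minimal sketch. This is not the ftmTT implementation: the function names `transpose_tile` and `blocked_transpose_021`, the tile size, and the choice of a (B, M, N) to (B, N, M) permutation are all assumptions made for illustration, and the DMA coalescing and double buffering described in the abstract are not modeled. The sketch only shows how a higher-dimensional permutation can be decomposed into many small 2D matrix transposes, each of which a platform kernel could handle.

```python
def transpose_tile(src, dst, rows, cols, src_ld, dst_ld, src_off, dst_off):
    """Hypothetical stand-in for a platform matrix-transpose kernel.

    Transposes a rows x cols tile read from `src` (leading dimension src_ld)
    into a cols x rows tile written to `dst` (leading dimension dst_ld).
    """
    for r in range(rows):
        for c in range(cols):
            dst[dst_off + c * dst_ld + r] = src[src_off + r * src_ld + c]


def blocked_transpose_021(src, shape, tile=4):
    """Permute a flat row-major (B, M, N) tensor to (B, N, M), tile by tile."""
    b, m, n = shape
    dst = [None] * (b * m * n)
    for batch in range(b):
        base = batch * m * n  # each batch slice is an independent M x N matrix
        for r0 in range(0, m, tile):
            for c0 in range(0, n, tile):
                rows = min(tile, m - r0)  # edge tiles may be smaller
                cols = min(tile, n - c0)
                transpose_tile(
                    src, dst, rows, cols,
                    src_ld=n, dst_ld=m,
                    src_off=base + r0 * n + c0,
                    dst_off=base + c0 * m + r0,
                )
    return dst


if __name__ == "__main__":
    shape = (2, 5, 7)
    data = list(range(2 * 5 * 7))
    out = blocked_transpose_021(data, shape, tile=4)
    # Element (batch, r, c) of the input must land at (batch, c, r) in the output.
    print(out[0 * 35 + 3 * 5 + 2] == data[0 * 35 + 2 * 7 + 3])
```

In a real multi-core setting the tiles are independent, so they could be distributed across cores; the tile size would presumably be chosen to fit on-chip memory, which this sketch does not attempt to model.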
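A SiP microsystem is a highly integrated system that may contain one or more DSPs, NOR Flash and DDR memory, AI accelerator chips, and other components; some complex microsystems also integrate FPGA chips. Because multiple micro-components are integrated and the chips are interconnected, conventional methods for testing a single micro-component are not suitable for testing microsystems. A test method for DSP micro-components is proposed. The test system comprises a dedicated test board, a debuggable PC-based test environment, and JTAG communication. Compared with testing a single bare DSP die, it enables fast and stable performance testing of DSP micro-components and meets the needs of high-volume production testing.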
As circuit scale continues to grow, power consumption receives far more attention than before, especially in large designs. This paper describes experience in optimizing the power consumption of the 16-bit datapath in a 32-bit reconfigurable pipelined digital signal processor (DSP). Holding the previous input values prevents useless switching of the logic blocks on the datapath, which substantially lowers power consumption. In addition, relocating some logic blocks between pipeline stages and adding data-forwarding logic yields a better-balanced pipeline, which lowers the power consumption of conditional computation instructions at very low timing and area cost. Experimental results demonstrate the effectiveness of these power optimization techniques. Finally, some ideas for reducing circuit power consumption are proposed that are effective and useful in practical designs, especially pipelined ones.
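The effect of holding old input values can be illustrated with a rough behavioral sketch. It is not taken from the paper: the functions `toggles` and `count_input_toggles`, the enable signal, and the sample operand stream are hypothetical, and counting bit toggles at the input of a logic block is used here only as a crude proxy for dynamic switching activity.

```python
def toggles(prev, new, width=16):
    """Number of bit positions that differ between two width-bit values."""
    mask = (1 << width) - 1
    return bin((prev ^ new) & mask).count("1")


def count_input_toggles(stream, hold_when_idle):
    """Count input-bit toggles seen by a logic block over a cycle stream.

    stream: iterable of (enable, operand) pairs, one per clock cycle.
    hold_when_idle: if True, the block input keeps its old value whenever
    enable is 0, so idle cycles cause no switching at the input.
    """
    latched = 0  # value currently presented to the logic block
    total = 0
    for enable, operand in stream:
        if enable or not hold_when_idle:
            total += toggles(latched, operand)
            latched = operand
        # otherwise the old input value is kept and nothing switches
    return total


if __name__ == "__main__":
    # The block's result is only needed in the first and last cycle (enable = 1).
    cycles = [(1, 0x00FF), (0, 0xFF00), (0, 0x0F0F), (1, 0x00FF)]
    print(count_input_toggles(cycles, hold_when_idle=False))  # 40
    print(count_input_toggles(cycles, hold_when_idle=True))   # 8
```

In this toy stream the unguarded input sees 40 bit toggles, while the held input sees only 8, because the two idle cycles no longer propagate operand changes into the block. Actual savings depend on the circuit and workload and are not predicted by this sketch.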