Based on the structure and bootloader principle of a DSP system, a method for remotely updating, debugging, and self-loading the DSP system over Ethernet is developed. The hardware circuit and the DM9000 driver program are presented, and the TCP/IP protocol is embedded into the DSP. By remapping the external memory address, program sections can be loaded selectively and the DSP can boot itself. The experimental results show that the method resolves both the high system-maintenance cost of conventional field debugging with an emulator and the limitations of chip-level boot.
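Tensor transposition is a basic tensor operation primitive that is widely used in signal processing, scientific computing, deep learning, and other fields, and it plays an important role in tensor-data-intensive applications and high-performance computing. As energy efficiency becomes increasingly important in high-performance computing systems, accelerators based on digital signal processors (DSPs) have been integrated into general-purpose computing systems. However, traditional tensor transposition libraries designed for multi-core CPUs and GPUs cannot fully adapt to DSP architectures because of architectural differences. On the one hand, the vectorization potential of DSP architectures has not been fully exploited; on the other hand, their complex on-chip memory system and multi-level shared memory structure pose significant challenges for parallel tensor program design. Targeting the architectural characteristics of domestically developed multi-core DSPs, the ftmTT algorithm is proposed, and a general tensor transposition library for multi-core DSP architectures is designed and implemented. ftmTT exploits the parallelization and vectorization potential of the DSP by designing efficient memory access patterns suited to its architecture. Its core innovations are: 1) a blocking strategy that converts high-dimensional tensor transposition into the matrix transposition kernel operations provided by the multi-core DSP platform; 2) a memory-access coalescing scheme for tensor data blocks, based on DMA point-to-point transfers, that reduces data movement overhead; and 3) a double-buffering design that asynchronously overlaps transposition computation with DMA transfers to hide communication behind computation, ultimately achieving high-performance parallel tensor transposition on multi-core DSPs. Experiments on the domestically developed multi-core DSP platform FT-M7032 show that ftmTT reaches up to 75.96% of the theoretical bandwidth and 99.23% of the STREAM bandwidth of the FT-M7032 platform.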
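The blocking idea in point 1) can be illustrated with a minimal Python sketch. This is not the ftmTT implementation: the function names, the tile size, and the choice of a (batch, rows, cols) tensor are assumptions made for illustration, and the DMA transfers and double buffering from points 2) and 3) are left out. It only shows how a higher-dimensional permutation can be reduced to repeated tile-level matrix transposes.

```python
def transpose_tile(src, dst, rows, cols, r0, c0, tile):
    """Transpose one tile of a row-major rows x cols matrix into dst (cols x rows).

    On a real DSP this step would be the platform's matrix-transpose kernel;
    here it is a plain loop (an illustrative stand-in).
    """
    for r in range(r0, min(r0 + tile, rows)):
        for c in range(c0, min(c0 + tile, cols)):
            dst[c * rows + r] = src[r * cols + c]


def blocked_transpose_2d(src, rows, cols, tile=4):
    """Transpose a row-major matrix by visiting it tile by tile."""
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            transpose_tile(src, dst, rows, cols, r0, c0, tile)
    return dst


def transpose_last_two_axes(src, batch, rows, cols, tile=4):
    """Permute a row-major (batch, rows, cols) tensor to (batch, cols, rows).

    The 3-D permutation is reduced to `batch` independent 2-D blocked
    transposes, which is the kind of reduction a blocking strategy relies on.
    """
    size = rows * cols
    out = []
    for b in range(batch):
        matrix = src[b * size:(b + 1) * size]
        out.extend(blocked_transpose_2d(matrix, rows, cols, tile))
    return out


# A 2 x 3 matrix [[0, 1, 2], [3, 4, 5]] becomes the 3 x 2 matrix [[0, 3], [1, 4], [2, 5]].
print(blocked_transpose_2d([0, 1, 2, 3, 4, 5], rows=2, cols=3, tile=2))  # [0, 3, 1, 4, 2, 5]
```

Because each tile and each batch slice is independent, the work could in principle be distributed across cores, with one tile being transposed while the next is fetched; that scheduling is outside the scope of this sketch.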
This paper presents techniques and approaches for building a real-time JPEG2000 compression system using DSP chips. We propose a three-DSP real-time parallel processing system that uses efficient memory management for the discrete wavelet transform (DWT) and a parallel-pass architecture for embedded block coding with optimized truncation (EBCOT). The system compresses 1392×1040-pixel monochrome images from 2 digital still cameras at 10 fps per camera and is shown to be a practical and efficient DSP solution.
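To put the stated figures in perspective, the aggregate pixel rate can be worked out directly. The short calculation below assumes that "10 fps/camera" applies to each of the 2 cameras simultaneously; the variable names are illustrative, and the bit depth is not given in the abstract, so no byte rate is derived.

```python
WIDTH, HEIGHT = 1392, 1040   # image size from the abstract, in pixels
FPS_PER_CAMERA = 10          # frame rate per camera from the abstract
CAMERAS = 2                  # number of cameras from the abstract

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS_PER_CAMERA * CAMERAS

print(pixels_per_frame)    # 1447680
print(pixels_per_second)   # 28953600
```

Under that reading, the three DSPs together would need to sustain roughly 29 million pixels per second through the DWT and EBCOT stages.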
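A SiP microsystem is a highly integrated system that may contain one or more DSPs, NOR Flash and DDR memories, AI accelerator chips, and other components; some complex microsystems also integrate FPGA chips. Because multiple micro-components are integrated internally and the chips are interconnected, traditional methods for testing a single micro-component are not suitable for testing a microsystem. A test method for DSP micro-components is proposed; the test system comprises a dedicated test board, a debuggable PC test environment, and JTAG communication. Compared with testing a single bare DSP die, it can test the performance of DSP micro-components quickly and stably, meeting the needs of high-volume production testing.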