The Least Mean Square (LMS) adaptive filtering algorithm is widely used in noise processing and other fields; it automatically adjusts the filter coefficients based on the filter's results in order to optimize the filtered output. Starting from a basic serial LMS adaptive filtering algorithm, we propose a vectorized parallel processing scheme. By combining the characteristics of the algorithm's processing flow with the parallel technologies available on vector Digital Signal Processors (DSPs), we study optimizations such as loop fusion, double-word access, and vector shuffling in depth, and apply loop unrolling to accelerate the computation further. Experiments were conducted on the high-performance FT-M7002 DSP platform. The results show that, compared with the LMS adaptive filtering routine in Texas Instruments (TI)'s dsplib library running on the TMS320C6678 processor, the proposed optimized algorithm achieves a maximum speedup of 6.9× for medium-scale data. The merged memory access optimization implemented on the GPU platform achieves an average 1.5× speedup compared to the basic parallel scheme.
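For orientation, the basic serial LMS loop that the abstract takes as its starting point can be sketched as follows. This is a minimal textbook version, not the authors' FT-M7002 implementation: the function name `lms_filter`, its parameters, and the step size `mu` are assumptions made for illustration, and none of the vectorization, loop fusion, or unrolling optimizations are modeled.

```python
def lms_filter(x, d, num_taps, mu):
    """Textbook serial LMS (illustrative only).

    x: input samples, d: desired samples, mu: step size.
    Returns (outputs, errors, final_weights).
    """
    w = [0.0] * num_taps      # filter coefficients, initialized to zero
    buf = [0.0] * num_taps    # delay line, newest sample first
    outputs, errors = [], []
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]                            # shift in the new sample
        y_n = sum(wi * xi for wi, xi in zip(w, buf))      # FIR output
        e_n = d_n - y_n                                   # error against desired signal
        w = [wi + mu * e_n * xi for wi, xi in zip(w, buf)]  # coefficient update
        outputs.append(y_n)
        errors.append(e_n)
    return outputs, errors, w
```

The FIR output and the coefficient update both walk the same delay line, which is the kind of pair of loops that loop fusion and vector instructions would target.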
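Tensor transposition is a fundamental tensor operation primitive that is widely used in signal processing, scientific computing, deep learning, and other fields, and it plays an important role in tensor-data-intensive applications and high-performance computing. As energy efficiency becomes increasingly important in high-performance computing systems, accelerators based on digital signal processors (DSPs) have been integrated into general-purpose computing systems. However, conventional tensor transposition libraries designed for multi-core CPUs and GPUs cannot fully adapt to DSP architectures because of architectural differences. On the one hand, the vectorized computing potential of DSP architectures has not been fully exploited; on the other hand, their complex on-chip memory hierarchy and multi-level shared memory structure pose significant challenges for parallel tensor programming. Targeting the architectural characteristics of domestic (Chinese) multi-core DSPs, this work proposes the ftmTT algorithm and designs and implements a general-purpose tensor transposition library for multi-core DSP architectures. ftmTT exploits the parallelization and vectorization potential of the DSP by designing efficient memory access patterns suited to its architecture. Its core innovations are: 1) a blocking strategy that converts high-dimensional tensor transposition into the matrix transposition kernel operations provided by the multi-core DSP platform; 2) a memory access coalescing scheme for tensor data blocks, based on DMA point-to-point transfers, that reduces data movement overhead; 3) a double-buffering design that asynchronously overlaps transposition computation with DMA transfers to hide communication behind computation, ultimately achieving high-performance parallel tensor transposition on multi-core DSPs. Experiments on the domestic multi-core DSP platform FT-M7032 show that the ftmTT tensor transposition algorithm achieves up to 75.96% of the theoretical bandwidth and 99.23% of the STREAM bandwidth of the FT-M7032 platform.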
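The blocking idea in item 1) can be illustrated with a tiled matrix transposition. The sketch below is an assumption-laden illustration rather than ftmTT itself: `transpose_blocked` and the `tile` parameter are hypothetical names, the matrix is a flat row-major list, and the DMA transfers and double buffering of items 2) and 3) are not modeled.

```python
def transpose_blocked(src, rows, cols, tile=4):
    """Transpose a row-major rows x cols matrix one tile at a time.

    Illustrative only: on a DSP each tile would be moved into on-chip
    memory, transposed by a kernel, and written back; here the tile is
    simply copied element by element.
    """
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, tile):          # tile origin, row direction
        for c0 in range(0, cols, tile):      # tile origin, column direction
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    # element (r, c) of src becomes element (c, r) of dst
                    dst[c * rows + r] = src[r * cols + c]
    return dst
```

Working tile by tile keeps each step's reads and writes inside a small region, which is what makes it possible to stage tiles in fast on-chip memory on hardware that has it.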
To compensate for the deficiencies of Sine Pulse Width Modulation (SPWM), this paper analyzes the principle of Space Vector Pulse Width Modulation (SVPWM), compares it with SPWM, and introduces the methods for solving the working time of adjacent vectors and for generating the space voltage vector. Experiments on an IGBT-based inverter show that the SVPWM control algorithm effectively reduces harmonics and improves the utilization of the DC power supply of the voltage source inverter.
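The working-time calculation for the adjacent vectors can be sketched with the standard textbook SVPWM relations for a two-level inverter, in which the active vectors have magnitude 2/3 of the DC-link voltage. Whether the paper uses exactly this formulation is an assumption; the function name `svpwm_dwell_times` and its parameters are hypothetical.

```python
import math

def svpwm_dwell_times(v_ref, angle, v_dc, t_s):
    """Textbook SVPWM dwell times (illustrative only).

    v_ref: reference vector magnitude, angle: its angle in radians,
    v_dc: DC-link voltage, t_s: switching period.
    Returns (sector, t1, t2, t0); valid in the linear range
    v_ref <= v_dc / sqrt(3).
    """
    sixty = math.pi / 3
    angle = angle % (2 * math.pi)                # normalize to [0, 2*pi)
    sector = min(int(angle // sixty), 5) + 1     # sectors numbered 1..6
    theta = angle - (sector - 1) * sixty         # angle inside the sector
    k = math.sqrt(3) * t_s * v_ref / v_dc
    t1 = k * math.sin(sixty - theta)             # time on the first adjacent vector
    t2 = k * math.sin(theta)                     # time on the second adjacent vector
    t0 = t_s - t1 - t2                           # remaining time on zero vectors
    return sector, t1, t2, t0
```

The three times always sum to the switching period, and the zero-vector time shrinks to nothing as the reference magnitude approaches the edge of the linear range.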
Funding: Funded by the Hunan Provincial Natural Science Foundation of China (No. 2023JJ50019) and the National Science and Technology Major Project (No. 2022ZD0119003).