Funding: This work was supported by the Project of Shandong Province Higher Educational Science and Technology Program [KJ2018BAN047, Geng, L.], the National Natural Science Foundation of China [61801222, Fu, P.], the Fundamental Research Funds for the Central Universities [30919011230, Fu, P.], the Science and Technology Innovation Program for Distributed Young Talents of Shandong Province Higher Education Institutions [2019KJN045, Guo, Q.], and the Shandong Provincial Key Laboratory of Network Based Intelligent Computing [http://nbic.ujn.edu.cn/].
Abstract: Multispectral remote sensing images (MS-RSIs) are degraded by existing multispectral cameras due to various hardware limitations. In this paper, we propose a novel core tensor dictionary learning approach with a robust modified Gaussian mixture model for MS-RSI restoration. First, each multispectral patch is modeled as a third-order tensor, and high-order singular value decomposition (HOSVD) is applied to the tensor. The task of MS-RSI restoration is then formulated as a minimum sparse core tensor estimation problem. To improve the accuracy of core tensor coding, core tensor estimation based on the robust modified Gaussian mixture model is introduced into the proposed model by exploiting the sparse distribution prior of the image. When applied to MS-RSI restoration, the proposed algorithm better reconstructs the sharpness of image textures and outperforms several existing state-of-the-art multispectral image restoration methods in both objective image quality and subjective visual perception.
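The HOSVD step the abstract describes can be sketched in a few lines of NumPy. This is a generic HOSVD on a toy random patch, not the paper's dictionary-learning pipeline; the 8x8x4 patch size and the use of full (untruncated) factor matrices are illustrative assumptions.

```python
import numpy as np

def unfold(t, mode):
    # Mode-n unfolding: mode-n fibers become the columns of a matrix.
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_mult(t, mode, mat):
    # Mode-n product: multiply mat into the given axis of t.
    return np.moveaxis(np.tensordot(mat, np.moveaxis(t, mode, 0), axes=1), 0, mode)

def hosvd(t):
    # Factor matrices: left singular vectors of each unfolding.
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0]
               for m in range(t.ndim)]
    # Core tensor: project t onto each factor's column space.
    core = t
    for m, u in enumerate(factors):
        core = mode_mult(core, m, u.T)
    return core, factors

# Toy 8x8x4 "multispectral patch" (rows x cols x spectral bands).
rng = np.random.default_rng(0)
patch = rng.standard_normal((8, 8, 4))
core, factors = hosvd(patch)

# With full factors the reconstruction is exact; restoration methods
# instead estimate a sparse core and rebuild the patch from it.
recon = core
for m, u in enumerate(factors):
    recon = mode_mult(recon, m, u)
```

The restoration problem then amounts to estimating a sparse `core` from a degraded patch before expanding it back through the factors.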
Funding: This work was partially supported by the National Natural Science Foundation of China under Grants No. 11161140319 and No. 61001188, the Specialized Research Fund for the Doctoral Program of Higher Education under Grant No. 20101101110020, the Fund for Basic Research from Beijing Institute of Technology under Grant No. 20120542011, and the Fund for the Beijing Higher Education Young Elite Teacher Project under Grant No. YETP1202.
Abstract: Multichannel audio signals are more difficult to compress than mono and stereo ones. This paper proposes a novel multichannel audio compression method based on tensor representation and decomposition. The multichannel audio is represented as a third-order tensor and decomposed into a core tensor and three factor matrices along the channel, time, and frequency modes. Only the truncated core tensor is transmitted; it is multiplied by the pre-trained factor matrices to reconstruct the original tensor. Objective and subjective experiments show considerable compression capability with acceptable output quality. The novelty of the proposed method is that it enables both high compression capability and backward compatibility, with limited audible distortion.
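A minimal NumPy sketch of the encode/decode scheme the abstract outlines: project the channel x time x frequency tensor onto factor matrices, transmit only the truncated core, and expand it again at the decoder. The random "pre-trained" factors, the toy tensor size, and the chosen ranks are all stand-ins for the paper's trained values.

```python
import numpy as np

def mode_mult(t, mode, mat):
    # Mode-n product: multiply mat into the given axis of t.
    return np.moveaxis(np.tensordot(mat, np.moveaxis(t, mode, 0), axes=1), 0, mode)

rng = np.random.default_rng(1)

def orth(n, r):
    # Random orthonormal columns as a stand-in for trained factors.
    q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return q

# Toy "multichannel audio" tensor: channel x time x frequency.
C, T, F = 6, 64, 32
audio = rng.standard_normal((C, T, F))

ranks = (4, 16, 8)                     # truncated core dimensions
U = [orth(n, r) for n, r in zip((C, T, F), ranks)]

# Encoder: project onto the factors; only this core is transmitted.
core = audio
for m, u in enumerate(U):
    core = mode_mult(core, m, u.T)

# Decoder: expand the core with the shared (pre-trained) factors.
recon = core
for m, u in enumerate(U):
    recon = mode_mult(recon, m, u)

ratio = audio.size / core.size         # elements sent: 24x fewer here
```

Because both sides hold the same factor matrices, only the small core crosses the channel; the reconstruction is lossy, with the error controlled by the chosen ranks.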
Abstract: Tensor transposition, a fundamental tensor-operation primitive, is widely used in signal processing, scientific computing, deep learning, and other fields, and plays an important role in tensor-data-intensive applications and high-performance computing. As energy efficiency becomes an increasingly important metric for high-performance computing systems, accelerators based on digital signal processors (DSPs) have been integrated into general-purpose computing systems. However, conventional tensor transposition libraries targeting multi-core CPUs and GPUs cannot be adapted well to DSP architectures because of architectural differences: on the one hand, the vectorization potential of DSP architectures has not been fully exploited; on the other hand, their complex on-chip memory systems and multi-level shared-memory hierarchies pose significant challenges for parallel tensor programming. Targeting the architectural characteristics of a domestic multi-core DSP, we propose the ftmTT algorithm and design and implement a general tensor transposition library for multi-core DSP architectures. ftmTT fully exploits the parallelization and vectorization potential of the DSP through efficient memory-access patterns tailored to its architecture. Its core innovations are: 1) a tiling strategy that reduces high-dimensional tensor transposition to the matrix-transpose kernel operations provided by the multi-core DSP platform; 2) a coalescing scheme for tensor-block memory access based on DMA point-to-point transfers, which reduces data-movement overhead; and 3) a double-buffering design that asynchronously overlaps transpose computation with DMA transfers to hide communication behind computation, yielding high-performance parallel tensor transposition on multi-core DSPs. Experiments on the domestic multi-core DSP platform FT-M7032 show that ftmTT achieves up to 75.96% of the theoretical bandwidth, which is 99.23% of the platform's measured STREAM bandwidth.
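The tiling idea in innovation 1) — reducing a high-dimensional transpose to small block transposes that fit in fast local memory — can be sketched in NumPy for a 3-D (1,0,2) axis permutation. The tile size, the chosen permutation, and the comments mapping slices to DMA transfers are illustrative; the actual ftmTT kernels and DMA scheme are DSP-specific.

```python
import numpy as np

def blocked_transpose_3d(t, tile=16):
    # Permute axes (0,1,2) -> (1,0,2) of a 3-D tensor tile by tile.
    # Each (tile x tile x d2) block is small enough for fast on-chip
    # memory, which is the role the DSP matrix-transpose kernel and
    # DMA block transfers play in ftmTT.
    d0, d1, d2 = t.shape
    out = np.empty((d1, d0, d2), dtype=t.dtype)
    for i in range(0, d0, tile):
        for j in range(0, d1, tile):
            block = t[i:i + tile, j:j + tile, :]              # "DMA load"
            out[j:j + tile, i:i + tile, :] = block.transpose(1, 0, 2)  # kernel + "DMA store"
    return out

x = np.arange(24 * 20 * 4).reshape(24, 20, 4)
y = blocked_transpose_3d(x)
```

Double buffering (innovation 3) would issue the next block's "DMA load" while the current block is being transposed, hiding transfer latency behind computation.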
Funding: Supported by the National Key R&D Program of China (No. 2019YFB2203700) and the National Natural Science Foundation of China (No. 61822508).
Abstract: We propose an optical tensor core (OTC) architecture for neural network training. The key computational components of the OTC are the arrayed optical dot-product units (DPUs). The homodyne-detection-based DPUs can perform the essential computational work of neural network training, i.e., matrix-matrix multiplication. A dual-layer waveguide topology feeds data into these DPUs with ultra-low insertion loss and crosstalk. The OTC architecture therefore allows a large-scale dot-product array and can be integrated on a photonic chip. The feasibility of the OTC and its effectiveness for neural network training are verified with numerical simulations.
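The decomposition the DPU array exploits is simply that every element of a matrix product is an independent inner product. A scalar NumPy sketch of that mapping (the loop body is what one homodyne DPU would evaluate, with the array running all iterations in parallel; sizes are illustrative):

```python
import numpy as np

def matmul_via_dot_array(a, b):
    # C[i, j] = <row i of A, column j of B>: each output element is an
    # independent dot product, so an array of DPUs can compute all of
    # them concurrently once the waveguides fan the operands out.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            out[i, j] = np.dot(a[i, :], b[:, j])   # one DPU's job
    return out

rng = np.random.default_rng(2)
a = rng.standard_normal((4, 8))
b = rng.standard_normal((8, 3))
c = matmul_via_dot_array(a, b)
```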
Funding: Funded by the Key Technologies Research and Development Program (No. 2020YFB1709500).
Abstract: In this paper, we propose and implement a mixed-precision Block-ISAI preconditioner for solving linear systems arising in multiphysics applications. By leveraging FP32 computing, our approach accelerates the sparse matrix-vector product kernel while maintaining satisfactory accuracy. We also propose an efficient warp-based GPU implementation of the Block-ISAI preconditioner with Tensor Core acceleration, using the double-precision Tensor Cores of the NVIDIA A100 GPU to accelerate its matrix-multiplication portion. Detailed comparisons show noteworthy speedups: our method is 6x faster than cuSPARSE and 11.2x faster than PETSc's built-in preconditioner.
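The mixed-precision idea — run the bandwidth-bound SpMV kernel in FP32 while the rest of the solver keeps FP64 data — can be illustrated with a hand-rolled CSR sparse matrix-vector product in NumPy. This is a minimal sketch, not the warp-based Block-ISAI GPU implementation; the small random matrix is a stand-in for a multiphysics system.

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    # CSR sparse matrix-vector product; data and x are FP32 here,
    # mirroring running the SpMV kernel in reduced precision while
    # the outer solver keeps FP64 state.
    y = np.zeros(len(indptr) - 1, dtype=x.dtype)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

# Small random sparse matrix and its CSR form (row-major nonzeros).
rng = np.random.default_rng(3)
A = np.where(rng.random((6, 6)) < 0.4, rng.standard_normal((6, 6)), 0.0)
indptr = np.concatenate(([0], np.count_nonzero(A, axis=1).cumsum()))
indices = np.nonzero(A)[1]
data = A[A != 0]

x64 = rng.standard_normal(6)
y32 = spmv_csr(indptr, indices, data.astype(np.float32), x64.astype(np.float32))
y64 = A @ x64   # FP64 reference
```

On a GPU the FP32 kernel moves half the bytes of the FP64 one, which is where the speedup of a memory-bound SpMV comes from; the result differs from FP64 only by rounding at the FP32 level.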
Funding: Supported by the Federal Ministry of Research, Technology and Space under funding code "KI-Servicezentrum Berlin-Brandenburg" (16IS22092).
Abstract: Optimizing GEneral Matrix Multiplication (GEMM) on GPU platforms is increasingly critical to meet the growing computational demands of modern deep neural network research. While significant progress has been made in accelerating high-precision GEMM, optimizing low-bit GEMM remains a challenging open problem. The CUTLASS library provides highly optimized low-bit GEMM templates leveraging Tensor Cores; however, performance varies considerably with the tile and pipeline configuration across GPU architectures. In this work, we propose a novel auto-tuning framework for low-bit CUTLASS GEMM that uses a neural network model to predict the optimal GEMM template parameters for a target GPU. The model is trained on a synthetic dataset of up to 116,100 unique samples covering diverse matrix sizes across various Ampere GPUs, and is thoroughly evaluated on these hardware platforms. Experimental results show that our method achieves an accuracy of up to 95.11% on the validation dataset. Furthermore, real-time evaluations of low-bit data types on the A100 GPU demonstrate speedups of up to 1.99× for GEMM operations and 1.28× for the linear layer, compared with the default CUTLASS templates.
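For contrast with the paper's learned predictor, the baseline it replaces is exhaustive tuning: benchmark each candidate template parameter and keep the fastest. A toy NumPy sketch with tile size as the only tunable parameter (the candidate list and the blocked GEMM are illustrative stand-ins, not CUTLASS templates):

```python
import time
import numpy as np

# Candidate "template parameters" to choose among; a trained predictor
# would pick one directly instead of timing them all.
CANDIDATE_TILES = [16, 32, 64]

def blocked_gemm(a, b, tile):
    # Tiled matrix multiply: accumulate tile x tile products, the same
    # structural knob (tile shape) that CUTLASS templates expose.
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

def autotune(m, n, k):
    # Exhaustive search: run every candidate and keep the fastest.
    a, b = np.ones((m, k)), np.ones((k, n))
    timings = {}
    for tile in CANDIDATE_TILES:
        t0 = time.perf_counter()
        blocked_gemm(a, b, tile)
        timings[tile] = time.perf_counter() - t0
    return min(timings, key=timings.get)

best = autotune(128, 128, 128)
```

The paper's contribution is replacing this per-shape, per-GPU search with a single forward pass of a model trained on such measurements.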
Funding: Supported by the National Science and Technology Major Project (2023ZD0120502), the National Natural Science Foundation of China under Grant No. 62372055, the Fundamental Research Funds for the Central Universities, and the fund of the Laboratory for Advanced Computing and Intelligence Engineering.
Abstract: Transformers are widely used in fields such as natural language processing and computer vision. However, training large Transformer models is time-consuming, largely because of the Multi-Head Attention (MHA) mechanism, and the cost grows as models become larger, so it is crucial to use available hardware resources efficiently. NVIDIA Volta GPUs are still widely deployed, but because the computational shapes supported by their Tensor Core Units (TCUs) differ from those of other GPU architectures, most prior efforts have not used them to accelerate Transformer training. To address this, we propose SparkAttention, an acceleration library designed to speed up MHA training on Volta GPUs. SparkAttention leverages TCUs and kernel fusion to reduce the number of high-bandwidth memory (HBM) accesses and the associated overhead. Our end-to-end experiments on an NVIDIA V100 GPU show that SparkAttention achieves an average 1.80× (up to 2.46×) speedup compared with PyTorch.
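The memory-traffic problem SparkAttention targets is visible in the unfused formulation of scaled dot-product attention, which materializes the full score and weight matrices. A single-head NumPy sketch (shapes are illustrative; fused TCU kernels compute the same function without writing the intermediates to HBM):

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Unfused attention: S (scores) and P (weights) are materialized
    # as full seq x seq matrices. On a GPU that means extra round
    # trips to HBM; kernel fusion keeps these tiles in on-chip memory.
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)     # scores, seq x seq
    p = softmax(s)               # attention weights, rows sum to 1
    return p @ v

rng = np.random.default_rng(4)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = attention(q, k, v)
```

MHA runs this per head on projected slices of the input; the fused version computes the same `out` tile by tile, which is what cuts HBM accesses.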