Abstract: Transformers have demonstrated considerable success across various domains but are constrained by their significant computational and memory requirements. This poses challenges for deployment on resource-constrained devices. Quantization, as an effective model compression method, can significantly reduce the runtime of Transformers on edge devices. Notably, Transformers exhibit more substantial outliers than convolutional neural networks, leading to uneven feature distributions across channels and tokens. To address this issue, we propose an adaptive outlier correction quantization (AOCQ) method for Transformers, which significantly alleviates the adverse effects of these outliers. AOCQ corrects the notable discrepancies across channels and tokens at three levels: the operator level, the framework level, and the loss level. We introduce a new operator that equivalently balances the activations across different channels, and insert an extra stage that optimizes the activation quantization step at the framework level. Additionally, we transfer the imbalanced activations across tokens and channels to the optimization of the model weights at the loss level. Our theoretical analysis shows that the method reduces quantization error. The effectiveness of the proposed method is verified on various benchmark models and tasks. Surprisingly, DeiT-Base with 8-bit post-training quantization (PTQ) achieves 81.57% accuracy, a drop of only 0.28 percentage points, while enjoying 4× faster runtime. Furthermore, the weights of Swin and DeiT on several tasks, including classification and object detection, can be post-quantized to ultra-low 4 bits with a minimal accuracy loss of 2%, while requiring nearly 8× less memory.
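As an illustration of the channel-balancing idea described in the abstract, the following is a minimal PyTorch sketch that migrates per-channel activation outliers into the weights of a linear layer before uniform quantization. The function names (balance_linear, quantize_per_tensor), the migration exponent alpha, and the tensor shapes are illustrative assumptions, not the AOCQ operator itself.

```python
# Hypothetical sketch: balance activation outliers across channels before PTQ.
# Not the AOCQ operator; the scale-migration rule below is an assumption.
import torch


def balance_linear(x, weight, alpha=0.5, eps=1e-8):
    """Move per-channel activation magnitude into the weights so that
    x @ weight.T is unchanged while x becomes easier to quantize.

    x:      [tokens, in_features] calibration activations
    weight: [out_features, in_features] linear-layer weights
    """
    act_max = x.abs().amax(dim=0).clamp(min=eps)            # per-channel activation range
    w_max = weight.abs().amax(dim=0).clamp(min=eps)          # per-channel weight range
    scale = (act_max ** alpha) / (w_max ** (1.0 - alpha))    # migration strength
    x_bal = x / scale                                        # flatter activation channels
    w_bal = weight * scale                                   # weights absorb the scale
    return x_bal, w_bal


def quantize_per_tensor(t, n_bits=8):
    """Uniform symmetric fake-quantization with a single step size."""
    qmax = 2 ** (n_bits - 1) - 1
    step = t.abs().max().clamp(min=1e-8) / qmax
    return (t / step).round().clamp(-qmax, qmax) * step


if __name__ == "__main__":
    x = torch.randn(128, 768)
    x[:, 5] *= 30.0                        # inject an outlier channel
    w = torch.randn(768, 768) * 0.02
    x_bal, w_bal = balance_linear(x, w)
    ref = x @ w.t()                        # full-precision reference output
    q = quantize_per_tensor(x_bal, 8) @ quantize_per_tensor(w_bal, 8).t()
    print((q - ref).abs().mean() / ref.abs().mean())  # relative quantization error
```

Because the activations and weights are scaled by reciprocal per-channel factors, the full-precision product is unchanged, while the flattened activation channels lose less information under a single per-tensor quantization step.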
Funding: Supported by the National Natural Science Foundation of China (Nos. 62176254, 61976210, 61876086, 62076235, 62002356, 62006230 and 62002357) and the National Key R&D Program of China (No. 2021ZD0110403).
Abstract: Structural neural network pruning aims to remove redundant channels in deep convolutional neural networks (CNNs) by pruning the filters that matter least to the final output accuracy. To reduce the performance degradation after pruning, many methods augment the loss with sparse regularization to produce structured sparsity. In this paper, we analyze these sparsity-training-based methods and find that regularizing the unpruned channels is unnecessary. Moreover, it restricts the network's capacity, which leads to under-fitting. To solve this problem, we propose a novel pruning method, named Mask Sparsity, with pruning-aware sparse regularization. Mask Sparsity imposes fine-grained sparse regularization on the specific filters selected by a pruning mask, rather than on all filters of the model. Before applying the fine-grained sparse regularization of Mask Sparsity, the pruning mask can be obtained by many methods, such as running global sparse regularization. Mask Sparsity achieves a 63.03% reduction in floating point operations (FLOPs) on ResNet-110 by removing 60.34% of the parameters, with no top-1 accuracy loss on CIFAR-10. On ILSVRC-2012, Mask Sparsity reduces more than 51.07% of the FLOPs on ResNet-50, with only a 0.76% loss in top-1 accuracy. The code of this paper is released at https://github.com/CASIA-IVA-Lab/MaskSparsity. We have also integrated the code into a self-developed PyTorch pruning toolkit, named EasyPruner, at https://gitee.com/casia_iva_engineer/easypruner.
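To make the pruning-aware regularization concrete, the sketch below applies an L1 penalty only to the BatchNorm scale factors of channels marked for pruning by a binary mask, leaving the kept channels unregularized. The helper names (masked_sparsity_loss, magnitude_mask), the choice of BatchNorm scales as the regularization target, and the magnitude-based mask are illustrative assumptions rather than the exact Mask Sparsity procedure.

```python
# Hypothetical sketch: pruning-aware sparse regularization restricted to the
# channels selected by a pruning mask (an assumption, not the paper's exact code).
import torch
import torch.nn as nn


def magnitude_mask(model, prune_ratio=0.5):
    """One simple way to obtain a pruning mask: mark the channels with the
    smallest BatchNorm scales in each layer for removal (True = prune)."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            k = int(module.weight.numel() * prune_ratio)
            idx = module.weight.abs().argsort()[:k]            # smallest-k channels
            mask = torch.zeros_like(module.weight, dtype=torch.bool)
            mask[idx] = True
            masks[name] = mask
    return masks


def masked_sparsity_loss(model, masks, lam=1e-4):
    """L1 penalty summed only over the BatchNorm scales of to-be-pruned channels,
    so the kept filters are left unregularized and free to fit the data."""
    penalty = 0.0
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d) and name in masks:
            penalty = penalty + module.weight[masks[name]].abs().sum()
    return lam * penalty


# Usage inside a training step (cross-entropy plus the masked penalty):
#   masks = magnitude_mask(model, prune_ratio=0.5)
#   loss = criterion(model(images), labels) + masked_sparsity_loss(model, masks)
#   loss.backward(); optimizer.step()
```

Restricting the penalty to the masked channels drives only the filters slated for removal toward zero, which is the point of pruning-aware regularization as opposed to regularizing every filter in the network.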