Abstract: The rapid growth of digital data necessitates advanced natural language processing (NLP) models like BERT (Bidirectional Encoder Representations from Transformers), known for its superior performance in text classification. However, BERT's size and computational demands limit its practicality, especially in resource-constrained settings. This research compresses the BERT base model for Bengali emotion classification through knowledge distillation (KD), pruning, and quantization techniques. Despite Bengali being the sixth most spoken language globally, NLP research in this area is limited. Our approach addresses this gap by creating an efficient BERT-based model for Bengali text. We explored 20 combinations of KD, quantization, and pruning, resulting in improved speedup, fewer parameters, and reduced memory size. Our best results demonstrate significant improvements in both speed and efficiency. For instance, in the case of mBERT, we achieved a 3.87× speedup and a 4× compression ratio with a Distil+Prune+Quant combination that reduced the parameter count from 178 M to 46 M, while the memory size decreased from 711 MB to 178 MB. These results offer scalable solutions for NLP tasks in various languages and advance the field of model compression, making these models suitable for real-world applications in resource-limited environments.
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2022YFB2502903, NSFC under Grant Nos. 62495092 and 62088102, and the Natural Science Basic Research Plan in Shaanxi Province of China under Grant No. 2025SYS-SYSZD-023.
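The abstract above combines distillation, pruning, and quantization; the 4× memory reduction it reports is what symmetric INT8 post-training quantization alone delivers on float32 weights. The following is a minimal sketch of that one building block, not the paper's pipeline; the matrix shape and random data are purely illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric uniform quantization of a float32 tensor to int8."""
    scale = np.max(np.abs(w)) / 127.0  # map the largest magnitude to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 tensor."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768)).astype(np.float32)  # a BERT-sized weight matrix
q, scale = quantize_int8(w)

compression = w.nbytes / q.nbytes  # float32 -> int8: 4x smaller
error = np.abs(dequantize(q, scale) - w).max()  # bounded by scale / 2
```

Per-tensor symmetric quantization like this is the simplest scheme; the rounding error never exceeds half the scale, which is why accuracy often survives INT8 with little or no fine-tuning.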
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision tasks. However, these models are memory-consuming and computation-intensive, making their deployment and efficient inference on edge devices challenging. Model quantization is a promising approach to reducing model complexity. Prior works have explored tailored quantization algorithms for ViTs but unfortunately retained floating-point (FP) scaling factors, which not only yield non-negligible re-quantization overhead but also hinder the quantized models from performing efficient integer-only inference. In this paper, we propose H-ViT, a dedicated post-training quantization scheme (e.g., symmetric uniform quantization and layer-wise quantization for both weights and part of the activations) that effectively quantizes ViTs with fewer Power-of-Two (PoT) scaling factors, thus minimizing the re-quantization overhead and memory consumption. In addition, observing serious inter-channel variation in LayerNorm inputs and outputs, we propose Power-of-Two quantization, a systematic method for reducing the performance degradation without hyper-parameters. Extensive experiments are conducted on multiple vision tasks with different model variants, proving that H-ViT offers comparable (or even slightly higher) INT8 quantization performance with PoT scaling factors compared to the counterpart with floating-point scaling factors. For instance, we reach 78.43% top-1 accuracy with DeiT-S on ImageNet, and 51.6 box AP and 44.8 mask AP with Cascade Mask R-CNN (Swin-B) on COCO.
Funding: Supported by the National Natural Science Foundation of China (Nos. 62301092 and 62301093).
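The payoff of the Power-of-Two scaling factors described above is that re-quantization between integer layers reduces to a bit shift instead of a floating-point multiply. The sketch below illustrates that idea under assumed, illustrative scales; it is not H-ViT's actual scale-selection procedure, and the `pot_scale` rounding rule (round the FP scale up to the next power of two) is one simple choice among several.

```python
import numpy as np

def pot_scale(x: np.ndarray) -> float:
    """Replace the usual FP scale max|x|/127 with the next power of two."""
    fp = np.max(np.abs(x)) / 127.0
    return float(2.0 ** np.ceil(np.log2(fp)))

# Illustrative scales for one linear layer: input, weight, and output.
s_in, s_w, s_out = 2.0 ** -5, 2.0 ** -6, 2.0 ** -4
ratio = s_in * s_w / s_out           # 2^-7, exactly representable
shift = int(-np.log2(ratio))         # so re-quantization is a 7-bit right shift

acc = np.array([512, 1024, -256], dtype=np.int64)    # int32-style accumulators
requant_fp = np.round(acc * ratio).astype(np.int64)  # FP multiply path
requant_shift = acc >> shift                         # integer-only path, same result
```

With FP scales the `ratio` is an arbitrary float and the multiply (or a fixed-point emulation of it) is unavoidable; with PoT scales every inter-layer rescale is a shift, which is what enables integer-only inference.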
Abstract: Vision Transformers (ViTs) have achieved remarkable success across various artificial intelligence-based computer vision applications. However, their demanding computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices. Although post-training quantization (PTQ) provides a promising solution by reducing model precision with minimal calibration data, aggressive low-bit quantization typically leads to substantial performance degradation. To address this challenge, we present the truncated uniform-log2 quantizer and progressive bit-decline reconstruction method for vision Transformer quantization (TP-ViT). It is an innovative PTQ framework specifically designed for ViTs, featuring two key technical contributions: (1) the truncated uniform-log2 quantizer, a novel quantization approach that effectively handles outlier values in post-Softmax activations, significantly reducing quantization errors; (2) the bit-decline optimization strategy, which employs transition weights to gradually reduce bit precision while maintaining model performance under extreme quantization conditions. Comprehensive experiments on image classification, object detection, and instance segmentation tasks demonstrate TP-ViT's superior performance compared to state-of-the-art PTQ methods, particularly in challenging 3-bit quantization scenarios. Our framework achieves a notable 6.18-percentage-point improvement in top-1 accuracy for ViT-small under 3-bit quantization. These results validate TP-ViT's robustness and general applicability, paving the way for more efficient deployment of ViT models in computer vision applications on edge hardware.
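The quantizer above targets post-Softmax activations, which live in (0, 1] and cluster near zero, so a logarithmic grid fits them far better than a uniform one. The sketch below shows a plain 3-bit log2 quantizer (store q = round(-log2 p), dequantize as 2^-q) to illustrate that intuition; it is not the paper's truncated uniform-log2 hybrid, and the example logits are made up.

```python
import numpy as np

def log2_quantize(p: np.ndarray, bits: int = 3) -> np.ndarray:
    """Log2-quantize probabilities in (0, 1]: store q = round(-log2 p)."""
    qmax = 2 ** bits - 1
    # Floor tiny probabilities at the smallest representable level 2^-qmax.
    q = np.clip(np.round(-np.log2(np.maximum(p, 2.0 ** -qmax))), 0, qmax)
    return q.astype(np.int64)

def log2_dequantize(q: np.ndarray) -> np.ndarray:
    """Map stored codes back to powers of two."""
    return 2.0 ** (-q.astype(np.float64))

logits = np.array([2.0, 1.0, 0.1, -3.0])
p = np.exp(logits) / np.exp(logits).sum()  # Softmax probabilities
q = log2_quantize(p, bits=3)               # 3-bit codes in [0, 7]
p_hat = log2_dequantize(q)                 # reconstructed probabilities
```

A 3-bit uniform quantizer over [0, 1] has a step of 1/7 and flattens almost all small attention probabilities to the same level, while the log2 grid keeps them distinguishable; the truncation in TP-ViT additionally caps the effect of large post-Softmax outliers.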
Abstract: Transformers have demonstrated considerable success across various domains but are constrained by their significant computational and memory requirements. This poses challenges for deployment on resource-constrained devices. Quantization, as an effective model compression method, can significantly reduce the runtime of Transformers on edge devices. Notably, Transformers exhibit more substantial outliers than convolutional neural networks, leading to uneven feature distributions across channels and tokens. To address this issue, we propose an adaptive outlier correction quantization (AOCQ) method for Transformers, which significantly alleviates the adverse effects of these outliers. AOCQ adjusts the notable discrepancies across channels and tokens at three levels: the operator level, the framework level, and the loss level. We introduce a new operator that equivalently balances the activations across different channels, and insert an extra stage to optimize the activation quantization step at the framework level. Additionally, we transfer the imbalanced activations across tokens and channels to the optimization of model weights at the loss level. Based on our theoretical study, the method provably reduces the quantization error. Its effectiveness is verified on various benchmark models and tasks. Surprisingly, DeiT-Base with 8-bit post-training quantization (PTQ) can achieve 81.57% accuracy, a drop of only 0.28 percentage points, while enjoying 4× faster runtime. Furthermore, the weights of Swin and DeiT on several tasks, including classification and object detection, can be post-quantized to ultra-low 4 bits with a minimal accuracy loss of 2%, while requiring nearly 8× less memory.
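The operator-level idea above — an operator that "equivalently balances the activations across different channels" — can be illustrated with a simple per-channel rescaling: divide each activation channel by a scale and fold that scale into the next layer's weights, leaving the layer's output mathematically unchanged while flattening the activation ranges that outlier channels would otherwise stretch. This is a generic equivalent-transform sketch with made-up data, not AOCQ's actual operator or its scale-selection rule.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))   # activations: 4 tokens x 8 channels
x[:, 3] *= 50.0               # one outlier channel, as Transformers exhibit
w = rng.normal(size=(8, 16))  # weights of the following linear layer

# Per-channel scales chosen (for illustration) as each channel's peak magnitude.
s = np.max(np.abs(x), axis=0)
x_bal = x / s                 # balanced activations: every channel now peaks at 1
w_bal = w * s[:, None]        # fold the scales into the weights

y = x @ w                     # original computation
y_bal = x_bal @ w_bal         # identical output from the balanced pair
```

After the transform, a per-tensor quantizer no longer has to stretch its range to cover the single outlier channel, so the quantization step for the remaining channels shrinks; the cost is that the imbalance moves into the weights, which is exactly why methods of this kind also optimize the weights, as the loss-level component above does.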