Large-scale Language Models (LLMs) have achieved significant breakthroughs in Natural Language Processing (NLP), driven by the pre-training and fine-tuning paradigm. While this approach allows models to specialize in specific tasks with reduced training costs, the substantial memory requirements during fine-tuning present a barrier to broader deployment. Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), and parameter quantization methods have emerged as solutions to these challenges, optimizing memory usage and computational efficiency. Among these, QLoRA, which combines PEFT and quantization, has demonstrated notable success in reducing memory footprints during fine-tuning, prompting the development of various QLoRA variants. Despite these advancements, the quantitative impact of key variables on the fine-tuning performance of quantized LLMs remains underexplored. This study presents a comprehensive analysis of these key variables, focusing on their influence across different layer types and depths within LLM architectures. Our investigation uncovers several critical findings: (1) Larger layers, such as MLP layers, can maintain performance despite reductions in adapter rank, while smaller layers, like self-attention layers, are more sensitive to such changes; (2) The effectiveness of balancing factors depends more on their specific values than on layer type or depth; (3) In quantization-aware fine-tuning, larger layers can effectively utilize smaller adapters, whereas smaller layers struggle to do so. These insights suggest that layer type is a more significant determinant of fine-tuning success than layer depth when optimizing quantized LLMs. Moreover, for the same reduction in trainable parameters, shrinking the adapter of a larger layer preserves fine-tuning accuracy better than shrinking that of a smaller one. This study provides valuable guidance for more efficient fine-tuning strategies and opens avenues for further research into optimizing LLM fine-tuning in resource-constrained environments.
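To make the roles of the adapter rank and the balancing factor concrete, the sketch below wraps a frozen linear layer with a LoRA-style low-rank adapter in PyTorch. It is a minimal illustration only: the `LoRALinear` class, the 4096/11008 layer widths, and the rank/alpha choices are assumptions for the example, not the configuration used in the study; the alpha/r scaling term plays the role of the balancing factor.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank adapter: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                # base weights stay frozen
            p.requires_grad_(False)                     # (and would be quantized under QLoRA)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, zero-init
        self.scaling = alpha / r                        # the "balancing factor" between the two paths

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling


# Illustrative per-layer-type ranks: the wide MLP projection gets a small rank,
# while the narrower self-attention projection keeps a larger one.
attn_proj = LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)
mlp_proj = LoRALinear(nn.Linear(4096, 11008), r=4, alpha=8)

trainable = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
print(trainable(attn_proj), trainable(mlp_proj))        # adapter parameters: r * (in + out)
```

Since each adapter holds r·(d_in + d_out) trainable parameters, a given rank is far more expensive on the wide MLP projection than on the attention projection; finding (1) suggests the MLP side is exactly where the rank can be cut with the least accuracy loss.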
To address the problems of existing BNN (Binarized Neural Network) pruning methods, namely low pruning ratios, significant drops in recognition accuracy, and reliance on post-training fine-tuning, a filter-level BNN pruning method based on evolution from ternary to binary values, named ETB (Evolution from Ternary to Binary), is proposed. ETB is learning-based: by introducing trainable quantization thresholds into the BNN quantization function, weights and activations gradually evolve from ternary to binary values or zero, so that the network automatically identifies unimportant structures during training. In addition, a pruning-rate adjustment algorithm is designed to regulate the network's pruning rate. After training, all-zero filters and their corresponding output channels can be pruned directly to obtain a slim BNN, with no fine-tuning required. To demonstrate the feasibility of the proposed method and its potential to improve BNN inference efficiency without sacrificing accuracy, experiments were conducted on the CIFAR-10 dataset: ETB pruned 46.3% of the VGG-Small model, compressing it to 0.34 MByte with 89.97% accuracy, and pruned 30.01% of the ResNet-18 model, compressing it to 1.33 MByte with 90.79% accuracy. Compared with several existing BNN pruning methods, ETB shows advantages in both accuracy and parameter count.
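The translated abstract describes two mechanisms: a quantization function with trainable thresholds that lets weights drift from ternary to binary or zero, and direct removal of all-zero filters after training. The PyTorch sketch below illustrates both for the weight path only; the function names `ternary_quantize` and `prune_zero_filters`, the straight-through estimator, and the per-filter threshold parameterization are assumptions for illustration and do not reproduce ETB's exact formulation or its pruning-rate adjustment algorithm.

```python
import torch
import torch.nn as nn


def ternary_quantize(w: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Map weights to {-1, 0, +1}: values below the per-filter threshold collapse to 0.

    A straight-through estimator passes gradients to w; in ETB the threshold is also
    trainable, which lets whole filters drift toward all-zero during training.
    """
    mask = (w.abs() > delta.view(-1, 1, 1, 1)).float()
    q = torch.sign(w) * mask
    return w + (q - w).detach()   # forward uses q, backward sees identity w.r.t. w


def prune_zero_filters(conv: nn.Conv2d, delta: torch.Tensor) -> nn.Conv2d:
    """After training, drop output channels whose quantized filters are entirely zero."""
    with torch.no_grad():
        q = torch.sign(conv.weight) * (conv.weight.abs() > delta.view(-1, 1, 1, 1)).float()
        keep = q.flatten(1).abs().sum(dim=1) > 0         # filters with at least one nonzero weight
        pruned = nn.Conv2d(conv.in_channels, int(keep.sum()), conv.kernel_size,
                           conv.stride, conv.padding, bias=conv.bias is not None)
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned


# Example: one per-filter threshold per output channel; filters whose weights all fall
# below their threshold quantize to zero and are removed without any fine-tuning.
conv = nn.Conv2d(64, 128, 3, padding=1)
delta = nn.Parameter(torch.full((128,), 0.05))           # per-filter trainable threshold
w_q = ternary_quantize(conv.weight, delta)
slim = prune_zero_filters(conv, delta)
print(conv.weight.shape, slim.weight.shape)
```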
Funding: Supported by the National Key R&D Program of China (No. 2021YFB0301200) and the National Natural Science Foundation of China (No. 62025208).