Abstract
In traditional image classification networks, the convolution operation of the Convolutional Neural Network (CNN) requires a large number of multiply-accumulate operations, incurring a high computational cost. The flexible self-attention mechanism of the Transformer requires large-scale data to reduce the risk of overfitting, resulting in a large parameter count and high computational complexity. To address these problems, a multi-stage image classification model, HTCNet (Hybrid Transformer-Convolution Network), is proposed. In the shallow stages of the model, partial convolution exploits feature-map redundancy by convolving only a subset of channels, reducing the model's floating point operations (FLOPs). In the deep stages, convolution is incorporated into self-attention to construct an efficient attention mechanism that effectively alleviates overfitting and reduces the model's dependence on data. A Convolution Positional Encoding (CPE) that adapts to the input resolution is introduced to capture richer positional information. HTCNet achieves classification accuracies of 95.4% on CIFAR-10 and 82.6% on ImageNet-1K, datasets of different scales. Experimental results show that HTCNet outperforms convolutional neural networks and other Transformer models of comparable scale.
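The partial-convolution idea described for the shallow stages can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the 3×3 kernel size, the `n_div=4` split ratio, and the function name are assumptions. Only the first C/n_div channels are convolved; the rest pass through unchanged, so with `n_div=4` the convolution's multiply-accumulate count is 1/16 that of a full convolution with the same kernel.

```python
import numpy as np

def partial_conv(x, weight, n_div=4):
    """Hypothetical sketch of partial convolution.

    x      : input feature map, shape (C, H, W)
    weight : kernel for the convolved slice, shape (C_p, C_p, 3, 3),
             where C_p = C // n_div
    Only the first C_p channels are convolved ('same' padding);
    the remaining channels are copied through untouched.
    """
    C, H, W = x.shape
    cp = C // n_div
    out = x.copy()

    # Zero-pad the convolved slice so the 3x3 kernel preserves H x W.
    xp = np.pad(x[:cp], ((0, 0), (1, 1), (1, 1)))

    # Naive direct convolution over the first cp channels only.
    conv = np.zeros((cp, H, W))
    for o in range(cp):
        for i in range(cp):
            for ki in range(3):
                for kj in range(3):
                    conv[o] += weight[o, i, ki, kj] * xp[i, ki:ki + H, kj:kj + W]

    out[:cp] = conv
    return out
```

In practice such a layer is typically followed by a pointwise (1×1) convolution so that information still mixes across all channels.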
Authors
ZHU Linglong, WANG Yagang, CHEN Yi (School of Optical-Electrical & Computer Engineering, Shanghai University of Science & Technology, Shanghai 200093, China)
Source
Electronic Science and Technology (《电子科技》), 2025, No. 10, pp. 96-105 (10 pages)
Funding
National Key R&D Program of China (2020YFC2007502).