Abstract
With the continuous development of the Internet, people create large volumes of complex image data every day, and today's mainstream social media is saturated with image data; retrieving images quickly and accurately has therefore become a meaningful and pressing problem. Convolutional neural network (CNN) models are the current mainstream hashing-based image retrieval models. However, the convolution operation of a CNN captures only local features and cannot model global information, and its fixed receptive-field size cannot adapt to input images of different scales. To address this, effective image retrieval is implemented on the basis of the Swin-Transformer, a variant of the Transformer model: the Transformer's self-attention mechanism and positional encoding effectively overcome these limitations of CNNs. However, the window attention module of existing Swin-Transformer hashing retrieval models assigns the same weight to every channel of the image when extracting features, ignoring the differences and dependencies among channel-wise feature information; this reduces the usability of the extracted features and wastes computing resources. To solve these problems, a hashing image retrieval model based on mixed attention and polarization asymmetric loss (HRMPA) is proposed. The design starts from a Swin-Transformer-based hash feature extraction module (HFST); a channel attention block (CAB) is added to the (S)W-MSA module of HFST, yielding a hash feature extraction module based on mixed attention (HFMA). The model can thus assign different weights to the features of different channels of the input image, which increases the diversity of the extracted features and makes full use of computing resources. Meanwhile, to minimize the intra-class Hamming distance, maximize the inter-class Hamming distance, fully exploit the supervision information of the data, and improve retrieval precision, a polarization asymmetric loss (PA) is proposed, which combines the polarization loss and the asymmetric loss with a certain weight allocation ratio and thereby effectively improves image retrieval precision. Experiments show that with a hash code length of 16 bits, the proposed model achieves a highest mean average precision of 98.73% on the single-label CIFAR-10 dataset, 1.51% higher than the VTS16-CSQ model, and 90.65% on the multi-label NUS-WIDE dataset, 18.02% higher than TransHash and 5.92% higher than VTS16-CSQ.
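The abstract describes adding a channel attention block (CAB) to the window attention module so that different channels receive different weights. The paper's CAB details are not given here; the sketch below assumes a common squeeze-and-excitation-style design (global average pooling, a bottleneck of two fully connected layers, sigmoid gating), with random stand-ins for trained weights:

```python
import numpy as np

def channel_attention(x, reduction=4):
    """SE-style channel attention sketch: reweight each channel of a
    feature map x with shape (C, H, W) by a gate derived from its
    global average. FC weights are random stand-ins for trained ones."""
    rng = np.random.default_rng(0)
    c, h, w = x.shape
    squeezed = x.mean(axis=(1, 2))                 # (C,) global average pool
    w1 = rng.standard_normal((c, c // reduction))  # squeeze FC
    w2 = rng.standard_normal((c // reduction, c))  # excite FC
    hidden = np.maximum(squeezed @ w1, 0.0)        # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # per-channel weight in (0, 1)
    return x * gate[:, None, None]                 # rescale each channel

feat = np.random.default_rng(1).standard_normal((8, 7, 7))
out = channel_attention(feat)
```

Because the gate lies in (0, 1), each channel is attenuated in proportion to its learned importance while the spatial layout is untouched; in the model this output would feed back into the (S)W-MSA path.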
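The PA loss is described only as a weighted combination of a polarization loss and an asymmetric loss. The exact forms are not given in the abstract; the sketch below uses illustrative stand-ins, a hinge-style polarization term that pushes each relaxed hash output past a margin toward its target bit, and a simplified asymmetric term comparing real-valued query outputs against fixed binary database codes, mixed by a hypothetical allocation ratio `alpha`:

```python
import numpy as np

def polarization_loss(h, t, margin=1.0):
    """Hinge-style polarization sketch: push each relaxed hash output h
    toward its target bit t in {-1, +1} with at least `margin` magnitude."""
    return np.maximum(margin - h * t, 0.0).mean()

def asymmetric_pair_loss(h_query, b_db, similar):
    """Simplified asymmetric term: real-valued query outputs vs. fixed
    binary database codes; pull similar pairs together, push
    dissimilar pairs apart."""
    d = np.mean((h_query - b_db) ** 2, axis=1)          # per-pair distance
    return np.where(similar, d, np.maximum(1.0 - d, 0.0)).mean()

def pa_loss(h_query, b_db, t, similar, alpha=0.7):
    """Weighted combination; alpha stands in for the paper's
    weight allocation ratio (value here is illustrative)."""
    return alpha * polarization_loss(h_query, t) + \
        (1.0 - alpha) * asymmetric_pair_loss(h_query, b_db, similar)

rng = np.random.default_rng(0)
h = np.tanh(rng.standard_normal((4, 16)))   # relaxed 16-bit query codes
b = np.sign(rng.standard_normal((4, 16)))   # fixed binary database codes
t = np.sign(rng.standard_normal((4, 16)))   # target hash bits
similar = np.array([True, False, True, False])
loss = pa_loss(h, b, t, similar)
```

The asymmetry (optimizing real-valued outputs against fixed binary codes) avoids relaxing both sides of each pair, while the polarization term drives outputs toward ±1 so quantization loses little information.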
Authors
LIU Huayong (刘华咏); XU Minghui (徐明慧) (Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Wuhan 430079, China; School of Computer Science, Central China Normal University, Wuhan 430079, China)
Source
Computer Science (《计算机科学》), a Peking University Core journal, 2025, Issue 8, pp. 204-213 (10 pages)
Funding
Humanities and Social Sciences Research Project of the Ministry of Education of China (21YJA870005).