Funding: Institutional Fund Projects under Grant No. (IFPIP: 638-830-1443).
Abstract: Visual attention improves the performance of image classification. Previous attention-based models have performed well, but many of them lose accuracy when confronted with inter-class and intra-class similarities and differences. Neural Controlled Differential Equations (N-CDEs) and Neural Ordinary Differential Equations (NODEs) are widely used in this context. N-CDEs can represent both inter-class and intra-class similarities and differences more clearly. To this end, an attentive neural network is proposed to generate attention maps. It uses two different types of N-CDEs, one for the hidden layers and the other to generate attention values. Two distinct attention techniques are implemented: time-wise attention, also referred to as the bottom N-CDE, and element-wise attention, called the top N-CDE. Additionally, a training methodology is proposed to guarantee that the training problem is sufficiently presented. Two classification tasks, fine-grained visual classification and multi-label classification, are used to evaluate the proposed model. The proposed methodology is applied to five publicly available datasets: CUB-200-2011, ImageNet-1K, PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO. The resulting visualizations show that N-CDEs are better suited to attention-based tasks than conventional NODEs.
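As a rough illustration of the controlled differential equation that N-CDEs build on, the sketch below integrates dz = f(z) dX with an explicit Euler scheme. It is not the paper's model: the vector field `make_vector_field`, the toy path, and the fixed sigmoid weights standing in for element-wise attention are all hypothetical choices made for this example.

```python
import numpy as np

def cde_euler(z0, path, vector_field):
    """Explicit Euler solve of dz = f(z) dX along a discretely sampled path."""
    z = np.asarray(z0, dtype=float).copy()
    for k in range(len(path) - 1):
        dX = path[k + 1] - path[k]        # increment of the control path
        z = z + vector_field(z) @ dX      # (h, d) @ (d,) -> (h,)
    return z

def make_vector_field(hidden_dim, input_dim, seed=0):
    """Hypothetical stand-in for a learned network: z -> (hidden_dim, input_dim)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.5, size=(hidden_dim * input_dim, hidden_dim))

    def f(z):
        return np.tanh(W @ z).reshape(hidden_dim, input_dim)

    return f

hidden_dim, input_dim, steps = 4, 3, 20
f = make_vector_field(hidden_dim, input_dim)
z0 = np.ones(hidden_dim)

# Toy control path with three channels (time, sine, cosine).
t = np.linspace(0.0, 1.0, steps)
path = np.stack([t, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)

z_moving = cde_euler(z0, path, f)

# A constant path has zero increments, so the hidden state never moves.
z_const = cde_euler(z0, np.tile(path[0], (steps, 1)), f)

# Assumed form of element-wise attention: weights in (0, 1) rescale each
# entry of the path before it drives the hidden state.
logits = np.linspace(-2.0, 2.0, steps)[:, None] * np.ones(input_dim)
attention = 1.0 / (1.0 + np.exp(-logits))
attended_path = attention * path
z_attended = cde_euler(z0, attended_path, f)
```

The constant-path case shows the difference from an ODE: the hidden state changes only through increments of the input path, so reweighting that path changes what the state responds to.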
Funding: Supported by the National Key R&D Program of China (2022ZD0118802) and the National Natural Science Foundation of China (Grant Nos. U20B2064 and U21B2043).
Abstract: Pixel-wise dense prediction tasks under weak supervision currently use Class Activation Maps (CAMs) to generate pseudo masks as ground truth. However, existing methods often incorporate trainable modules to expand the immature CAMs, which can add significant computational overhead and complicate training. In this work, we investigate the semantic structure information concealed within the CNN and propose a semantic structure aware inference (SSA) method that uses this information to obtain high-quality CAMs without any additional training cost. Specifically, a semantic structure modeling module (SSM) is first proposed to generate the class-agnostic semantic correlation representation, where each item denotes the degree of affinity between one category of objects and all the others. Then, the immature CAMs are refined through a dot product operation that uses the semantic structure information. Finally, the polished CAMs from different backbone stages are fused as the output. The advantage of SSA lies in its parameter-free nature and the absence of additional training costs, which makes it suitable for various weakly supervised pixel-dense prediction tasks. We conducted extensive experiments on weakly supervised object localization and weakly supervised semantic segmentation, and the results confirm the effectiveness of SSA.
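The dot-product refinement step can be sketched as follows. This is a minimal illustration under stated assumptions, not the SSA implementation: the abstract does not specify how the SSM computes affinity, so the sketch uses clipped, row-normalized cosine similarity between per-pixel feature vectors, and `refine_cam` is a hypothetical name.

```python
import numpy as np

def refine_cam(features, cam, eps=1e-8):
    """Propagate a coarse CAM along a feature-affinity matrix (no trainable parameters).

    features: (C, H, W) backbone feature map
    cam:      (H, W) coarse class activation map
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    # Unit-normalize each pixel's feature vector so dot products are cosine similarities.
    flat = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + eps)
    # Assumed affinity: non-negative cosine similarity between all pixel pairs, (HW, HW).
    affinity = np.clip(flat.T @ flat, 0.0, None)
    # Row-normalize so each refined value is a weighted average of CAM values.
    affinity = affinity / (affinity.sum(axis=1, keepdims=True) + eps)
    refined = affinity @ cam.reshape(h * w)
    return refined.reshape(h, w)

# Toy input: the left half and right half of a 2x4 map carry orthogonal features.
features = np.zeros((2, 2, 4))
features[0, :, :2] = 1.0
features[1, :, 2:] = 1.0

# The coarse CAM fires on a single pixel of the left region only.
cam = np.zeros((2, 4))
cam[0, 0] = 1.0

refined = refine_cam(features, cam)
```

In this toy case the single activation spreads evenly over the four pixels that share its features and leaves the other region at zero, which is the kind of expansion that the refinement aims for without any training.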