Funding: supported by the National Natural Science Foundation of China (No. 62241109) and the Tianjin Science and Technology Commissioner Project (No. 20YDTPJC01110).
Abstract: An improved model based on you only look once version 8 (YOLOv8) is proposed to solve the problem of low detection accuracy caused by the diversity of object sizes in optical remote sensing images. Firstly, the feature pyramid network (FPN) structure of the original YOLOv8 model is replaced by the generalized-FPN (GFPN) structure from GiraffeDet to realize "cross-layer" and "cross-scale" adaptive feature fusion, enriching the semantic and spatial information of the feature maps and improving the model's target detection ability. Secondly, a pyramid pooling module, multi-atrous spatial pyramid pooling (MASPP), is designed using the ideas of atrous convolution and the feature pyramid structure to extract multi-scale features, thereby improving the model's handling of multi-scale objects. The experimental results show that the improved YOLOv8 model achieves a detection accuracy of 92% and a mean average precision (mAP) of 87.9% on the DIOR dataset, respectively 3.5% and 1.7% higher than those of the original model. This demonstrates that the proposed model's ability to detect and classify multi-scale optical remote sensing targets has been improved.
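The abstract does not specify MASPP's exact branches or learned weights; the following NumPy sketch (hypothetical 3×3 averaging kernel and dilation rates 1, 2, 4 standing in for learned convolutions) illustrates the underlying idea of parallel atrous branches plus an image-level pooling branch stacked into a multi-scale feature map:

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    # 'same'-padded dilated cross-correlation over a single-channel map
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            patch = xp[i : i + dilation * (k - 1) + 1 : dilation,
                       j : j + dilation * (k - 1) + 1 : dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

def maspp(x, dilations=(1, 2, 4)):
    # parallel atrous branches + global-average branch, stacked on a channel axis
    kernel = np.ones((3, 3)) / 9.0  # averaging kernel stands in for learned weights
    branches = [dilated_conv2d(x, kernel, d) for d in dilations]
    branches.append(np.full_like(x, x.mean()))  # image-level pooling branch
    return np.stack(branches, axis=0)
```

Larger dilations widen the receptive field without extra parameters, which is what lets a single module see objects at several scales.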
Funding: supported by the National Natural Science Foundation of China (62302167, 62477013), the Natural Science Foundation of Shanghai (No. 24ZR1456100), the Science and Technology Commission of Shanghai Municipality (No. 24DZ2305900), and the Shanghai Municipal Special Fund for Promoting High-Quality Development of Industries (2211106).
Abstract: Multi-label image classification is a challenging task due to the diverse sizes and complex backgrounds of objects in images. Obtaining class-specific precise representations at different scales is a key aspect of feature representation. However, existing methods often rely on a single-scale deep feature, neglecting shallower and deeper layer features, which poses challenges when predicting objects of varying scales within the same image. Although some studies have explored multi-scale features, they rarely address the flow of information between scales or efficiently obtain class-specific precise representations for features at different scales. To address these issues, we propose a two-stage, three-branch Transformer-based framework. The first stage incorporates multi-scale image feature extraction and hierarchical scale attention. This design enables the model to consider objects at various scales while enhancing the flow of information across different feature scales, improving the model's generalization to diverse object scales. The second stage includes a global feature enhancement module and a region selection module. The global feature enhancement module strengthens interconnections between different image regions, mitigating the issue of incomplete representations, while the region selection module models the cross-modal relationships between image features and labels. Together, these components enable the efficient acquisition of class-specific precise feature representations. Extensive experiments on public datasets, including COCO2014, VOC2007, and VOC2012, demonstrate the effectiveness of the proposed method, which achieves consistent performance gains of 0.3%, 0.4%, and 0.2% over state-of-the-art methods on the three datasets, respectively. These results validate the reliability and superiority of our approach for multi-label image classification.
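The paper's hierarchical scale attention is not detailed in the abstract; a minimal sketch of the general mechanism (softmax attention over per-scale feature vectors, with a hypothetical class query) shows how information from several scales can be weighted and fused into one representation:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scale_attention(features, query):
    # features: list of (d,) vectors, one per feature-map scale; query: (d,)
    F = np.stack(features)                     # (num_scales, d)
    scores = F @ query / np.sqrt(len(query))   # scaled dot-product scores
    w = softmax(scores)                        # attention weights over scales
    return w @ F, w                            # fused (d,) vector and the weights
```

A per-class query lets each label attend to the scale at which its objects are best represented, which is the core of class-specific multi-scale fusion.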
Funding: supported by the Beijing Natural Science Foundation (No. 5252014), the Open Fund of the Key Laboratory of Urban Ecological Environment Simulation and Protection, Ministry of Ecology and Environment of the People's Republic of China (No. UEESP-202502), and the National Natural Science Foundation of China (Nos. 62303063 and 32371874).
Abstract: Bird monitoring and protection are essential for maintaining biodiversity, and fine-grained bird classification has become a key focus in this field. Audio-visual modalities provide critical cues for this task, but robust feature extraction and efficient fusion remain major challenges. We introduce a multi-stage fine-grained audio-visual fusion network (MSFG-AVFNet) for fine-grained bird species classification, which addresses these challenges through two key components: (1) an audio-visual feature extraction module, which adopts a multi-stage fine-tuning strategy to provide high-quality unimodal features, laying a solid foundation for modality fusion; (2) an audio-visual feature fusion module, which combines a max pooling aggregation strategy with a novel audio-visual loss function to achieve effective and robust feature fusion. Experiments were conducted on the self-built AVB81 dataset and the publicly available SSW60 dataset, which contain data from 81 and 60 bird species, respectively. Comprehensive experiments demonstrate that our approach achieves notable performance gains, outperforming existing state-of-the-art methods. These results highlight its effectiveness in leveraging audio-visual modalities for fine-grained bird classification and its potential to support ecological monitoring and biodiversity research.
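The abstract names max pooling aggregation as the fusion strategy but gives no architecture details; the sketch below (hypothetical embedding dimensions, simple concatenation after pooling) illustrates the general pattern of pooling per-frame unimodal embeddings into clip-level vectors before fusing them:

```python
import numpy as np

def maxpool_aggregate(frames):
    # frames: (T, d) per-frame embeddings -> (d,) clip-level embedding;
    # max pooling keeps the strongest activation of each feature over time
    return frames.max(axis=0)

def fuse(audio_frames, visual_frames):
    # late fusion: aggregate each modality, then concatenate
    a = maxpool_aggregate(audio_frames)
    v = maxpool_aggregate(visual_frames)
    return np.concatenate([a, v])
```

Max pooling is robust to frames where one modality is uninformative (a silent clip, an occluded bird), since only the strongest evidence per feature survives aggregation.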
Funding: funded by the Jilin Provincial Department of Science and Technology, grant number 20230101208JC.
Abstract: High-speed train engine rolling bearings play a crucial role in maintaining engine health and minimizing operational losses during train operation. To solve the problems of low diagnostic accuracy and model instability caused by noise during fault detection, a rolling bearing fault diagnosis model based on cross-attention fusion of WDCNN and BiLSTM is proposed. The first layer of the wide-kernel deep convolutional neural network (WDCNN) is used to extract the local features of the signal and suppress high-frequency noise. A bidirectional long short-term memory network (BiLSTM) is used to obtain global time-series features of the signal. Cross-attention combines the WDCNN and BiLSTM branches so that the model can recognize more comprehensive feature information from the signal. Meanwhile, to improve accuracy, variational mode decomposition (VMD) is used to decompose the signals, which are then filtered and reconstructed using envelope entropy and kurtosis; this pre-processing ensures that the data fed to the neural network contain richer feature information. The feasibility of the model is tested and experimentally validated using publicly available datasets. The experimental results show that the accuracy of the proposed model is significantly improved compared to the traditional WDCNN, BiLSTM, and WDCNN-BiLSTM models.
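The abstract does not state how envelope entropy and kurtosis are combined to pick VMD components; the sketch below (crude |x| envelope instead of a Hilbert-transform envelope, and a hypothetical score that rewards high kurtosis and low normalized entropy) shows the selection principle:

```python
import numpy as np

def kurtosis(x):
    # fourth standardized moment; impulsive fault signatures score high
    x = x - x.mean()
    return np.mean(x ** 4) / (np.mean(x ** 2) ** 2 + 1e-12)

def envelope_entropy(x):
    # entropy of a crude |x|-based envelope (Hilbert envelope in practice);
    # lower entropy indicates a more structured, fault-bearing component
    env = np.abs(x) + 1e-12
    p = env / env.sum()
    return -(p * np.log(p)).sum()

def select_component(components):
    # hypothetical ranking: favour high kurtosis and low normalized entropy
    n = len(components[0])
    scores = [kurtosis(c) - envelope_entropy(c) / np.log(n) for c in components]
    return int(np.argmax(scores))
```

Reconstructing the signal from only the highest-scoring components filters noise while keeping the impulsive content that characterizes bearing faults.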
Abstract: In response to the problem of traditional methods ignoring audio-modality tampering, this study explores an effective deepfake video detection technique that improves detection precision and reliability by fusing lip images and audio signals. The main method is lip-audio matching detection based on a Siamese neural network, combined with band-pass-filtered Mel-frequency cepstral coefficient (MFCC) feature extraction, an improved dual-branch Siamese network structure, and a two-stream network design. Firstly, the video stream is preprocessed to extract lip images, and the audio stream is preprocessed to extract MFCC features. Then, these features are processed separately through the two branches of the Siamese network. Finally, the model is trained and optimized through fully connected layers and loss functions. The experimental results show that the model's testing accuracy on the Lip Reading in the Wild (LRW) dataset reaches 92.3%, the recall rate is 94.3%, and the F1 score is 93.3%, significantly better than the results of convolutional neural network (CNN) and long short-term memory (LSTM) models. In the validation of multi-resolution image streams, the highest accuracy of dual-resolution image streams reaches 94%. Band-pass filters effectively improve the signal-to-noise ratio of deepfake video detection when processing different types of audio signals. The real-time processing performance of the model is also excellent, and it achieves an average score of up to 5 in user studies. These data demonstrate that the proposed method can effectively fuse visual and audio information in deepfake video detection and accurately identify inconsistencies between video and audio, verifying the effectiveness of lip-audio modality fusion in improving detection performance.
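The exact loss function and branch architecture are not given in the abstract; a minimal sketch of the standard Siamese pattern (a shared linear projection standing in for each branch, and the classic contrastive loss over the embedding distance) illustrates how matched and tampered lip/audio pairs can be separated:

```python
import numpy as np

def branch_embed(x, W):
    # one Siamese branch: shared weights W project a feature vector, then L2-normalise
    z = np.tanh(W @ x)
    return z / (np.linalg.norm(z) + 1e-12)

def contrastive_loss(z_lip, z_audio, match, margin=1.0):
    # match = 1 for genuine lip/audio pairs, 0 for tampered pairs;
    # genuine pairs are pulled together, tampered pairs pushed past the margin
    d = np.linalg.norm(z_lip - z_audio)
    return match * d ** 2 + (1 - match) * max(margin - d, 0.0) ** 2
```

At inference, a pair whose embedding distance exceeds a threshold is flagged as a lip-audio mismatch, i.e. likely tampering.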
Funding: supported by the National Key R&D Program of China (No. 2022YFB3205101) and NSAF (No. U2230116).
Abstract: To improve image quality under low-illumination conditions, a novel low-light image enhancement method based on multi-illumination estimation and multi-scale fusion (MIMS) is proposed in this paper. Firstly, the illumination is processed by contrast-limited adaptive histogram equalization (CLAHE), an adaptive complementary gamma function (ACG), and an adaptive detail-preserving S-curve (ADPS), respectively, to obtain three components. Then, the fusion-relevant features, exposure and color contrast, are selected as the weight maps. Subsequently, these components and weight maps are fused at multiple scales to generate the enhanced illumination. Finally, the enhanced images are obtained by multiplying the enhanced illumination with the reflectance. Compared with existing approaches, the proposed method achieves average increases of 0.81% and 2.89% in the structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR), and decreases of 6.17% and 32.61% in the natural image quality evaluator (NIQE) and gradient magnitude similarity deviation (GMSD), respectively.
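The paper fuses CLAHE, ACG, and ADPS components at multiple pyramid scales; the simplified single-scale sketch below (gamma-corrected components standing in for all three operators, and a well-exposedness weight as the fusion map, both assumptions) shows the weighted-fusion-then-multiply-by-reflectance pipeline:

```python
import numpy as np

def enhance_illumination(illum, gammas=(0.4, 0.7, 1.0), eps=1e-6):
    # derive complementary gamma-corrected components from one illumination map
    comps = [illum ** g for g in gammas]
    # well-exposedness weight: favour mid-range pixels in each component
    weights = [c * (1.0 - c) + eps for c in comps]
    wsum = np.sum(weights, axis=0)
    return np.sum([c * w for c, w in zip(comps, weights)], axis=0) / wsum

def enhance(image, illum, eps=1e-6):
    # Retinex-style decomposition: image = illumination * reflectance
    reflect = image / (illum + eps)
    return np.clip(enhance_illumination(illum) * reflect, 0.0, 1.0)
```

In the full method the weighted fusion runs over Laplacian/Gaussian pyramid levels rather than a single scale, which avoids halo artifacts at illumination boundaries.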
Abstract: Aiming at the problem of low detection accuracy caused by the varying scales of apple leaf disease spots and their similarity to the background, this paper proposes a multi-scale lightweight network (MSL-Net). Firstly, a multiplexed aggregated feature extraction network is proposed, using a residual bottleneck block (RES-Bottleneck) and middle partial convolution (MP-Conv) to capture multi-scale spatial features and sharpen the focus on disease features, improving the differentiation between disease targets and background information. Secondly, a lightweight feature fusion network is designed using scale-fuse concatenation (SF-Cat) and a triple-scale sequence feature fusion (TSSF) module to merge multi-scale feature maps comprehensively. Depthwise convolution (DWConv) and GhostNet lighten the network, while the cross-stage partial bottleneck with 3 convolutions ghost-normalization attention module (C3-GN) reduces missed detections by suppressing irrelevant background information. Finally, soft non-maximum suppression (Soft-NMS) is used in the post-processing stage to alleviate misdetection of dense disease sites. The results show that MSL-Net improves mean average precision at an intersection-over-union threshold of 0.5 (mAP@0.5) by 2.0% over the baseline you only look once version 5s (YOLOv5s), reduces parameters by 44%, and reduces computation by 27%, outperforming other state-of-the-art (SOTA) models overall. The method also shows excellent performance compared with the latest research.
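Soft-NMS is a standard post-processing step; the sketch below implements the Gaussian variant (hypothetical sigma and score threshold), which decays the confidence of overlapping boxes instead of discarding them outright, so densely packed disease spots are not suppressed as duplicates:

```python
import numpy as np

def iou(a, b):
    # boxes are (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    # Gaussian Soft-NMS: decay overlapping scores instead of hard suppression
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        i = int(np.argmax(scores))
        b, s = boxes.pop(i), scores.pop(i)
        keep.append((b, s))
        scores = [sc * np.exp(-iou(b, bb) ** 2 / sigma)
                  for bb, sc in zip(boxes, scores)]
        survivors = [(bb, sc) for bb, sc in zip(boxes, scores) if sc > score_thresh]
        boxes = [bb for bb, _ in survivors]
        scores = [sc for _, sc in survivors]
    return keep
```

Compared with hard NMS, a neighbouring true positive with high overlap survives with a reduced score rather than being removed, which is exactly the dense-object case the paper targets.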