Fusing hand-based features in multi-modal biometric recognition enhances anti-spoofing capabilities and leverages inter-modal correlation to improve recognition performance. Judiciously exploiting the correlation among multimodal features can further strengthen the robustness of the system. Nevertheless, two issues persist in multi-modal feature fusion recognition. Firstly, efforts to improve fusion recognition performance have not comprehensively considered the correlations among distinct modalities. Secondly, during modal fusion, improper weight selection diminishes the salience of crucial modal features and thereby lowers overall recognition performance. To address these two issues, we introduce an enhanced DenseNet multimodal recognition network founded on feature-level fusion. The information from the three modalities is fused in the manner of RGB channels, and the network input strengthens the correlation between modalities through channel correlation. Within the enhanced DenseNet network, the Efficient Channel Attention Network (ECA-Net) dynamically adjusts the weight of each channel to amplify the salience of crucial information in each modal feature, while depthwise separable convolution markedly reduces the number of training parameters and further enhances feature correlation. Experimental evaluations were conducted on four multimodal databases built from six unimodal databases, including the multispectral palmprint and palm vein databases from the Chinese Academy of Sciences. The equal error rate (EER) values were 0.0149%, 0.0150%, 0.0099%, and 0.0050%, respectively. Compared with other network methods for palmprint, palm vein, and finger vein fusion recognition, this approach substantially enhances recognition performance, rendering it suitable for high-security environments with practical applicability. The experiments in this article utilized a modest sample database comprising 200 individuals; the next phase of this work is to extend the method to larger databases.
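As a rough illustration of the two building blocks named above, the following PyTorch sketch shows a generic ECA layer and a depthwise separable convolution; the kernel and layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Efficient Channel Attention: re-weights channels with a small 1D conv
    over the globally pooled descriptor (no dimensionality reduction)."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> per-channel descriptor (B, C, 1, 1)
        y = self.avg_pool(x)
        # 1D conv across the channel axis captures local cross-channel interaction
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        return x * self.sigmoid(y)

def depthwise_separable(in_ch: int, out_ch: int) -> nn.Sequential:
    """Depthwise (per-channel) 3x3 conv followed by a pointwise 1x1 conv:
    far fewer parameters than a standard 3x3 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )
```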
In the article “A Lightweight Approach for Skin Lesion Detection through Optimal Features Fusion” by Khadija Manzoor, Fiaz Majeed, Ansar Siddique, Talha Meraj, Hafiz Tayyab Rauf, Mohammed A. El-Meligy, Mohamed Sharaf, and Abd Elatty E. Abd Elgawad, Computers, Materials & Continua, 2022, Vol. 70, No. 1, pp. 1617–1630, DOI: 10.32604/cmc.2022.018621, URL: https://www.techscience.com/cmc/v70n1/44361, there was an error regarding the affiliation of the author Hafiz Tayyab Rauf. Instead of “Centre for Smart Systems, AI and Cybersecurity, Staffordshire University, Stoke-on-Trent, UK”, the affiliation should be “Independent Researcher, Bradford, BD80HS, UK”.
The fusion of infrared and visible images should emphasize the salient targets in the infrared image while preserving the textural details of the visible image. To meet these requirements, an autoencoder-based method for infrared and visible image fusion is proposed. The encoder, designed according to the optimization objective, consists of a base encoder and a detail encoder, which extract low-frequency and high-frequency information from the image, respectively. Because this extraction may leave some information uncaptured, a compensation encoder is proposed to supplement the missing information. Multi-scale decomposition is also employed to extract image features more comprehensively. The decoder combines the low-frequency, high-frequency, and supplementary information to obtain multi-scale features. Subsequently, an attention strategy and a fusion module are introduced to perform multi-scale fusion for image reconstruction. Experimental results on three datasets show that the fused images generated by this network effectively retain salient targets while being more consistent with human visual perception.
To address the challenge of missing modal information in entity alignment, to mitigate the information loss or bias arising from modal heterogeneity during fusion, and to capture shared information across modalities, this paper proposes a Multi-modal Pre-synergistic Entity Alignment model based on Cross-modal Mutual Information Strategy Optimization (MPSEA). The model first employs independent encoders to process multi-modal features, including text, images, and numerical values. Next, a multi-modal pre-synergistic fusion mechanism integrates graph-structural and visual modal features into the textual modality as preparatory information. This pre-fusion strategy enables unified perception of heterogeneous modalities at the model's initial stage, reducing discrepancies during the fusion process. Finally, using cross-modal deep perception reinforcement learning, the model achieves adaptive multilevel feature fusion between modalities, supporting the learning of more effective alignment strategies. Extensive experiments on multiple public datasets show that MPSEA achieves gains of up to 7% in Hits@1 and 8.2% in MRR on the FBDB15K dataset, and up to 9.1% in Hits@1 and 7.7% in MRR on the FBYG15K dataset, compared with existing state-of-the-art methods. These results confirm the effectiveness of the proposed model.
Multi-modal Named Entity Recognition (MNER) aims to identify meaningful textual entities more accurately by integrating information from images. Previous work has focused on extracting visual semantics at a fine-grained level, or on obtaining entity-related external knowledge from knowledge bases or Large Language Models (LLMs). However, these approaches ignore the poor semantic correlation between the visual and textual modalities in MNER datasets and do not explore different multi-modal fusion approaches. In this paper, we present MMAVK, a multi-modal named entity recognition model with auxiliary visual knowledge and word-level fusion, which leverages a Multi-modal Large Language Model (MLLM) as an implicit knowledge base and extracts vision-based auxiliary knowledge from the image for more accurate and effective recognition. Specifically, we propose vision-based auxiliary knowledge generation, which guides the MLLM with target-specific prompts to extract external knowledge exclusively from images to aid entity recognition, thus avoiding the redundant recognition and cognitive confusion caused by processing image-text pairs simultaneously. Furthermore, we employ a word-level multi-modal fusion mechanism to fuse the extracted external knowledge with each word embedding produced by the transformer-based encoder. Extensive experimental results demonstrate that MMAVK outperforms or equals state-of-the-art methods on the two classical MNER datasets, even when the large models it employs have significantly fewer parameters than other baselines.
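The exact fusion mechanism is specific to the paper, but a common form of word-level fusion is a learned gate that mixes each token embedding with a knowledge vector; the following PyTorch sketch illustrates this generic pattern (the module structure and broadcast of a single knowledge vector per sentence are assumptions).

```python
import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    """Gated word-level fusion: each token embedding is mixed with an
    external-knowledge vector via a learned, per-token gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, words: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # words: (B, T, D); knowledge: (B, D), broadcast to every token
        k = knowledge.unsqueeze(1).expand_as(words)
        g = torch.sigmoid(self.gate(torch.cat([words, k], dim=-1)))
        return g * words + (1 - g) * k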
Biometric characteristics have played a vital role in security for the last few years. Human gait classification in video sequences is an important biometric attribute used for security purposes. A new framework for human gait classification in video sequences using deep learning (DL) fusion and posterior probability-based moth flame optimization (MFO) is proposed. In the first step, the video frames are resized and fine-tuned by two pre-trained lightweight DL models, EfficientNetB0 and MobileNetV2, selected for their top-5 accuracy and small number of parameters. Both models are then trained through deep transfer learning, and the extracted deep features are fused using a voting scheme. In the last step, the authors develop a posterior probability-based MFO feature selection algorithm to select the best features. The selected features are classified using several supervised learning methods. The publicly available CASIA-B dataset was employed for the experimental process. On this dataset, the authors selected six angles (0°, 18°, 90°, 108°, 162°, and 180°) and obtained average accuracies of 96.9%, 95.7%, 86.8%, 90.0%, 95.1%, and 99.7%, respectively. The results demonstrate improved accuracy and significantly reduced computational time compared with recent state-of-the-art techniques.
As the number and complexity of sensors in autonomous vehicles continue to rise, multimodal fusion-based object detection algorithms are increasingly being used to detect 3D environmental information, significantly advancing the development of perception technology in autonomous driving. To further promote the development of fusion algorithms and improve detection performance, this paper discusses the advantages and recent advancements of multimodal fusion-based object detection algorithms. Starting from single-modal sensor detection, the paper provides a detailed overview of typical sensors used in autonomous driving and introduces object detection methods based on images and point clouds. Image-based detection methods are categorized into monocular and binocular detection according to the input type. Point cloud-based detection methods are classified into projection-based, voxel-based, point cluster-based, pillar-based, and graph structure-based approaches according to the technical pathway for processing point cloud features. Additionally, multimodal fusion algorithms are divided into Camera-LiDAR fusion, Camera-Radar fusion, Camera-LiDAR-Radar fusion, and other sensor fusion methods according to the types of sensors involved. Furthermore, the paper identifies five key future research directions in this field, aiming to provide insights for researchers engaged in multimodal fusion-based object detection and to encourage broader attention to its research and application.
[Objective] Accurate prediction of tomato growth height is crucial for optimizing production environments in smart farming. However, current prediction methods predominantly rely on empirical, mechanistic, or learning-based models that utilize either image data or environmental data. These methods fail to fully leverage multi-modal data to capture the diverse aspects of plant growth comprehensively. [Methods] To address this limitation, a two-stage phenotypic feature extraction (PFE) model based on the deep learning algorithms of recurrent neural networks (RNN) and long short-term memory (LSTM) was developed. The model integrated environment and plant information to provide a holistic understanding of the growth process, employed phenotypic and temporal feature extractors to comprehensively capture both types of features, and enabled a deeper understanding of the interaction between tomato plants and their environment, ultimately leading to highly accurate predictions of growth height. [Results and Discussions] The experimental results showed the model's effectiveness: when predicting the next two days based on the past five days, the PFE-based RNN and LSTM models achieved mean absolute percentage errors (MAPE) of 0.81% and 0.40%, respectively, which were significantly lower than the 8.00% MAPE of the large language model (LLM) and the 6.72% MAPE of the Transformer-based model. In longer-term predictions, the 10-day prediction for 4 days ahead and the 30-day prediction for 12 days ahead, the PFE-RNN model continued to outperform the two baseline models, with MAPEs of 2.66% and 14.05%, respectively. [Conclusions] The proposed method, which leverages phenotypic-temporal collaboration, shows great potential for intelligent, data-driven management of tomato cultivation, making it a promising approach for enhancing the efficiency and precision of smart tomato planting management.
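For reference, the MAPE reported above is the mean absolute deviation expressed as a percentage of the true value; a minimal NumPy version:

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# e.g., mape([100.0, 120.0], [99.0, 121.2]) -> 1.0 (i.e., 1% average error)
```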
Laser cleaning is a highly nonlinear physical process. To address the poor detection performance of a single modality (e.g., acoustic or visual) and the low utilization of inter-modal information, a multi-modal feature fusion network model was constructed based on a laser paint removal experiment. The alignment of heterogeneous data across modalities was solved by combining piecewise aggregate approximation and the Gramian angular field. Moreover, an attention mechanism was introduced to optimize the dual-path network and dense connection network, enabling the sampled characteristics to be extracted and integrated. Consequently, multi-modal discriminant detection of laser paint removal was realized. According to the experimental results, the verification accuracy of the constructed model on the experimental dataset was 99.17%, which is 5.77% higher than the best single-modal detection result for laser paint removal. Optimizing the feature extraction network with the attention mechanism increased the model accuracy by 3.3%. The results verify the improved classification performance of the constructed multi-modal feature fusion model in detecting laser paint removal, the effective integration of acoustic and visual image data, and the accurate detection of laser paint removal.
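The PAA-plus-GAF step converts a 1D acoustic signal into a 2D image so it can be aligned and fused with visual data. Below is a NumPy sketch of the standard transforms; the segment count and the summation-field (GASF) variant are assumptions, as the paper may use different settings.

```python
import numpy as np

def paa(series: np.ndarray, n_segments: int) -> np.ndarray:
    """Piecewise aggregate approximation: mean of equal-length segments."""
    return np.array([seg.mean() for seg in np.array_split(series, n_segments)])

def gasf(series: np.ndarray) -> np.ndarray:
    """Gramian angular summation field: encodes a 1D series as a 2D image."""
    x = np.asarray(series, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0  # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))               # polar-angle encoding
    return np.cos(phi[:, None] + phi[None, :])           # G[i, j] = cos(phi_i + phi_j)

# e.g., reduce an acoustic frame to a 64x64 image: img = gasf(paa(frame, 64))
```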
Scene classification of high-resolution remote sensing (HRRS) images is an important research topic and has been applied broadly in many fields. Deep learning methods have shown high potential in this domain, owing to their powerful ability to characterize complex patterns. However, deep learning methods omit some global and local information of the HRRS image. To this end, this article adopts explicit global and local information to complement deep models. Specifically, we use a patch-based MS-CLBP method to acquire global and local representations, and we treat a pretrained CNN model as a feature extractor, extracting deep hierarchical features from the fully connected layers. After Fisher vector (FV) encoding, we obtain a holistic visual representation of the scene image. We view scene classification as a reconstruction procedure and train several class-specific stacked denoising autoencoders (SDAEs), i.e., one SDAE per class, classifying the test image according to the reconstruction error. Experimental results show that our combination method outperforms state-of-the-art deep learning classification methods without employing fine-tuning.
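The classify-by-reconstruction rule is: train one autoencoder per class, then assign a test sample to the class whose autoencoder reconstructs it with the lowest error. A compact scikit-learn sketch follows, with a shallow MLP standing in for the paper's stacked denoising autoencoder; the noise level and hidden size are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_class_autoencoders(X, y, hidden=64, noise=0.05, seed=0):
    """One shallow denoising-style autoencoder per class (input reconstructs itself)."""
    rng = np.random.default_rng(seed)
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        ae = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=500, random_state=seed)
        ae.fit(Xc + noise * rng.standard_normal(Xc.shape), Xc)  # denoising corruption
        models[c] = ae
    return models

def predict_by_reconstruction(models, X):
    """Assign each sample to the class whose autoencoder reconstructs it best."""
    classes = np.array(sorted(models))
    errors = np.stack([np.mean((models[c].predict(X) - X) ** 2, axis=1)
                       for c in classes], axis=1)
    return classes[np.argmin(errors, axis=1)]
```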
A novel face recognition method based on the fusion of multi-modal face parts with Gabor features (MMP-GF) is proposed in this paper. Firstly, the bare face image detached from the normalized image is convolved with a family of Gabor kernels, and then, according to the face structure and the key-point locations, the calculated Gabor images are divided into five parts: Gabor face, Gabor eyebrow, Gabor eye, Gabor nose, and Gabor mouth. After that, the multi-modal Gabor features are spatially partitioned into non-overlapping regions, and the averages of the regions are concatenated into a low-dimensional feature vector, whose dimension is further reduced by principal component analysis (PCA). In the decision-level fusion, the match results calculated from the five parts are combined according to linear discriminant analysis (LDA), and a normalized matching algorithm is used to improve performance. Experiments on the FERET database show that the proposed MMP-GF method achieves good robustness to expression and age variations.
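To make the per-part feature pipeline concrete, the following OpenCV/NumPy sketch filters one face part with a small Gabor bank and concatenates regional mean responses; the bank size, grid, and kernel parameters are illustrative assumptions, not the paper's exact family.

```python
import cv2
import numpy as np

def gabor_part_features(part_img: np.ndarray, grid: int = 4) -> np.ndarray:
    """Convolve one face part (grayscale) with a small Gabor bank, then
    concatenate the mean response of each non-overlapping region."""
    feats = []
    for theta in np.arange(0, np.pi, np.pi / 4):      # 4 orientations
        for lambd in (4.0, 8.0):                      # 2 wavelengths
            kernel = cv2.getGaborKernel((21, 21), sigma=4.0, theta=theta,
                                        lambd=lambd, gamma=0.5, psi=0)
            resp = np.abs(cv2.filter2D(part_img.astype(np.float32),
                                       cv2.CV_32F, kernel))
            h, w = resp.shape
            for i in range(grid):                     # regional averages
                for j in range(grid):
                    feats.append(resp[i*h//grid:(i+1)*h//grid,
                                      j*w//grid:(j+1)*w//grid].mean())
    return np.array(feats)  # reduce further with sklearn.decomposition.PCA
```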
In geometry processing, symmetry research benefits from global geometric features of complete shapes, but the shape of an object captured in real-world applications is often incomplete due to limited sensor resolution, a single viewpoint, and occlusion. Different from existing works that predict symmetry from a complete shape, we propose a learning approach for symmetry prediction based on a single RGB-D image. Instead of directly predicting the symmetry from incomplete shapes, our method consists of two modules, i.e., the multi-modal feature fusion module and the detection-by-reconstruction module. Firstly, we build a channel-transformer network (CTN) to extract cross-fusion features from the RGB-D input as the multi-modal feature fusion module, which helps us aggregate features from the color and the depth separately. Then, our self-reconstruction network based on a 3D variational auto-encoder (3D-VAE) takes the global geometric features as input, followed by a prediction network to detect the symmetry. Our experiments are conducted on three public datasets, ShapeNet, YCB, and ScanNet, and we demonstrate that our method produces reliable and accurate results.
To solve the difficult detection of far and hard objects caused by the sparseness and insufficient semantic information of LiDAR point clouds, a 3D object detection network with multi-modal data adaptive fusion is proposed, which makes use of multi-neighborhood voxel information and image information. Firstly, we design an improved ResNet that maintains the structural information of far and hard objects in low-resolution feature maps, making it better suited to the detection task; meanwhile, the semantics of each image feature map are enhanced with semantic information from all subsequent feature maps. Secondly, we extract multi-neighborhood context information with different receptive field sizes to compensate for the sparseness of the point cloud, improving the ability of voxel features to represent the spatial structure and semantic information of objects. Finally, we propose a multi-modal feature adaptive fusion strategy that uses learnable weights to express the contribution of each modal feature to the detection task, with voxel attention further enhancing the fused feature expression of valid target objects. The experimental results on the KITTI benchmark show that this method outperforms VoxelNet by remarkable margins, increasing the AP by 8.78% and 5.49% on the medium and hard difficulty levels. Meanwhile, our method achieves greater detection performance than many mainstream multi-modal methods, outperforming the AP of MVX-Net by 1% on the medium and hard difficulty levels.
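The core of such an adaptive fusion strategy is a set of learnable weights, normalized so the network can trade the modalities off against each other. Below is a minimal PyTorch sketch of this generic idea; it is a stand-in for, not a reproduction of, the paper's exact module.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuses per-modality feature maps with learnable, softmax-normalized
    weights so the network decides each modality's contribution."""
    def __init__(self, n_modalities: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_modalities))

    def forward(self, feats):
        # feats: list of (B, C, ...) tensors, one per modality, same shape
        w = torch.softmax(self.logits, dim=0)
        return sum(w[i] * f for i, f in enumerate(feats))

# fused = AdaptiveFusion(2)([voxel_features, image_features])
```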
The Coronavirus Disease 2019 (COVID-19) pandemic poses worldwide challenges surpassing the boundaries of country, religion, race, and economy. The current benchmark method for the detection of COVID-19 is reverse transcription polymerase chain reaction (RT-PCR) testing. Although this testing method is accurate enough for the diagnosis of COVID-19, it is time-consuming, expensive, expert-dependent, and violates social distancing. In this paper, we propose an effective multimodality-based and feature fusion-based (MMFF) COVID-19 detection technique using deep neural networks. For multi-modality, we utilized the cough, breath, and voice samples of healthy as well as COVID-19 patients from the publicly available COSWARA dataset. Several useful features were extracted from these modalities and fed as input to long short-term memory recurrent neural networks for classification. An extensive set of experimental analyses was performed to evaluate the performance of our proposed approach. The experimental results showed that our approach outperformed four recently published baseline approaches. We believe that our technique will assist potential users in diagnosing COVID-19 without the intervention of any expert in a minimal amount of time.
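A typical realization of this pipeline feeds per-frame acoustic features (e.g., MFCCs) to an LSTM and classifies its final hidden state; the PyTorch sketch below shows that shape, with the feature and layer sizes as assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class AudioLSTMClassifier(nn.Module):
    """LSTM over per-frame acoustic features (e.g., MFCCs from cough, breath,
    or voice recordings), classifying from the final hidden state."""
    def __init__(self, n_features: int = 13, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features) -> class logits (batch, n_classes)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])
```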
PowerShell has been widely deployed in fileless malware and advanced persistent threat (APT) attacks due to its high stealthiness and living-off-the-land technique. However, existing works mainly focus on deobfuscation and malicious-script detection, lacking classification of malicious PowerShell families and behavior analysis. Moreover, state-of-the-art methods fail to capture fine-grained features and semantic relationships, resulting in low robustness and accuracy. To this end, we propose PowerDetector, a novel malicious PowerShell script detector based on multi-modal semantic fusion and deep learning. Specifically, we design four feature extraction methods to extract key features from the character, token, abstract syntax tree (AST), and semantic knowledge graph views. Then, we design four embeddings (i.e., Char2Vec, Token2Vec, AST2Vec, and Rela2Vec) and construct a multi-modal fusion algorithm to concatenate the feature vectors from the different views. Finally, we propose a combined model based on a transformer and a CNN-BiLSTM to implement PowerShell family detection. Our experiments with five types of PowerShell attacks show that PowerDetector can accurately detect various obfuscated and stealthy PowerShell scripts, with a 0.9402 precision, a 0.9358 recall, and a 0.9374 F1-score. Furthermore, through single-modal and multi-modal comparison experiments, we demonstrate that PowerDetector's multi-modal embedding and deep learning model achieve better accuracy and even identify more unknown attacks.
Multi-modal fusion technology has gradually become a fundamental task in many fields, such as autonomous driving, smart healthcare, sentiment analysis, and human-computer interaction, and is rapidly becoming a dominant research direction due to its powerful perception and judgment capabilities. In complex scenes, multi-modal fusion technology utilizes the complementary characteristics of multiple data streams to fuse different data types and achieve more accurate predictions. However, achieving outstanding performance is challenging because of equipment performance limitations, missing information, and data noise. This paper comprehensively reviews existing methods based on multi-modal fusion techniques and completes a detailed and in-depth analysis. According to the data fusion stage, multi-modal fusion has four primary methods: early fusion, deep fusion, late fusion, and hybrid fusion. The paper surveys the three major multi-modal fusion technologies that can significantly enhance the effect of data fusion and further explores the applications of multi-modal fusion technology in various fields. Finally, it discusses the challenges and explores potential research opportunities. Multi-modal tasks still need intensive study because of data heterogeneity and quality: preserving complementary information and eliminating redundant information between modalities is critical, and invalid data fusion methods may introduce extra noise and lead to worse results. This paper provides a comprehensive and detailed summary in response to these challenges.
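The fusion-stage taxonomy is easiest to see in code: early fusion concatenates features before a single model, while late fusion combines per-modality decisions. A minimal scikit-learn sketch of the two extremes follows; the modality names and the probability-averaging rule are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def early_fusion_fit(X_audio, X_video, y):
    """Early fusion: concatenate modality features, then train one model."""
    return LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

def late_fusion_predict(models, X_per_modality):
    """Late fusion: one trained model per modality; average their class
    posteriors and take the argmax."""
    probs = np.mean([m.predict_proba(X) for m, X in zip(models, X_per_modality)],
                    axis=0)
    return probs.argmax(axis=1)
```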
Image classification based on bag-of-words (BOW) has broad application prospects in the pattern recognition field, but shortcomings such as reliance on a single feature and low classification accuracy are apparent. To deal with this problem, this paper proposes to combine two ingredients: (i) three mutually complementary features are adopted to describe the images, namely the pyramid histogram of words (PHOW), the pyramid histogram of color (PHOC), and the pyramid histogram of orientated gradients (PHOG); (ii) an adaptive feature-weight-adjusted image categorization algorithm based on the SVM and the decision-level fusion of multiple features is employed. Experiments are carried out on the Caltech101 database, which confirms the validity of the proposed approach. The experimental results show that the classification accuracy of the proposed method is 7%-14% higher than that of traditional BOW methods. With full utilization of global, local, and spatial information, the algorithm describes the feature information of the image far more completely and flexibly, through multi-feature fusion and the pyramid structure formed by spatial multi-resolution decomposition of the image. Significant improvements in classification accuracy are achieved as a result.
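Decision-level fusion of this kind can be sketched as one probabilistic SVM per descriptor whose class posteriors are combined with adaptive weights; the scikit-learn sketch below shows the pattern, with the weight-adaptation step (normally done on a validation split) left as an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def fit_per_feature_svms(feature_sets, y):
    """One probabilistic SVM per descriptor (e.g., PHOW, PHOC, PHOG)."""
    return [SVC(kernel="rbf", probability=True).fit(X, y) for X in feature_sets]

def weighted_decision_fusion(svms, feature_sets, weights):
    """Decision-level fusion: weighted sum of per-descriptor class posteriors.
    `weights` (summing to 1) would be adapted on held-out data."""
    probs = sum(w * svm.predict_proba(X)
                for w, svm, X in zip(weights, svms, feature_sets))
    return probs.argmax(axis=1)
```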
Because the hydraulic directional valve usually works in a harsh environment and is disturbed by multi-factor noise, traditional single-sensor monitoring technology struggles to diagnose it accurately. Therefore, a fault diagnosis method based on multi-sensor information fusion is proposed in this paper to reduce the inaccuracy and uncertainty of traditional single-sensor diagnosis and to realize accurate monitoring, localization, and diagnosis of early faults in such valves in noisy environments. Firstly, the statistical features of the signals collected by the multiple sensors are extracted, and depth features are obtained by a convolutional neural network (CNN) to form a complete and stable multi-dimensional feature set. Secondly, the multi-dimensional feature sets of similar sensors are combined, and the entropy weight method is used to weight these features, reducing the interference of insensitive features and yielding a weighted multi-dimensional feature set. Finally, an attention mechanism is introduced to improve the dual-channel CNN, which adaptively fuses the weighted multi-dimensional feature sets of the heterogeneous sensors and flexibly selects heterogeneous sensor information so as to achieve an accurate diagnosis. Experimental results show that the weighted multi-dimensional feature set obtained by the proposed method has high fault-representation ability and low information redundancy. The method can simultaneously diagnose internal wear faults of the hydraulic directional valve and electromagnetic faults of actuators that are difficult to diagnose by traditional methods, and it achieves high fault-diagnosis accuracy under severe working conditions.
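The entropy weight method assigns larger weights to features that carry more discriminative variation across samples; below is a NumPy sketch of the standard formulation, with the min-max scaling step an assumption about preprocessing.

```python
import numpy as np

def entropy_weights(X: np.ndarray) -> np.ndarray:
    """Entropy weight method. X: (n_samples, n_features).
    Low-entropy (high-variation) features receive larger weights."""
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)  # to [0, 1]
    P = X / (X.sum(axis=0) + 1e-12)                   # column-wise distributions
    n = X.shape[0]
    E = -np.sum(P * np.log(P + 1e-12), axis=0) / np.log(n)  # entropy in [0, 1]
    d = 1.0 - E                                       # degree of diversity
    return d / d.sum()                                # normalized feature weights
```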
Human Action Recognition (HAR) has been an active research topic in machine learning for the last few decades. Visual surveillance, robotics, and pedestrian detection are the main applications of action recognition. Computer vision researchers have introduced many HAR techniques, but these still face challenges such as redundant features and the cost of computing. In this article, we propose a new deep learning method for HAR. In the proposed method, video frames are first pre-processed using a global contrast approach and later used to train a deep learning model via domain transfer learning. The pre-trained ResNet-50 model is used as the deep learning model in this work. Features are extracted from two layers, the Global Average Pool (GAP) and the Fully Connected (FC) layer, and fused by Canonical Correlation Analysis (CCA). Features are then selected using a Shannon entropy-based threshold function, and the selected features are finally passed to multiple classifiers for final classification. Experiments are conducted on five publicly available datasets: IXMAS, UCF Sports, YouTube, UT-Interaction, and KTH. The accuracies on these datasets were 89.6%, 99.7%, 100%, 96.7%, and 96.6%, respectively. Comparison with existing techniques shows that the proposed method provides improved accuracy for HAR and is computationally fast in terms of execution time.
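The CCA fusion and the entropy threshold can be sketched with scikit-learn and NumPy as below; the component count, histogram binning, and quantile cutoff are illustrative assumptions about how the threshold function is realized.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fuse(X_gap: np.ndarray, X_fc: np.ndarray, n_components: int = 64):
    """Project the two deep-feature sets into a shared correlated space, then
    fuse by concatenation (n_components must not exceed either feature size)."""
    Xc, Yc = CCA(n_components=n_components).fit_transform(X_gap, X_fc)
    return np.hstack([Xc, Yc])

def entropy_select(F: np.ndarray, n_bins: int = 16, quantile: float = 0.5):
    """Keep features whose Shannon entropy (over a histogram of their values)
    exceeds the chosen quantile -- one plausible reading of the threshold rule."""
    ent = []
    for col in F.T:
        p, _ = np.histogram(col, bins=n_bins)
        p = p / p.sum()
        ent.append(-np.sum(p[p > 0] * np.log2(p[p > 0])))
    ent = np.array(ent)
    return F[:, ent >= np.quantile(ent, quantile)]
```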
An improved model based on You Only Look Once version 8 (YOLOv8) is proposed to solve the problem of low detection accuracy caused by the diversity of object sizes in optical remote sensing images. Firstly, the feature pyramid network (FPN) structure of the original YOLOv8 model is replaced by the generalized FPN (GFPN) structure from GiraffeDet to realize "cross-layer" and "cross-scale" adaptive feature fusion, enriching the semantic and spatial information of the feature maps and improving the target detection ability of the model. Secondly, a pyramid pooling module of multi-atrous spatial pyramid pooling (MASPP) is designed using the ideas of atrous convolution and the feature pyramid structure to extract multi-scale features, so as to improve the model's ability to process multi-scale objects. The experimental results show that the detection accuracy and mean average precision (mAP) of the improved YOLOv8 model on the DIOR dataset are 92% and 87.9%, respectively 3.5% and 1.7% higher than those of the original model, demonstrating improved detection and classification of multi-scale optical remote sensing targets.
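An ASPP-style module of the kind named above runs parallel atrous convolutions at increasing dilation rates and merges the branches; the PyTorch sketch below shows this generic structure, with the dilation rates and projection choice as assumptions rather than the paper's exact MASPP design.

```python
import torch
import torch.nn as nn

class MASPP(nn.Module):
    """Multi-atrous spatial pyramid pooling (sketch): parallel 3x3 convolutions
    with growing dilation rates widen the receptive field for multi-scale
    objects; a 1x1 convolution projects the concatenated branches back down."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates  # padding == dilation keeps the spatial size fixed
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```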