Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62376089, 62302153, 62302154, 62202147) and the Key Research and Development Program of Hubei Province, China (Grant No. 2023BEB024).
Abstract: The world produces vast quantities of high-dimensional multi-semantic data. However, extracting valuable information from such large volumes of high-dimensional, multi-label data is arduous. Feature selection aims to mitigate the adverse impacts of high dimensionality in multi-label data by eliminating redundant and irrelevant features. The ant colony optimization algorithm has demonstrated encouraging outcomes in multi-label feature selection because of its simplicity, efficiency, and similarity to reinforcement learning. Nevertheless, existing methods do not consider crucial correlation information, such as dynamic redundancy and label correlation. To address these concerns, the paper proposes a multi-label feature selection technique based on the ant colony optimization algorithm (MFACO), focusing on dynamic redundancy and label correlation. First, the dynamic redundancy between the selected feature subset and the candidate features is assessed. Meanwhile, the ant colony optimization algorithm extracts label correlation from the label set, which is then incorporated into the heuristic factor as label weights. Experimental results demonstrate that the proposed strategies effectively enhance the search ability of the ant colony, outperforming the other algorithms compared in the paper.
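The abstract does not give MFACO's formulas, so the following is only a minimal sketch of how a redundancy-aware heuristic could enter the standard ant colony transition rule (probability proportional to pheromone^alpha times heuristic^beta). The heuristic form relevance / (1 + mean redundancy), the function names, and the dictionary layout are all assumptions made for illustration, not the paper's definitions.

```python
def weighted_relevance(per_label_relevance, label_weights):
    """Combine one feature's per-label relevance scores using label weights.

    Assumption: label correlation has already been turned into one weight per label.
    """
    return sum(label_weights[label] * score
               for label, score in per_label_relevance.items())


def transition_probabilities(pheromone, relevance, redundancy, selected,
                             alpha=1.0, beta=1.0):
    """Probability that an ant picks each not-yet-selected feature.

    pheromone:  dict feature -> pheromone level (> 0)
    relevance:  dict feature -> label-weighted relevance (>= 0)
    redundancy: dict (f, g) -> redundancy in [0, 1], keyed by the sorted pair
    selected:   features already in the ant's subset
    """
    weights = {}
    for f in pheromone:
        if f in selected:
            continue
        if selected:
            # "Dynamic" redundancy: recomputed against the current subset.
            red = sum(redundancy[tuple(sorted((f, s)))] for s in selected)
            red /= len(selected)
        else:
            red = 0.0
        eta = relevance[f] / (1.0 + red)  # hypothetical heuristic factor
        weights[f] = (pheromone[f] ** alpha) * (eta ** beta)
    if not weights:
        return {}
    total = sum(weights.values())
    if total == 0:
        return {f: 1.0 / len(weights) for f in weights}
    return {f: w / total for f, w in weights.items()}
```

Because the redundancy term depends on the ant's current subset, a candidate that duplicates an already chosen feature becomes less attractive as the subset grows, which is the behaviour the abstract describes.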
Funding: Supported by the Fundamental Research Program of Shanxi Province (Nos. 202203021211088, 202403021212254, 202403021221109) and the Graduate Research Innovation Project in Shanxi Province (No. 2024KY616).
Abstract: Data collected in fields such as cybersecurity and biomedicine often exhibit high dimensionality and class imbalance. To address the low classification accuracy for minority-class samples caused by numerous irrelevant and redundant features in high-dimensional imbalanced data, we propose a novel feature selection method named AMF-SGSK, based on adaptive multi-filter and subspace-based gaining sharing knowledge. First, a balanced dataset is obtained by random under-sampling. Second, by combining the feature importance score with the AUC score for each filter method, we introduce a concept called feature hardness to judge the importance of each feature, which allows the essential features to be selected adaptively. Finally, the optimal feature subset is obtained by gaining sharing knowledge in multiple subspaces. This approach effectively achieves dimensionality reduction for high-dimensional imbalanced data. Experimental results on 30 benchmark imbalanced datasets show that AMF-SGSK performs better than eight other commonly used algorithms, including BGWO and IG-SSO, in terms of F1-score, AUC, and G-mean. The mean values of F1-score, AUC, and G-mean for AMF-SGSK are 0.950, 0.967, and 0.965, respectively, the highest among all algorithms. The mean G-mean is higher than those of IG-PSO, ReliefF-GWO, and BGOA by 3.72%, 11.12%, and 20.06%, respectively. Furthermore, the selected feature ratio is below 0.01 across the ten selected datasets, further demonstrating the proposed method’s overall superiority over competing approaches. AMF-SGSK can adaptively remove irrelevant and redundant features and effectively improve the classification accuracy of high-dimensional imbalanced data, providing a scientific and technological reference for practical applications.
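Two building blocks named in this abstract, random under-sampling and the G-mean metric, are easy to illustrate. The sketch below is not AMF-SGSK itself: it assumes the usual binary definition of G-mean (square root of sensitivity times specificity) and balances classes by sampling each class down to the size of the smallest one. Function names are hypothetical.

```python
import math
import random


def random_under_sample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)
```

G-mean is a natural metric here because it collapses to zero whenever either class is entirely misclassified, so a classifier cannot score well by ignoring the minority class.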
Funding: Natural Science Foundation of China (Grant Nos. 61473237, 61202170, and 61402331); Shaanxi Provincial Natural Science Foundation Research Project (2014JM2-6096); Tianjin Research Program of Application Foundation and Advanced Technology (14JCYBJC42500); Tianjin science and technology correspondent project (16JCTPJC47300); the 2015 key projects of the Tianjin science and technology support program (No. 15ZCZDGX00200); and the Fund of Tianjin Food Safety & Low Carbon Manufacturing Collaborative Innovation Center.
Abstract: Apple leaf disease is one of the main factors constraining apple production and quality. Detecting the diseases with the traditional diagnostic approach takes a long time, so farmers often miss the best time to prevent and treat them. Apple leaf disease recognition based on leaf images is an essential research topic in computer vision, where the key task is to find an effective way to represent the diseased leaf images. In this research, an apple leaf disease recognition method is proposed based on image processing techniques and pattern recognition methods. A color transformation structure for the input RGB (Red, Green and Blue) image was designed first, and the RGB model was then converted to HSI (Hue, Saturation and Intensity), YUV and gray models. The background was removed based on a specific threshold value, and the disease spot image was then segmented with a region growing algorithm (RGA). Thirty-eight classifying features of color, texture and shape were extracted from each spot image. To reduce the dimensionality of the feature space and improve the accuracy of apple leaf disease identification, the most valuable features were selected by combining a genetic algorithm (GA) and correlation-based feature selection (CFS). Finally, the diseases were recognized by an SVM classifier. In the proposed method, the selected feature subset was globally optimal. The experimental results, a correct identification rate of more than 90% on an apple diseased leaf image database containing 90 disease images for three kinds of apple leaf diseases (powdery mildew, mosaic and rust), demonstrate that the proposed method is feasible and effective.
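The abstract does not state which conversion formulas were used. As an illustration of the color transformation step, the sketch below uses the common textbook RGB-to-HSI definition and the ITU-R BT.601 luma weights for the gray model; both choices are assumptions, and inputs are taken to be scaled to [0, 1].

```python
import math


def rgb_to_hsi(r, g, b):
    """Convert RGB in [0, 1] to (hue in degrees, saturation, intensity)."""
    intensity = (r + g + b) / 3.0
    if intensity == 0:
        return 0.0, 0.0, 0.0  # black: hue and saturation set to 0 by convention
    saturation = 1.0 - min(r, g, b) / intensity
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        hue = 0.0  # gray pixel: hue is undefined, 0 by convention
    else:
        # Clamp to guard against rounding slightly outside [-1, 1].
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        hue = 360.0 - theta if b > g else theta
    return hue, saturation, intensity


def rgb_to_gray(r, g, b):
    """Gray level using BT.601 luma weights (an assumed choice)."""
    return 0.299 * r + 0.587 * g + 0.114 * b
```

Separating hue from intensity is what makes a fixed threshold on one channel usable for background removal under uneven lighting, which is presumably why several color models are computed before segmentation.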
Funding: Supported by the Science and Technology Development Fund, Macao SAR, China (File Nos. 0057/2020/AGJ and SKL-IOTSC-2021-2023) and the Science and Technology Program of Guangdong Province, China (Grant No. 2021A0505080009).
Abstract: Accurate prediction of shield tunneling-induced settlement is a complex problem that requires consideration of many influential parameters. Recent studies reveal that machine learning (ML) algorithms can predict the settlement caused by tunneling. However, well-performing ML models are usually less interpretable, and irrelevant input features decrease both the performance and the interpretability of an ML model. Nonetheless, feature selection, a critical step in the ML pipeline, is ignored in most studies on predicting tunneling-induced settlement. This study applies four techniques, i.e. the Pearson correlation method, sequential forward selection (SFS), sequential backward selection (SBS) and the Boruta algorithm, to investigate the effect of feature selection on the model’s performance when predicting the tunneling-induced maximum surface settlement (S_max). The data set used in this study was compiled from two metro tunnel projects excavated in Hangzhou, China using earth pressure balance (EPB) shields and consists of 14 input features and a single output (i.e. S_max). The ML model trained on features selected by the Boruta algorithm demonstrates the best performance in both the training and testing phases. The relevant features chosen by the Boruta algorithm further indicate that tunneling-induced settlement is affected by parameters related to tunnel geometry, geological conditions and shield operation. The recently proposed Shapley additive explanations (SHAP) method is used to explore how the input features contribute to the output of a complex ML model. It is observed that larger settlements are induced during shield tunneling in silty clay. Moreover, the SHAP analysis reveals that low face pressure at the top of the shield increases the model’s output.
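Of the four techniques listed, the Pearson correlation method is the simplest to show. The sketch below ranks features by the absolute Pearson correlation with the target and keeps those above a threshold; the threshold value, the function names, and the dictionary layout are assumptions for illustration, and the study's actual cut-off is not given in the abstract.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(vx * vy)


def pearson_filter(features, target, threshold=0.5):
    """Keep features whose |r| with the target is at least `threshold`.

    features: dict mapping a feature name to its column of values.
    Returns the kept names, strongest correlation first.
    """
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    kept = [name for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda name: scores[name], reverse=True)
```

A filter like this only captures linear, one-feature-at-a-time relationships, which is one reason wrapper methods such as SFS, SBS and Boruta can select a different subset.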
Abstract: Finding effective cancer treatment is a challenge, because the sensitivity of the cancer stems from both intrinsic cellular properties and acquired resistances from prior treatment. Previous research has revealed individual protein markers that are significant to chemosensitivity prediction. Our goal is to find correlated protein markers which are collectively significant to chemosensitivity prediction to complement the individual markers already reported. To do this, we used the D’ correlation measurement to study the feature selection correlations for chemosensitivity prediction of 118 anticancer agents with putatively known mechanisms of action. Three datasets on the NCI-60 were utilized in this study: two protein datasets, one previously studied for chemosensitivity prediction and another novel to this topic, and one DNA copy number dataset. To validate our approach, we identified the protein markers that were strongly correlated by our analysis with the individual protein markers found in previous studies. Our feature analysis discovered highly correlated protein marker pairs, based on which we found individual protein markers with medical significance. While some of the markers uncovered were consistent with those previously reported, others were original to this work. Using these marker pairs we were able to further correlate the cellular functions associated with them. As an exploratory analysis, we discovered feature selection correlation patterns between and within different drug mechanisms of action for each of our datasets. In conclusion, the highly correlated protein marker pairs as well as their functions found by our feature analysis are validated by previous studies, and are shown to be medically significant, demonstrating D’ as an effective measurement of correlation in the context of feature selection for the first time.
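The abstract does not define D’. Assuming it denotes the normalized linkage disequilibrium coefficient from population genetics, applied to binary indicators of whether each marker was selected for each drug, it could be computed as in the sketch below. Both that reading and the signed convention used here are assumptions.

```python
def d_prime(a, b):
    """Signed D' between two equal-length binary (0/1) indicator sequences.

    D = P(a=1, b=1) - P(a=1) * P(b=1), divided by the largest magnitude D
    could take given the two marginal frequencies.
    """
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # one indicator is constant: no association measurable
    return d / d_max
```

Under this reading, a value near 1 means two markers tend to be selected for the same drugs as often as their individual selection frequencies allow, and a value near -1 means they are selected for different drugs.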
Abstract: Software Defined Networking (SDN) has emerged as a promising and exciting option for the future growth of the internet. SDN has increased the flexibility and transparency of the managed, centralized, and controlled network. On the other hand, these advantages create a more vulnerable environment with substantial risks, culminating in network difficulties, system paralysis, online banking fraud, and robberies. These issues have a significant detrimental impact on organizations, enterprises, and even economies. Accurate, high-performance, real-time systems are necessary to counter these threats. Using SDN to extend intelligent machine learning methodologies in an Intrusion Detection System (IDS) has attracted the interest of numerous researchers over the last decade. In this paper, a novel HFS-LGBM IDS is proposed for SDN. First, a Hybrid Feature Selection algorithm consisting of two phases is applied to reduce the data dimension and to obtain an optimal feature subset. In the first phase, the Correlation-based Feature Selection (CFS) algorithm is used to obtain a feature subset. In the second phase, the optimal feature set is obtained by applying Random Forest Recursive Feature Elimination (RF-RFE). A LightGBM algorithm is then used to detect and classify different types of attacks. The experimental results on the NSL-KDD dataset show that the proposed system produces outstanding results compared to existing methods in terms of accuracy, precision, recall and F-measure.
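The first phase, CFS, scores a subset of k features with the merit k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below pairs that merit with a greedy forward search. It uses absolute Pearson correlation as a stand-in for the correlation measure, which is an assumption (CFS is often used with symmetrical uncertainty on discretized data), and it does not cover the RF-RFE or LightGBM stages.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def cfs_merit(subset, features, target):
    """CFS merit of a non-empty list of feature names."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = list(combinations(subset, 2))
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward(features, target):
    """Greedy forward search: add the feature that most improves the merit."""
    selected, best = [], 0.0
    remaining = set(features)
    while remaining:
        cand, cand_merit = None, best
        for f in sorted(remaining):
            merit = cfs_merit(selected + [f], features, target)
            if merit > cand_merit:
                cand, cand_merit = f, merit
        if cand is None:  # no remaining feature improves the merit
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_merit
    return selected, best
```

The denominator grows with inter-feature correlation, so the search stops adding features once a candidate contributes more redundancy than relevance.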
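Abstract: Existing network intrusion detection methods overlook the importance of correlations among traffic features for feature selection and, when balancing data, fail to account for the dispersed distribution of low-frequency attack samples, which degrades detection performance. To address this, a mutual information value fusion (MIVF) method is proposed to select features that are highly correlated with attack behavior and weakly correlated with one another. A DBSCAN-improved SMOTE method is proposed to oversample low-frequency attack samples according to their density-based cluster distribution, and an SAE-MSCNN classification model is built to evaluate performance. On the NSL-KDD and UNSW-NB15 datasets, accuracy reaches 92.89% and 94.85%, respectively. The results show that the proposed method effectively selects features and balances data and, in particular, improves detection accuracy for low-frequency attacks.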
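The abstract does not specify how MIVF fuses mutual information values. The sketch below only illustrates the stated goal, high relevance to the label and low redundancy among selected features, using a generic greedy relevance-minus-redundancy score over discrete features. The scoring rule and function names are assumptions, not the paper's method.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi


def select_features(features, labels, k):
    """Greedily pick k features: MI with labels minus mean MI with chosen ones."""
    selected = []
    candidates = sorted(features)

    def score(f):
        relevance = mutual_information(features[f], labels)
        if not selected:
            return relevance
        redundancy = sum(mutual_information(features[f], features[s])
                         for s in selected) / len(selected)
        return relevance - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With this score, an exact copy of an already selected feature gains nothing, while a feature carrying complementary information about the label is preferred.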
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2020YFB1713300); the Guizhou Provincial Colleges and Universities Talent Training Base Project (Grant No. [2020]009); the Guizhou Province Science and Technology Plan Project (Grant Nos. [2015]4011, [2017]5788); the Guizhou Provincial Department of Education Youth Science and Technology Talent Growth Project (Grant No. [2022]142); the Scientific Research Project for Introducing Talents from Guizhou University (Grant No. (2021)74); and the Guizhou Province Higher Education Integrated Research Platform Project (Grant No. [2020]005).
Abstract: Anomaly detection is crucial to the flight safety and maintenance of unmanned aerial vehicles (UAVs) and has attracted extensive attention from scholars. Knowledge-based approaches rely on prior knowledge, while model-based approaches struggle to construct accurate and complex physical models of unmanned aerial systems (UASs). Although data-driven methods do not require extensive prior knowledge or accurate physical UAS models, they often lack parameter selection and are limited by the cost of labeling anomalous data. Furthermore, flight data with random noise pose a significant challenge for anomaly detection. This work proposes a data-driven neural network method, spatiotemporal correlation based on long short-term memory and autoencoder (STCLSTM-AE), for unsupervised anomaly detection and recovery of UAV flight data. First, UAV flight data are preprocessed with the Savitzky-Golay filter to mitigate the effect of noise in the original historical flight data on the model. Correlation-based feature subset selection is subsequently performed to reduce the reliance on expert knowledge. Then, the extracted features are used as the input of the designed LSTM-AE model to achieve anomaly detection and recovery of UAV flight data in an unsupervised manner. Finally, the method's effectiveness is validated on real UAV flight data.
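The Savitzky-Golay preprocessing step fits a low-order polynomial in a sliding window and replaces each point with the fitted value. The sketch below uses the classic 5-point quadratic/cubic smoothing kernel (-3, 12, 17, 12, -3) / 35. The window length and polynomial order are assumptions, since the abstract does not state them, and the two points at each end are simply left unsmoothed.

```python
# Classic 5-point Savitzky-Golay smoothing coefficients (quadratic/cubic fit).
SG_COEFFS = (-3, 12, 17, 12, -3)
SG_NORM = 35


def savgol_smooth5(values):
    """Smooth a sequence with a 5-point Savitzky-Golay filter.

    The first and last two samples are copied unchanged (assumed edge policy).
    """
    smoothed = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]  # always read from the original data
        smoothed[i] = sum(c * v for c, v in zip(SG_COEFFS, window)) / SG_NORM
    return smoothed
```

Unlike a plain moving average, this kernel reproduces any quadratic trend exactly in the interior, so smooth flight maneuvers are preserved while isolated noise is damped. In practice a library routine such as SciPy's `scipy.signal.savgol_filter` would normally be used instead of a hand-written kernel.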