Abstract: Feature selection (FS) is a pivotal pre-processing step in developing data-driven models, influencing reliability, performance, and optimization. Although existing FS techniques can yield high-performance metrics for certain models, they do not invariably guarantee the extraction of the most critical or impactful features. Prior literature underscores the significance of equitable FS practices and has proposed diverse methodologies for the identification of appropriate features. However, the challenge of discerning the most relevant and influential features persists, particularly in the context of the exponential growth and heterogeneity of big data, a challenge that is increasingly salient in modern artificial intelligence (AI) applications. In response, this study introduces an innovative, automated statistical method termed Farea Similarity for Feature Selection (FSFS). The FSFS approach computes a similarity metric for each feature by benchmarking it against the record-wise mean, thereby identifying feature dependencies and mitigating the influence of outliers that could otherwise distort evaluation outcomes. Features are subsequently ranked according to their similarity scores, with the selection threshold set at the average similarity score. Notably, lower FSFS values indicate higher similarity and stronger data correlations, whereas higher values suggest lower similarity. The FSFS method is designed not only to yield reliable evaluation metrics but also to reduce data complexity without compromising model performance. Comparative analyses were performed against several established techniques, including Chi-squared (CS), Correlation Coefficient (CC), Genetic Algorithm (GA), Exhaustive Approach, Greedy Stepwise Approach, Gain Ratio, and Filtered Subset Eval, using a variety of datasets: the Experimental Dataset, Breast Cancer Wisconsin (Original), KDD CUP 1999, NSL-KDD, UNSW-NB15, and Edge-IIoT. Without the FSFS method, the highest classifier accuracies observed were 60.00%, 95.13%, 97.02%, 98.17%, 95.86%, and 94.62% for the respective datasets. When the FSFS technique was integrated with data normalization, encoding, balancing, and feature importance selection processes, the resulting accuracies were 100.00%, 97.81%, 98.63%, 98.94%, 94.27%, and 98.46%, respectively. With a computational complexity of O(fn log n), the FSFS method demonstrates robust scalability and is well suited to large datasets, ensuring efficient processing even when the number of features is substantial. By automatically eliminating outliers and redundant data, FSFS reduces computational overhead, resulting in faster training and improved model performance. Overall, the FSFS framework not only optimizes performance but also enhances the interpretability and explainability of data-driven models, thereby facilitating more trustworthy decision-making in AI applications.
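The abstract specifies the selection rule precisely enough to sketch: score each feature against the record-wise mean, then keep the features whose scores fall at or below the average score (lower means more similar). The exact Farea similarity formula is not given above, so the sketch below substitutes a mean absolute deviation as a placeholder similarity; `fsfs_rank` and the toy data are illustrative, not the authors' implementation.

```python
import numpy as np

def fsfs_rank(X):
    """Sketch of the FSFS ranking scheme described in the abstract.

    Each feature is scored against the record-wise mean (the mean across
    features within each record). The exact Farea similarity formula is
    not given in the abstract, so mean absolute deviation stands in here.
    Lower scores indicate higher similarity; the threshold is the average.
    """
    X = np.asarray(X, dtype=float)
    record_mean = X.mean(axis=1, keepdims=True)    # per-record benchmark
    scores = np.abs(X - record_mean).mean(axis=0)  # one score per feature
    threshold = scores.mean()                      # threshold = average score
    selected = np.where(scores <= threshold)[0]    # lower = more similar
    return scores, selected

# Toy usage: 5 records x 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
scores, selected = fsfs_rank(X)
print("scores:", scores.round(3), "selected features:", selected)
```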
Funding: Supported by the National Natural Science Foundation of China (No. 51974042) and the National Key Research and Development Program of China (No. 2023YFC3009005).
Abstract: Ground hydraulic fracturing plays a crucial role in controlling the far-field hard roof, making it imperative to identify the most suitable target stratum for effective control. Physical experiments based on engineering properties are conducted to simulate the gradual collapse of the roof during longwall top coal caving (LTCC). A numerical model is established using the material point method (MPM) and a strain-softening damage constitutive model, following the structure of the physical model. Numerical simulations are then conducted to analyze the LTCC process under ground hydraulic fracturing of different hard roofs. The results show that ground hydraulic fracturing releases the energy and stress of the target stratum, resulting in a substantial lag in the fracturing of the overburden before collapse occurs in the hydraulically fractured stratum. Ground hydraulic fracturing of a lower hard roof reduces the lag effect of hydraulic fractures, dissipates the energy consumed by the fracture of the hard roof, and reduces the abutment stress. Therefore, it is advisable to prioritize the lower hard roof as the target stratum.
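The strain-softening damage constitutive model can be illustrated with a generic softening law: strength parameters decay from peak to residual values as plastic strain accumulates. The linear decay below is a common textbook form with invented parameter values, not necessarily the law calibrated in the paper.

```python
def softened_cohesion(eps_p, c_peak=3.0e6, c_res=0.5e6, eps_crit=0.01):
    """Generic linear strain-softening law: cohesion (Pa) drops from its
    peak to a residual value as equivalent plastic strain accumulates,
    then stays at the residual level."""
    if eps_p >= eps_crit:
        return c_res
    return c_peak + (c_res - c_peak) * eps_p / eps_crit

for eps in (0.0, 0.005, 0.02):
    print(f"plastic strain {eps}: cohesion {softened_cohesion(eps):.2e} Pa")
```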
Abstract: To lower the amylose content (AC) of the indica rice restorer line 057, which has a high AC, backcrosses were made using four indica varieties (R367, 91499, Yanhui 559, and Hui 527) as low-AC donor parents and 057 as the recurrent parent. A molecular marker (PCR-Acc Ⅰ) was used to identify the genotypes (GG, TT, and GT) of the waxy (Wx) gene. Plants with the GT genotype were selected, used as female parents, and crossed with 057 to advance the generations. The ACs of rice grains harvested from plants with different Wx genotypes were measured and compared to analyze the efficiency of marker-assisted selection. The ACs of rice grains harvested from plants of the Wx genotypes GG, GT, and TT were higher than 20%, in the range of 17.7-28.5%, and less than 18%, respectively. The PCR-Acc Ⅰ marker could thus be used to efficiently lower the AC of 057 through backcrossing, although the parental genetic background had some influence on the AC of rice grains with the same Wx genotype.
Funding: Funded by the Soft Science of Anhui Province (Grant No. 1302053004).
Abstract: More and more enterprises are outsourcing activities that are neither cost-efficient when done in-house nor central to their businesses. Most studies of outsourcing decision making focus on vendor selection; however, little research has been done on location selection, which is also a critical step in offshore service outsourcing. The purpose of this paper is to offer a new method for the destination selection problem in China. We employed the additive SE-DEA model to overcome the drawbacks of traditional DEA and SE-DEA methods and calculated the relative efficiency of 20 service outsourcing model cities (excluding Xiamen). Based on two years of longitudinal data, we compared the 20 cities. Finally, we classified the model cities along the service outsourcing ability dimension and offered selection suggestions for outsourcers and development suggestions for the model cities, respectively.
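To make the additive SE-DEA idea concrete, the sketch below scores one decision-making unit (city) against a frontier built from the remaining units, so that units on the frontier can still be discriminated (super-efficiency). It follows the common additive super-efficiency formulation and may differ in detail from the paper's model; the city data are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def additive_se_dea(X, Y, k):
    """Additive super-efficiency DEA score for DMU k (a sketch).

    Minimizes the total input/output adjustment (slack) needed for DMU k
    to be enveloped by the frontier formed by the *other* DMUs.
    X: (n_dmus, n_inputs), Y: (n_dmus, n_outputs)."""
    n, m = X.shape
    s = Y.shape[1]
    others = [j for j in range(n) if j != k]
    nl = len(others)
    # Decision vector: [lambda (nl), t_minus (m), t_plus (s)], all >= 0
    c = np.concatenate([np.zeros(nl), np.ones(m + s)])
    A_ub, b_ub = [], []
    for i in range(m):   # sum_j lam_j * x_ij - t_i^- <= x_ik
        A_ub.append(np.concatenate([X[others, i], -np.eye(m)[i], np.zeros(s)]))
        b_ub.append(X[k, i])
    for r in range(s):   # -sum_j lam_j * y_rj - t_r^+ <= -y_rk
        A_ub.append(np.concatenate([-Y[others, r], np.zeros(m), -np.eye(s)[r]]))
        b_ub.append(-Y[k, r])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * (nl + m + s), method="highs")
    return res.fun  # 0 for enveloped DMUs; > 0 signals super-efficiency

# Toy data: 5 cities, 2 inputs, 1 output (illustrative numbers only)
X = np.array([[4., 3.], [6., 2.], [5., 5.], [8., 4.], [3., 6.]])
Y = np.array([[10.], [12.], [9.], [15.], [8.]])
for k in range(len(X)):
    print(f"city {k}: slack score = {additive_se_dea(X, Y, k):.3f}")
```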
Abstract: A multiobjective quality of service (QoS) routing algorithm was proposed and used as the QoS-aware path selection approach in differentiated services and multi-protocol label switching (DiffServ-MPLS) networks. It simultaneously optimizes multiple QoS objectives with a genetic algorithm in conjunction with the concept of Pareto dominance. The simulation demonstrates that the proposed algorithm is capable of discovering a set of QoS-based near-optimal paths within a few iterations. The simulation results also show the scalability of the algorithm with an increasing number of network nodes.
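The Pareto-dominance concept the algorithm relies on is easy to state in code. The sketch below assumes three minimization objectives (delay, jitter, and loss; the names are chosen for illustration) and filters a candidate path set down to its non-dominated subset, the core operation a multiobjective GA repeats each generation.

```python
from typing import List, Tuple

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if cost vector a Pareto-dominates b (all objectives minimized):
    a is no worse in every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(paths: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Reduce a candidate set of QoS cost vectors to its non-dominated set."""
    return [p for p in paths
            if not any(dominates(q, p) for q in paths if q != p)]

# Toy candidates: (delay_ms, jitter_ms, loss_pct) per path
candidates = [(30, 5, 0.1), (25, 7, 0.2), (40, 4, 0.05), (35, 8, 0.3)]
print(pareto_front(candidates))  # (35, 8, 0.3) is dominated and dropped
```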
Funding: Supported in part by the National Key Research and Development Program of China under Grant No. 2021YFB3300900, the NSFC Key Supported Project of the Major Research Plan under Grant No. 92267206, the National Natural Science Foundation of China under Grant Nos. 72201052, 62032013, and 62173076, the Fundamental Research Funds for the Central Universities under Grant No. N2204017, and the Fundamental Research Funds for the State Key Laboratory of Synthetical Automation for Process Industries under Grant No. 2013ZCX11.
Abstract: For data mining tasks on large-scale data, feature selection is a pivotal stage that plays an important role in removing redundant or irrelevant features while improving classifier performance. Traditional wrapper feature selection methodologies typically require extensive model training and evaluation, which cannot deliver the desired outcomes within a reasonable computing time. In this paper, an innovative wrapper approach termed Contribution Tracking Feature Selection (CTFS) is proposed for feature selection on large-scale data; it can locate informative features without population-level evolution. In other words, CTFS needs fewer evaluations than other evolutionary methods. We first introduce a refined sparse autoencoder to assess the prominence of each feature for the subsequent wrapper method. We then apply an enhanced wrapper feature selection technique that merges mutual information (MI) with individual feature contributions. Finally, a fine-tuning contribution tracking mechanism discerns informative features within the optimal feature subset via a dominance accumulation mechanism. Experimental results for multiple classification performance metrics demonstrate that, compared to state-of-the-art algorithms across most large-scale benchmark datasets, the proposed method yields smaller feature subsets without degrading classification performance and within an acceptable runtime.
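CTFS itself pairs a sparse autoencoder with contribution tracking, which is beyond a short sketch, but its mutual information ingredient can be shown directly. The snippet below is a minimal sketch using scikit-learn on synthetic data: it ranks features by MI with the label; in CTFS this signal would be merged with per-feature contribution scores rather than used alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for a large-scale dataset
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=8, random_state=0)

# Mutual information between each feature and the label: the MI
# ingredient that CTFS merges with per-feature contribution scores
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:10]  # keep the 10 highest-MI features
print("top features by MI:", top)
```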
Funding: Funded by the Ministry of Higher Education of Malaysia, grant number FRGS/1/2022/ICT02/UPSI/02/1.
Abstract: In recent years, feature selection (FS) optimization of high-dimensional gene expression data has become one of the most promising approaches for cancer prediction and classification. This work reviews FS and classification methods that utilize evolutionary algorithms (EAs) for gene expression profiles in cancer or medical applications, organized by research motivations, challenges, and recommendations. Relevant studies were retrieved from four major academic databases (IEEE, Scopus, Springer, and ScienceDirect) using the keywords 'cancer classification', 'optimization', 'FS', and 'gene expression profile'. A total of 67 papers were finally selected, with key advancements identified as follows: (1) The majority of papers (44.8%) focused on developing algorithms and models for FS and classification. (2) The second category encompassed studies on biomarker identification by EAs, comprising 20 papers (30%). (3) The third category comprised works that applied FS to cancer data for decision support system purposes, addressing high-dimensional data and the formulation of chromosome length; these studies accounted for 12% of the total. (4) The remaining three papers (4.5%) were reviews and surveys focusing on models and developments in prediction and classification optimization for cancer classification under current technical conditions. This review highlights the importance of optimizing FS in EAs to manage high-dimensional data effectively. Despite recent advancements, significant limitations remain: in particular, the dynamic formulation of chromosome length is an underexplored area. Further research is therefore needed on dynamic-length chromosome techniques for more sophisticated biomarker gene selection. The findings suggest that advances in dynamic chromosome-length formulations and adaptive algorithms could enhance cancer classification accuracy and efficiency.
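As a concrete illustration of the dynamic chromosome-length gap the review highlights, the sketch below encodes a GA individual as a variable-length set of gene indices, with a mutation operator that can grow or shrink the selected subset. The representation and operators are illustrative assumptions, not taken from any reviewed paper.

```python
import random

N_GENES = 2000  # genes in the expression profile (illustrative)

def random_chromosome(min_len=5, max_len=50):
    """A variable-length chromosome: a set of selected gene indices."""
    k = random.randint(min_len, max_len)
    return set(random.sample(range(N_GENES), k))

def mutate(chrom, p_add=0.3, p_drop=0.3):
    """Grow or shrink the chromosome, so its length adapts over
    generations instead of being fixed in advance."""
    chrom = set(chrom)
    if random.random() < p_add:
        chrom.add(random.randrange(N_GENES))
    if random.random() < p_drop and len(chrom) > 1:
        chrom.discard(random.choice(sorted(chrom)))
    return chrom

random.seed(0)
c = random_chromosome()
print("length before:", len(c), "after mutation:", len(mutate(c)))
```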
Abstract: The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches have led to the use of artificial intelligence (AI) techniques, such as machine learning (ML) and deep learning (DL), to build more efficient and reliable intrusion detection systems (IDSs). However, the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs. Many researchers have used data preprocessing techniques such as feature selection and normalization to overcome such issues. While most of these researchers reported the success of these preprocessing techniques at a shallow level, very few studies have examined their effects on a wider scale. Furthermore, the performance of an IDS model depends not only on the preprocessing techniques utilized but also on the dataset and the ML/DL algorithm used, a dependence that most existing studies give little emphasis. Thus, this study provides an in-depth analysis of the effects of feature selection and normalization on IDS models built using three IDS datasets (NSL-KDD, UNSW-NB15, and CSE-CIC-IDS2018) and various AI algorithms. A wrapper-based approach, which tends to give superior performance, and min-max normalization were used for feature selection and normalization, respectively. Numerous IDS models were implemented using the full and feature-selected copies of the datasets, with and without normalization. The models were evaluated using popular evaluation metrics in IDS modeling, and intra- and inter-model comparisons were performed between models and against state-of-the-art works. Random forest (RF) models performed best on the NSL-KDD and UNSW-NB15 datasets, with accuracies of 99.86% and 96.01%, respectively, whereas an artificial neural network (ANN) achieved the best accuracy, 95.43%, on the CSE-CIC-IDS2018 dataset. The RF models also achieved excellent performance compared to recent works. The results show that normalization and feature selection positively affect IDS modeling. Furthermore, while feature selection benefits simpler algorithms (such as RF), normalization is more useful for complex algorithms like ANNs and deep neural networks (DNNs), and algorithms such as Naive Bayes are unsuitable for IDS modeling. The study also found that the UNSW-NB15 and CSE-CIC-IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDSs than the NSL-KDD dataset. Our findings suggest that prioritizing robust algorithms like RF, alongside complex models such as ANN and DNN, can significantly enhance IDS performance. These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.
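The two preprocessing steps under study, min-max normalization and wrapper-based feature selection, can be sketched with standard scikit-learn components. The pipeline below uses synthetic data as a stand-in for the IDS datasets and sequential forward selection wrapped around a random forest; the study's own wrapper search and hyperparameters may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for an IDS dataset (the study used NSL-KDD, etc.)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
pipe = Pipeline([
    ("minmax", MinMaxScaler()),               # min-max normalization
    ("wrapper", SequentialFeatureSelector(    # wrapper-based FS: the
        rf, n_features_to_select=8, cv=3)),   # model drives the search
    ("clf", rf),
])
pipe.fit(X_tr, y_tr)
print("accuracy:", pipe.score(X_te, y_te))
```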
Funding: Supported by the National Natural Science Foundation of China (42250101) and the Macao Foundation.
Abstract: Earth's internal core and crustal magnetic fields, as measured by geomagnetic satellites like MSS-1 (Macao Science Satellite-1) and Swarm, are vital for understanding core dynamics and tectonic evolution. To model these internal magnetic fields accurately, data selection based on specific criteria is often employed to minimize the influence of rapidly changing current systems in the ionosphere and magnetosphere. However, the quantitative impact of various data selection criteria on internal geomagnetic field modeling is not well understood. This study aims to address this issue and provide a reference for constructing and applying geomagnetic field models. First, we collect the latest MSS-1 and Swarm satellite magnetic data and summarize widely used data selection criteria in geomagnetic field modeling. Second, we briefly describe the method to co-estimate the core, crustal, and large-scale magnetospheric fields using satellite magnetic data. Finally, we conduct a series of field modeling experiments with different data selection criteria to quantitatively estimate their influence. Our numerical experiments confirm that, without selecting data from dark regions and geomagnetically quiet times, the resulting internal field differences at the Earth's surface can range from tens to hundreds of nanotesla (nT). Additionally, we find that the uncertainties introduced into field models by different data selection criteria are significantly larger than the measurement accuracy of modern geomagnetic satellites. These uncertainties should be considered when utilizing constructed magnetic field models for scientific research and applications.
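A minimal sketch of the kind of selection criteria being compared: keep only records from geomagnetically quiet times and dark regions. The thresholds below (Kp ≤ 2, |dDst/dt| ≤ 2 nT/h, solar zenith angle ≥ 100°) are common choices in field modeling, not necessarily the paper's, and the column names are hypothetical.

```python
import pandas as pd

def select_quiet_dark(df, kp_max=2.0, dst_rate_max=2.0, sza_min=100.0):
    """Select satellite magnetic records from geomagnetically quiet
    times and dark regions -- the kind of criteria whose influence the
    study quantifies. Thresholds are common choices in field modeling,
    not necessarily those used in the paper. Expected (hypothetical)
    columns: Kp, dDst_dt [nT/h], solar_zenith_angle [deg]."""
    quiet = (df["Kp"] <= kp_max) & (df["dDst_dt"].abs() <= dst_rate_max)
    dark = df["solar_zenith_angle"] >= sza_min  # sun well below horizon
    return df[quiet & dark]

# Toy records: rows 0 and 3 survive the filter
df = pd.DataFrame({
    "Kp": [1.0, 3.3, 0.7, 2.0],
    "dDst_dt": [0.5, 4.0, -1.0, 1.5],
    "solar_zenith_angle": [120.0, 130.0, 80.0, 105.0],
})
print(select_quiet_dark(df))
```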
Funding: Supported by the National Natural Science Foundation of China (32160782 and 32060737).
Abstract: The principle of genomic selection (GS) entails estimating breeding values (BVs) by summing all the SNP polygenic effects. Visible/near-infrared spectroscopy (VIS/NIRS) wavelength and abundance values can directly reflect the concentrations of chemical substances, and measuring meat traits by VIS/NIRS resembles the processing of genomic selection data, in that it sums all the 'polygenic effects' associated with spectral feature peaks. It is therefore worthwhile to investigate the incorporation of VIS/NIRS information into GS models to establish an efficient and low-cost breeding model. In this study, we measured 6 meat quality traits in 359 Duroc×Landrace×Yorkshire pigs from Guangxi Zhuang Autonomous Region, China, and genotyped them with high-density SNP chips. According to the completeness of the information for the target population, we proposed 4 breeding strategies applied to different scenarios: Ⅰ, only spectral and genotypic data exist for the target population; Ⅱ, only spectral data exist for the target population; Ⅲ, only spectral and genotypic data exist for the target population, but with different prediction processes; and Ⅳ, only spectral and phenotypic data exist for the target population. The 4 scenarios were used to evaluate the accuracy of the genomic estimated breeding value (GEBV) when VIS/NIR spectral information is added. In the 5-fold cross-validation results, the genetic algorithm showed remarkable potential for preselection of feature wavelengths. The breeding efficiency of Strategies Ⅱ, Ⅲ, and Ⅳ was superior to that of traditional GS for most traits, and the GEBV prediction accuracy improved by 32.2, 40.8, and 15.5% on average, respectively. Among them, the prediction accuracy of Strategy Ⅱ for fat (%) improved by as much as 50.7% compared to traditional GS. The GEBV prediction accuracy of Strategy Ⅰ was nearly identical to that of traditional GS, with a fluctuation range of less than 7%. Moreover, the breeding cost of all 4 strategies was lower than that of traditional GS methods, with Strategy Ⅳ the lowest, as it does not require genotyping. Our findings demonstrate that GS methods based on VIS/NIRS data have significant predictive potential and merit further research, providing a valuable reference for the development of effective and affordable breeding strategies.
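The "sum of polygenic effects" view translates directly into a linear marker model. The sketch below estimates SNP (and, by analogy, spectral-band) effects with ridge regression, an rrBLUP-like shrinkage estimator, and sums them into GEBVs; the data are synthetic and the strategy-specific prediction pipelines described above are not reproduced.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, n_snp, n_band = 300, 1000, 200

# Synthetic genotypes (0/1/2), spectra, and a phenotype driven by a few SNPs
G = rng.integers(0, 3, size=(n, n_snp)).astype(float)
S = rng.normal(size=(n, n_band))              # stand-in VIS/NIR bands
beta = np.zeros(n_snp)
beta[:20] = rng.normal(size=20)               # 20 causal markers
y = G @ beta + rng.normal(scale=2.0, size=n)  # phenotype

# GEBV as the sum of estimated marker effects (ridge ~ rrBLUP-like);
# appending spectral bands treats them as extra 'polygenic' predictors
X = np.hstack([G, S])
model = Ridge(alpha=100.0).fit(X[:250], y[:250])
gebv = model.predict(X[250:])                 # candidates' GEBVs
print("prediction accuracy (corr):",
      np.corrcoef(gebv, y[250:])[0, 1].round(3))
```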
Funding: Science and Technology Project of Zunyi, Grant/Award Number: ZSKRPT[2023]3; Excellent Young Scientific and Technological Talents Foundation of Guizhou Province, Grant/Award Number: QKH platform talent (2021)5627; Science and Technology Top Talent Project of Guizhou Education Department, Grant/Award Number: QJJ2022(088); Guizhou Provincial Department of Education Colleges and Universities Science and Technology Innovation Team, Grant/Award Number: QJJ[2023]084; National Natural Science Foundation of China, Grant/Award Numbers: 62066049, 62221005, 61936001, 62176070; Department of Education of Guizhou Province, Grant/Award Number: QJJ[2023]084; Science and Technology Foundation of State Grid Corporation of China, Grant/Award Number: 1400-202357341A-1-1-ZN.
Abstract: Online streaming feature selection (OSFS), an online learning approach to handling streaming features, is critical for addressing high-dimensional data. In real big data applications, the patterns and distributions of streaming features constantly change over time due to dynamic data generation environments. However, existing OSFS methods rely on preset, fixed hyperparameters, which leads to poor selection performance when dynamic features are encountered. To make up for these shortcomings, the authors propose a novel OSFS algorithm based on vague set theory, named OSFS-Vague. Its main idea is to combine uncertainty and three-way decision theories to advance feature selection from the traditional dichotomous method to a trichotomous one. OSFS-Vague also improves the calculation of the correlation between features and labels. Moreover, OSFS-Vague uses the distance correlation coefficient to classify streaming features into relevant features, weakly redundant features, and redundant features. Finally, the relevant and weakly redundant features are filtered to obtain an optimal feature set. To evaluate the proposed OSFS-Vague, extensive empirical experiments were conducted on 11 datasets. The results demonstrate that OSFS-Vague outperforms six state-of-the-art OSFS algorithms in terms of selection accuracy and computational efficiency.
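The distance correlation coefficient and the trichotomous split can be sketched directly. Below, the sample distance correlation (Székely et al.) is computed from double-centered distance matrices, and an arriving feature is labeled relevant, weakly redundant, or redundant by two thresholds; the thresholds are illustrative, whereas OSFS-Vague derives its decision regions from vague-set theory.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays (Szekely et al.)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])       # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()  # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def three_way_label(feature, y, hi=0.3, lo=0.1):
    """Trichotomous decision on an arriving stream feature. The two
    thresholds here are illustrative; OSFS-Vague derives its decision
    regions from vague-set theory rather than fixed cutoffs."""
    dc = distance_correlation(feature, y)
    if dc >= hi:
        return "relevant"
    return "weakly redundant" if dc >= lo else "redundant"

rng = np.random.default_rng(0)
y = rng.normal(size=200)
f_strong = y + 0.1 * rng.normal(size=200)  # strongly label-related feature
f_noise = rng.normal(size=200)             # unrelated feature
print(three_way_label(f_strong, y), "|", three_way_label(f_noise, y))
```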