As a crucial data preprocessing method in data mining,feature selection(FS)can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected featu...As a crucial data preprocessing method in data mining,feature selection(FS)can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features.Evolutionary computing(EC)is promising for FS owing to its powerful search capability.However,in traditional EC-based methods,feature subsets are represented via a length-fixed individual encoding.It is ineffective for high-dimensional data,because it results in a huge search space and prohibitive training time.This work proposes a length-adaptive non-dominated sorting genetic algorithm(LA-NSGA)with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective highdimensional FS.In LA-NSGA,an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths,and a Pareto dominance-based length change operator is introduced to guide individuals to explore in promising search space adaptively.Moreover,a dominance-based local search method is employed for further improvement.The experimental results based on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.展开更多
Feature selection(FS)is an adequate data pre-processing method that reduces the dimensionality of datasets and is used in bioinformatics,finance,and medicine.Traditional FS approaches,however,frequently struggle to id...Feature selection(FS)is an adequate data pre-processing method that reduces the dimensionality of datasets and is used in bioinformatics,finance,and medicine.Traditional FS approaches,however,frequently struggle to identify the most important characteristics when dealing with high-dimensional information.To alleviate the imbalance of explore search ability and exploit search ability of the Whale Optimization Algorithm(WOA),we propose an enhanced WOA,namely SCLWOA,that incorporates sine chaos and comprehensive learning(CL)strategies.Among them,the CL mechanism contributes to improving the ability to explore.At the same time,the sine chaos is used to enhance the exploitation capacity and help the optimizer to gain a better initial solution.The hybrid performance of SCLWOA was evaluated comprehensively on IEEE CEC2017 test functions,including its qualitative analysis and comparisons with other optimizers.The results demonstrate that SCLWOA is superior to other algorithms in accuracy and converges faster than others.Besides,the variant of Binary SCLWOA(BSCLWOA)and other binary optimizers obtained by the mapping function was evaluated on 12 UCI data sets.Subsequently,BSCLWOA has proven very competitive in classification precision and feature reduction.展开更多
With the increasing dimensionality of the data,High-dimensional Feature Selection(HFS)becomes an increasingly dif-ficult task.It is not simple to find the best subset of features due to the breadth of the search space...With the increasing dimensionality of the data,High-dimensional Feature Selection(HFS)becomes an increasingly dif-ficult task.It is not simple to find the best subset of features due to the breadth of the search space and the intricacy of the interactions between features.Many of the Feature Selection(FS)approaches now in use for these problems perform sig-nificantly less well when faced with such intricate situations involving high-dimensional search spaces.It is demonstrated that meta-heuristic algorithms can provide sub-optimal results in an acceptable amount of time.This paper presents a new binary Boosted version of the Spider Wasp Optimizer(BSWO)called Binary Boosted SWO(BBSWO),which combines a number of successful and promising strategies,in order to deal with HFS.The shortcomings of the original BSWO,including early convergence,settling into local optimums,limited exploration and exploitation,and lack of population diversity,were addressed by the proposal of this new variant of SWO.The concept of chaos optimization is introduced in BSWO,where initialization is consistently produced by utilizing the properties of sine chaos mapping.A new convergence parameter was then incorporated into BSWO to achieve a promising balance between exploration and exploitation.Multiple exploration mechanisms were then applied in conjunction with several exploitation strategies to effectively enrich the search process of BSWO within the search space.Finally,quantum-based optimization was added to enhance the diversity of the search agents in BSWO.The proposed BBSWO not only offers the most suitable subset of features located,but it also lessens the data's redundancy structure.BBSWO was evaluated using the k-Nearest Neighbor(k-NN)classifier on 23 HFS problems from the biomedical domain taken from the UCI repository.The results were compared with those of traditional BSWO and other well-known meta-heuristics-based FS.The findings indicate that,in comparison to other competing techniques,the proposed BBSWO can,on average,identify the least significant subsets of features with efficient classification accuracy of the k-NN classifier.展开更多
Data collected in fields such as cybersecurity and biomedicine often encounter high dimensionality and class imbalance.To address the problem of low classification accuracy for minority class samples arising from nume...Data collected in fields such as cybersecurity and biomedicine often encounter high dimensionality and class imbalance.To address the problem of low classification accuracy for minority class samples arising from numerous irrelevant and redundant features in high-dimensional imbalanced data,we proposed a novel feature selection method named AMF-SGSK based on adaptive multi-filter and subspace-based gaining sharing knowledge.Firstly,the balanced dataset was obtained by random under-sampling.Secondly,combining the feature importance score with the AUC score for each filter method,we proposed a concept called feature hardness to judge the importance of feature,which could adaptively select the essential features.Finally,the optimal feature subset was obtained by gaining sharing knowledge in multiple subspaces.This approach effectively achieved dimensionality reduction for high-dimensional imbalanced data.The experiment results on 30 benchmark imbalanced datasets showed that AMF-SGSK performed better than other eight commonly used algorithms including BGWO and IG-SSO in terms of F1-score,AUC,and G-mean.The mean values of F1-score,AUC,and Gmean for AMF-SGSK are 0.950,0.967,and 0.965,respectively,achieving the highest among all algorithms.And the mean value of Gmean is higher than those of IG-PSO,ReliefF-GWO,and BGOA by 3.72%,11.12%,and 20.06%,respectively.Furthermore,the selected feature ratio is below 0.01 across the selected ten datasets,further demonstrating the proposed method’s overall superiority over competing approaches.AMF-SGSK could adaptively remove irrelevant and redundant features and effectively improve the classification accuracy of high-dimensional imbalanced data,providing scientific and technological references for practical applications.展开更多
Current high-dimensional feature screening methods still face significant challenges in handling mixed linear and nonlinear relationships,controlling redundant information,and improving model robustness.In this study,...Current high-dimensional feature screening methods still face significant challenges in handling mixed linear and nonlinear relationships,controlling redundant information,and improving model robustness.In this study,we propose a Dynamic Conditional Feature Screening(DCFS)method tailored for high-dimensional economic forecasting tasks.Our goal is to accurately identify key variables,enhance predictive performance,and provide both theoretical foundations and practical tools for macroeconomic modeling.The DCFS method constructs a comprehensive test statistic by integrating conditional mutual information with conditional regression error differences.By introducing a dynamic weighting mechanism,DCFS adaptively balances the linear and nonlinear contributions of features during the screening process.In addition,a dynamic thresholding mechanism is designed to effectively control the false discovery rate(FDR),thereby improving the stability and reliability of the screening results.On the theoretical front,we rigorously prove that the proposed method satisfies the sure screening property and rank consistency,ensuring accurate identification of the truly important feature set in high-dimensional settings.Simulation results demonstrate that under purely linear,purely nonlinear,and mixed dependency structures,DCFS consistently outperforms classical screening methods such as SIS,CSIS,and IG-SIS in terms of true positive rate(TPR),false discovery rate(FDR),and rank correlation.These results highlight the superior accuracy,robustness,and stability of our method.Furthermore,an empirical analysis based on the U.S.FRED-MD macroeconomic dataset confirms the practical value of DCFS in real-world forecasting tasks.The experimental results show that DCFS achieves lower prediction errors(RMSE and MAE)and higher R2 values in forecasting GDP growth.The selected key variables-including the Industrial Production Index(IP),Federal Funds Rate,Consumer Price Index(CPI),and Money Supply(M2)-possess clear economic interpretability,offering reliable support for economic forecasting and policy formulation.展开更多
Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic...Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic poses distinct challenges due to the language’s complex morphology,diglossia,and the scarcity of annotated datasets.This paper presents a hybrid approach to Arabic AES by combining text-based,vector-based,and embeddingbased similarity measures to improve essay scoring accuracy while minimizing the training data required.Using a large Arabic essay dataset categorized into thematic groups,the study conducted four experiments to evaluate the impact of feature selection,data size,and model performance.Experiment 1 established a baseline using a non-machine learning approach,selecting top-N correlated features to predict essay scores.The subsequent experiments employed 5-fold cross-validation.Experiment 2 showed that combining embedding-based,text-based,and vector-based features in a Random Forest(RF)model achieved an R2 of 88.92%and an accuracy of 83.3%within a 0.5-point tolerance.Experiment 3 further refined the feature selection process,demonstrating that 19 correlated features yielded optimal results,improving R2 to 88.95%.In Experiment 4,an optimal data efficiency training approach was introduced,where training data portions increased from 5%to 50%.The study found that using just 10%of the data achieved near-peak performance,with an R2 of 85.49%,emphasizing an effective trade-off between performance and computational costs.These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems,especially in low-resource environments,addressing linguistic challenges while ensuring efficient data usage.展开更多
The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduce...The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduces significant vulnerabilities,including fraud,money laundering,and market manipulation.Traditional anomaly detection techniques often fail to capture the relational and dynamic characteristics of financial data.Graph Neural Networks(GNNs),capable of modeling intricate interdependencies among entities,have emerged as a powerful framework for detecting subtle and sophisticated anomalies.However,the high-dimensionality and inherent noise of FinTech datasets demand robust feature selection strategies to improve model scalability,performance,and interpretability.This paper presents a comprehensive survey of GNN-based approaches for anomaly detection in FinTech,with an emphasis on the synergistic role of feature selection.We examine the theoretical foundations of GNNs,review state-of-the-art feature selection techniques,analyze their integration with GNNs,and categorize prevalent anomaly types in FinTech applications.In addition,we discuss practical implementation challenges,highlight representative case studies,and propose future research directions to advance the field of graph-based anomaly detection in financial systems.展开更多
High-dimensional data causes difficulties in machine learning due to high time consumption and large memory requirements.In particular,in amulti-label environment,higher complexity is required asmuch as the number of ...High-dimensional data causes difficulties in machine learning due to high time consumption and large memory requirements.In particular,in amulti-label environment,higher complexity is required asmuch as the number of labels.Moreover,an optimization problem that fully considers all dependencies between features and labels is difficult to solve.In this study,we propose a novel regression-basedmulti-label feature selectionmethod that integrates mutual information to better exploit the underlying data structure.By incorporating mutual information into the regression formulation,the model captures not only linear relationships but also complex non-linear dependencies.The proposed objective function simultaneously considers three types of relationships:(1)feature redundancy,(2)featurelabel relevance,and(3)inter-label dependency.These three quantities are computed usingmutual information,allowing the proposed formulation to capture nonlinear dependencies among variables.These three types of relationships are key factors in multi-label feature selection,and our method expresses them within a unified formulation,enabling efficient optimization while simultaneously accounting for all of them.To efficiently solve the proposed optimization problem under non-negativity constraints,we develop a gradient-based optimization algorithm with fast convergence.Theexperimental results on sevenmulti-label datasets show that the proposed method outperforms existingmulti-label feature selection techniques.展开更多
Most predictive maintenance studies have emphasized accuracy but provide very little focus on Interpretability or deployment readiness.This study improves on prior methods by developing a small yet robust system that ...Most predictive maintenance studies have emphasized accuracy but provide very little focus on Interpretability or deployment readiness.This study improves on prior methods by developing a small yet robust system that can predict when turbofan engines will fail.It uses the NASA CMAPSS dataset,which has over 200,000 engine cycles from260 engines.The process begins with systematic preprocessing,which includes imputation,outlier removal,scaling,and labelling of the remaining useful life.Dimensionality is reduced using a hybrid selection method that combines variance filtering,recursive elimination,and gradient-boosted importance scores,yielding a stable set of 10 informative sensors.To mitigate class imbalance,minority cases are oversampled,and class-weighted losses are applied during training.Benchmarking is carried out with logistic regression,gradient boosting,and a recurrent design that integrates gated recurrent units with long short-term memory networks.The Long Short-Term Memory–Gated Recurrent Unit(LSTM–GRU)hybrid achieved the strongest performance with an F1 score of 0.92,precision of 0.93,recall of 0.91,ReceiverOperating Characteristic–AreaUnder the Curve(ROC-AUC)of 0.97,andminority recall of 0.75.Interpretability testing using permutation importance and Shapley values indicates that sensors 13,15,and 11 are the most important indicators of engine wear.The proposed system combines imbalance handling,feature reduction,and Interpretability into a practical design suitable for real industrial settings.展开更多
Feature selection serves as a critical preprocessing step inmachine learning,focusing on identifying and preserving the most relevant features to improve the efficiency and performance of classification algorithms.Par...Feature selection serves as a critical preprocessing step inmachine learning,focusing on identifying and preserving the most relevant features to improve the efficiency and performance of classification algorithms.Particle Swarm Optimization has demonstrated significant potential in addressing feature selection challenges.However,there are inherent limitations in Particle Swarm Optimization,such as the delicate balance between exploration and exploitation,susceptibility to local optima,and suboptimal convergence rates,hinder its performance.To tackle these issues,this study introduces a novel Leveraged Opposition-Based Learning method within Fitness Landscape Particle Swarm Optimization,tailored for wrapper-based feature selection.The proposed approach integrates:(1)a fitness-landscape adaptive strategy to dynamically balance exploration and exploitation,(2)the lever principle within Opposition-Based Learning to improve search efficiency,and(3)a Local Selection and Re-optimization mechanism combined with random perturbation to expedite convergence and enhance the quality of the optimal feature subset.The effectiveness of is rigorously evaluated on 24 benchmark datasets and compared against 13 advancedmetaheuristic algorithms.Experimental results demonstrate that the proposed method outperforms the compared algorithms in classification accuracy on over half of the datasets,whilst also significantly reducing the number of selected features.These findings demonstrate its effectiveness and robustness in feature selection tasks.展开更多
Unconfined Compressive Strength(UCS)is a key parameter for the assessment of the stability and performance of stabilized soils,yet traditional laboratory testing is both time and resource intensive.In this study,an in...Unconfined Compressive Strength(UCS)is a key parameter for the assessment of the stability and performance of stabilized soils,yet traditional laboratory testing is both time and resource intensive.In this study,an interpretable machine learning approach to UCS prediction is presented,pairing five models(Random Forest(RF),Gradient Boosting(GB),Extreme Gradient Boosting(XGB),CatBoost,and K-Nearest Neighbors(KNN))with SHapley Additive exPlanations(SHAP)for enhanced interpretability and to guide feature removal.A complete dataset of 12 geotechnical and chemical parameters,i.e.,Atterberg limits,compaction properties,stabilizer chemistry,dosage,curing time,was used to train and test the models.R2,RMSE,MSE,and MAE were used to assess performance.Initial results with all 12 features indicated that boosting-based models(GB,XGB,CatBoost)exhibited the highest predictive accuracy(R^(2)=0.93)with satisfactory generalization on test data,followed by RF and KNN.SHAP analysis consistently picked CaO content,curing time,stabilizer dosage,and compaction parameters as the most important features,aligning with established soil stabilization mechanisms.Models were then re-trained on the top 8 and top 5 SHAP-ranked features.Interestingly,GB,XGB,and CatBoost maintained comparable accuracy with reduced input sets,while RF was moderately sensitive and KNN was somewhat better owing to reduced dimensionality.The findings confirm that feature reduction through SHAP enables cost-effective UCS prediction through the reduction of laboratory test requirements without significant accuracy loss.The suggested hybrid approach offers an explainable,interpretable,and cost-effective tool for geotechnical engineering practice.展开更多
Existing feature selection methods for intrusion detection systems in the Industrial Internet of Things often suffer from local optimality and high computational complexity.These challenges hinder traditional IDS from...Existing feature selection methods for intrusion detection systems in the Industrial Internet of Things often suffer from local optimality and high computational complexity.These challenges hinder traditional IDS from effectively extracting features while maintaining detection accuracy.This paper proposes an industrial Internet ofThings intrusion detection feature selection algorithm based on an improved whale optimization algorithm(GSLDWOA).The aim is to address the problems that feature selection algorithms under high-dimensional data are prone to,such as local optimality,long detection time,and reduced accuracy.First,the initial population’s diversity is increased using the Gaussian Mutation mechanism.Then,Non-linear Shrinking Factor balances global exploration and local development,avoiding premature convergence.Lastly,Variable-step Levy Flight operator and Dynamic Differential Evolution strategy are introduced to improve the algorithm’s search efficiency and convergence accuracy in highdimensional feature space.Experiments on the NSL-KDD and WUSTL-IIoT-2021 datasets demonstrate that the feature subset selected by GSLDWOA significantly improves detection performance.Compared to the traditional WOA algorithm,the detection rate and F1-score increased by 3.68%and 4.12%.On the WUSTL-IIoT-2021 dataset,accuracy,recall,and F1-score all exceed 99.9%.展开更多
Multi-label feature selection(MFS)is a crucial dimensionality reduction technique aimed at identifying informative features associated with multiple labels.However,traditional centralized methods face significant chal...Multi-label feature selection(MFS)is a crucial dimensionality reduction technique aimed at identifying informative features associated with multiple labels.However,traditional centralized methods face significant challenges in privacy-sensitive and distributed settings,often neglecting label dependencies and suffering from low computational efficiency.To address these issues,we introduce a novel framework,Fed-MFSDHBCPSO—federated MFS via dual-layer hybrid breeding cooperative particle swarm optimization algorithm with manifold and sparsity regularization(DHBCPSO-MSR).Leveraging the federated learning paradigm,Fed-MFSDHBCPSO allows clients to perform local feature selection(FS)using DHBCPSO-MSR.Locally selected feature subsets are encrypted with differential privacy(DP)and transmitted to a central server,where they are securely aggregated and refined through secure multi-party computation(SMPC)until global convergence is achieved.Within each client,DHBCPSO-MSR employs a dual-layer FS strategy.The inner layer constructs sample and label similarity graphs,generates Laplacian matrices to capture the manifold structure between samples and labels,and applies L2,1-norm regularization to sparsify the feature subset,yielding an optimized feature weight matrix.The outer layer uses a hybrid breeding cooperative particle swarm optimization algorithm to further refine the feature weight matrix and identify the optimal feature subset.The updated weight matrix is then fed back to the inner layer for further optimization.Comprehensive experiments on multiple real-world multi-label datasets demonstrate that Fed-MFSDHBCPSO consistently outperforms both centralized and federated baseline methods across several key evaluation metrics.展开更多
Tremendous amount of data are being generated and saved in many complex engineering and social systems every day.It is significant and feasible to utilize the big data to make better decisions by machine learning tech...Tremendous amount of data are being generated and saved in many complex engineering and social systems every day.It is significant and feasible to utilize the big data to make better decisions by machine learning techniques. In this paper, we focus on batch reinforcement learning(RL) algorithms for discounted Markov decision processes(MDPs) with large discrete or continuous state spaces, aiming to learn the best possible policy given a fixed amount of training data. The batch RL algorithms with handcrafted feature representations work well for low-dimensional MDPs. However, for many real-world RL tasks which often involve high-dimensional state spaces, it is difficult and even infeasible to use feature engineering methods to design features for value function approximation. To cope with high-dimensional RL problems, the desire to obtain data-driven features has led to a lot of works in incorporating feature selection and feature learning into traditional batch RL algorithms. In this paper, we provide a comprehensive survey on automatic feature selection and unsupervised feature learning for high-dimensional batch RL. Moreover, we present recent theoretical developments on applying statistical learning to establish finite-sample error bounds for batch RL algorithms based on weighted Lpnorms. Finally, we derive some future directions in the research of RL algorithms, theories and applications.展开更多
Speech emotion recognition(SER)uses acoustic analysis to find features for emotion recognition and examines variations in voice that are caused by emotions.The number of features acquired with acoustic analysis is ext...Speech emotion recognition(SER)uses acoustic analysis to find features for emotion recognition and examines variations in voice that are caused by emotions.The number of features acquired with acoustic analysis is extremely high,so we introduce a hybrid filter-wrapper feature selection algorithm based on an improved equilibrium optimizer for constructing an emotion recognition system.The proposed algorithm implements multi-objective emotion recognition with the minimum number of selected features and maximum accuracy.First,we use the information gain and Fisher Score to sort the features extracted from signals.Then,we employ a multi-objective ranking method to evaluate these features and assign different importance to them.Features with high rankings have a large probability of being selected.Finally,we propose a repair strategy to address the problem of duplicate solutions in multi-objective feature selection,which can improve the diversity of solutions and avoid falling into local traps.Using random forest and K-nearest neighbor classifiers,four English speech emotion datasets are employed to test the proposed algorithm(MBEO)as well as other multi-objective emotion identification techniques.The results illustrate that it performs well in inverted generational distance,hypervolume,Pareto solutions,and execution time,and MBEO is appropriate for high-dimensional English SER.展开更多
The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more e...The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more efficient and reliable intrusion detection systems(IDSs).However,the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs.Many researchers used data preprocessing techniques such as feature selection and normalization to overcome such issues.While most of these researchers reported the success of these preprocessing techniques on a shallow level,very few studies have been performed on their effects on a wider scale.Furthermore,the performance of an IDS model is subject to not only the utilized preprocessing techniques but also the dataset and the ML/DL algorithm used,which most of the existing studies give little emphasis on.Thus,this study provides an in-depth analysis of feature selection and normalization effects on IDS models built using three IDS datasets:NSL-KDD,UNSW-NB15,and CSE–CIC–IDS2018,and various AI algorithms.A wrapper-based approach,which tends to give superior performance,and min-max normalization methods were used for feature selection and normalization,respectively.Numerous IDS models were implemented using the full and feature-selected copies of the datasets with and without normalization.The models were evaluated using popular evaluation metrics in IDS modeling,intra-and inter-model comparisons were performed between models and with state-of-the-art works.Random forest(RF)models performed better on NSL-KDD and UNSW-NB15 datasets with accuracies of 99.86%and 96.01%,respectively,whereas artificial neural network(ANN)achieved the best accuracy of 95.43%on the CSE–CIC–IDS2018 dataset.The RF models also achieved an excellent performance compared to recent works.The results show that normalization and feature selection positively affect IDS modeling.Furthermore,while feature selection benefits simpler algorithms(such as RF),normalization is more useful for complex algorithms like ANNs and deep neural networks(DNNs),and algorithms such as Naive Bayes are unsuitable for IDS modeling.The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDS than the NSL-KDD dataset.Our findings suggest that prioritizing robust algorithms like RF,alongside complex models such as ANN and DNN,can significantly enhance IDS performance.These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.展开更多
Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software ...Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software defect prediction can be effectively performed using traditional features,but there are some redundant or irrelevant features in them(the presence or absence of this feature has little effect on the prediction results).These problems can be solved using feature selection.However,existing feature selection methods have shortcomings such as insignificant dimensionality reduction effect and low classification accuracy of the selected optimal feature subset.In order to reduce the impact of these shortcomings,this paper proposes a new feature selection method Cubic TraverseMa Beluga whale optimization algorithm(CTMBWO)based on the improved Beluga whale optimization algorithm(BWO).The goal of this study is to determine how well the CTMBWO can extract the features that are most important for correctly predicting software defects,improve the accuracy of fault prediction,reduce the number of the selected feature and mitigate the risk of overfitting,thereby achieving more efficient resource utilization and better distribution of test workload.The CTMBWO comprises three main stages:preprocessing the dataset,selecting relevant features,and evaluating the classification performance of the model.The novel feature selection method can effectively improve the performance of SDP.This study performs experiments on two software defect datasets(PROMISE,NASA)and shows the method’s classification performance using four detailed evaluation metrics,Accuracy,F1-score,MCC,AUC and Recall.The results indicate that the approach presented in this paper achieves outstanding classification performance on both datasets and has significant improvement over the baseline models.展开更多
Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify sp...Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify specific flaws/diseases for diagnosis.The primary concern of ML applications is the precise selection of flexible image features for pattern detection and region classification.Most of the extracted image features are irrelevant and lead to an increase in computation time.Therefore,this article uses an analytical learning paradigm to design a Congruent Feature Selection Method to select the most relevant image features.This process trains the learning paradigm using similarity and correlation-based features over different textural intensities and pixel distributions.The similarity between the pixels over the various distribution patterns with high indexes is recommended for disease diagnosis.Later,the correlation based on intensity and distribution is analyzed to improve the feature selection congruency.Therefore,the more congruent pixels are sorted in the descending order of the selection,which identifies better regions than the distribution.Now,the learning paradigm is trained using intensity and region-based similarity to maximize the chances of selection.Therefore,the probability of feature selection,regardless of the textures and medical image patterns,is improved.This process enhances the performance of ML applications for different medical image processing.The proposed method improves the accuracy,precision,and training rate by 13.19%,10.69%,and 11.06%,respectively,compared to other models for the selected dataset.The mean error and selection time is also reduced by 12.56%and 13.56%,respectively,compared to the same models and dataset.展开更多
This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—cove...This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—covering momentum,volatility,volume,and trend-related technical indicators—are subjected to three distinct feature selection approaches.Specifically,mutual information(MI),recursive feature elimination(RFE),and random forest importance(RFI).By extracting an optimal set of 20 predictors,the proposed framework aims to mitigate redundancy and overfitting while enhancing interpretability.These feature subsets are integrated into support vector regression(SVR),Huber regressors,and k-nearest neighbors(KNN)models to forecast the prices of three leading cryptocurrencies—Bitcoin(BTC/USDT),Ethereum(ETH/USDT),and Binance Coin(BNB/USDT)—across horizons ranging from 1 to 20 days.Model evaluation employs the coefficient of determination(R2)and the root mean squared logarithmic error(RMSLE),alongside a walk-forward validation scheme to approximate real-world trading contexts.Empirical results indicate that incorporating momentum and volatility measures substantially improves predictive accuracy,with particularly pronounced effects observed at longer forecast windows.Moreover,indicators related to volume and trend provide incremental benefits in select market conditions.Notably,an 80%–85% reduction in the original feature set frequently maintains or enhances model performance relative to the complete indicator set.These findings highlight the critical role of targeted feature selection in addressing high-dimensional financial data challenges while preserving model robustness.This research advances the field of cryptocurrency forecasting by offering a rigorous comparison of feature selection methods and their effects on multiple digital assets and prediction horizons.The outcomes highlight the importance of dimension-reduction strategies in developing more efficient and resilient forecasting algorithms.Future efforts should incorporate high-frequency data and explore alternative selection techniques to further refine predictive accuracy in this highly volatile domain.展开更多
The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure t...The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure to sophisticated cyber threats.The diversity of devices,high data volumes,and real-time operational demands complicate security,requiring not just robust intrusion detection but also effective feature selection for relevance and scalability.Traditional Machine Learning(ML)based Intrusion Detection System(IDS)improves detection but often lacks interpretability,limiting stakeholder trust and timely responses.Moreover,centralized feature selection in conventional IDS compromises data privacy and fails to accommodate the decentralized nature of smart city infrastructures.To address these limitations,this research introduces an Interpretable Federated Learning(FL)based Cyber Intrusion Detection model tailored for smart city applications.The proposed system leverages privacy-preserving feature selection,where each client node independently identifies top-ranked features using ML models integrated with SHAP-based explainability.These local feature subsets are then aggregated at a central server to construct a global model without compromising sensitive data.Furthermore,the global model is enhanced with Explainable AI(XAI)techniques such as SHAP and LIME,offering both global interpretability and instance-level transparency for cyber threat decisions.Experimental results demonstrate that the proposed global model achieves a high detection accuracy of 98.51%,with a significantly low miss rate of 1.49%,outperforming existing models while ensuring explainability,privacy,and scalability across smart city infrastructures.展开更多
基金supported in part by the National Natural Science Foundation of China(62172065,62072060)。
文摘As a crucial data preprocessing method in data mining,feature selection(FS)can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features.Evolutionary computing(EC)is promising for FS owing to its powerful search capability.However,in traditional EC-based methods,feature subsets are represented via a length-fixed individual encoding.It is ineffective for high-dimensional data,because it results in a huge search space and prohibitive training time.This work proposes a length-adaptive non-dominated sorting genetic algorithm(LA-NSGA)with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective highdimensional FS.In LA-NSGA,an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths,and a Pareto dominance-based length change operator is introduced to guide individuals to explore in promising search space adaptively.Moreover,a dominance-based local search method is employed for further improvement.The experimental results based on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.
基金This work is supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2023R193)Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.This work was supported in part by the Natural Science Foundation of Zhejiang Province(LZ22F020005)+4 种基金National Natural Science Foundation of China(62076185,U1809209)Natural Science Foundation of Zhejiang Province(LD21F020001,LZ22F020005)National Natural Science Foundation of China(62076185)Key Laboratory of Intelligent Image Processing and Analysis,Wenzhou,China(2021HZSY0071)Wenzhou Major Scientific and Technological Innovation Project(ZY2019020).
文摘Feature selection(FS)is an adequate data pre-processing method that reduces the dimensionality of datasets and is used in bioinformatics,finance,and medicine.Traditional FS approaches,however,frequently struggle to identify the most important characteristics when dealing with high-dimensional information.To alleviate the imbalance of explore search ability and exploit search ability of the Whale Optimization Algorithm(WOA),we propose an enhanced WOA,namely SCLWOA,that incorporates sine chaos and comprehensive learning(CL)strategies.Among them,the CL mechanism contributes to improving the ability to explore.At the same time,the sine chaos is used to enhance the exploitation capacity and help the optimizer to gain a better initial solution.The hybrid performance of SCLWOA was evaluated comprehensively on IEEE CEC2017 test functions,including its qualitative analysis and comparisons with other optimizers.The results demonstrate that SCLWOA is superior to other algorithms in accuracy and converges faster than others.Besides,the variant of Binary SCLWOA(BSCLWOA)and other binary optimizers obtained by the mapping function was evaluated on 12 UCI data sets.Subsequently,BSCLWOA has proven very competitive in classification precision and feature reduction.
基金supported from the Deanship of Research and Graduate Studies(DRG)at Ajman University,Ajman,UAE(Grant No.2023-IRG-ENIT-34).
文摘With the increasing dimensionality of the data,High-dimensional Feature Selection(HFS)becomes an increasingly dif-ficult task.It is not simple to find the best subset of features due to the breadth of the search space and the intricacy of the interactions between features.Many of the Feature Selection(FS)approaches now in use for these problems perform sig-nificantly less well when faced with such intricate situations involving high-dimensional search spaces.It is demonstrated that meta-heuristic algorithms can provide sub-optimal results in an acceptable amount of time.This paper presents a new binary Boosted version of the Spider Wasp Optimizer(BSWO)called Binary Boosted SWO(BBSWO),which combines a number of successful and promising strategies,in order to deal with HFS.The shortcomings of the original BSWO,including early convergence,settling into local optimums,limited exploration and exploitation,and lack of population diversity,were addressed by the proposal of this new variant of SWO.The concept of chaos optimization is introduced in BSWO,where initialization is consistently produced by utilizing the properties of sine chaos mapping.A new convergence parameter was then incorporated into BSWO to achieve a promising balance between exploration and exploitation.Multiple exploration mechanisms were then applied in conjunction with several exploitation strategies to effectively enrich the search process of BSWO within the search space.Finally,quantum-based optimization was added to enhance the diversity of the search agents in BSWO.The proposed BBSWO not only offers the most suitable subset of features located,but it also lessens the data's redundancy structure.BBSWO was evaluated using the k-Nearest Neighbor(k-NN)classifier on 23 HFS problems from the biomedical domain taken from the UCI repository.The results were compared with those of traditional BSWO and other well-known meta-heuristics-based FS.The findings indicate that,in comparison to other competing techniques,the proposed BBSWO can,on average,identify the least significant subsets of features with efficient classification accuracy of the k-NN classifier.
基金supported by Fundamental Research Program of Shanxi Province(Nos.202203021211088,202403021212254,202403021221109)Graduate Research Innovation Project in Shanxi Province(No.2024KY616).
文摘Data collected in fields such as cybersecurity and biomedicine often encounter high dimensionality and class imbalance.To address the problem of low classification accuracy for minority class samples arising from numerous irrelevant and redundant features in high-dimensional imbalanced data,we proposed a novel feature selection method named AMF-SGSK based on adaptive multi-filter and subspace-based gaining sharing knowledge.Firstly,the balanced dataset was obtained by random under-sampling.Secondly,combining the feature importance score with the AUC score for each filter method,we proposed a concept called feature hardness to judge the importance of feature,which could adaptively select the essential features.Finally,the optimal feature subset was obtained by gaining sharing knowledge in multiple subspaces.This approach effectively achieved dimensionality reduction for high-dimensional imbalanced data.The experiment results on 30 benchmark imbalanced datasets showed that AMF-SGSK performed better than other eight commonly used algorithms including BGWO and IG-SSO in terms of F1-score,AUC,and G-mean.The mean values of F1-score,AUC,and Gmean for AMF-SGSK are 0.950,0.967,and 0.965,respectively,achieving the highest among all algorithms.And the mean value of Gmean is higher than those of IG-PSO,ReliefF-GWO,and BGOA by 3.72%,11.12%,and 20.06%,respectively.Furthermore,the selected feature ratio is below 0.01 across the selected ten datasets,further demonstrating the proposed method’s overall superiority over competing approaches.AMF-SGSK could adaptively remove irrelevant and redundant features and effectively improve the classification accuracy of high-dimensional imbalanced data,providing scientific and technological references for practical applications.
文摘Current high-dimensional feature screening methods still face significant challenges in handling mixed linear and nonlinear relationships,controlling redundant information,and improving model robustness.In this study,we propose a Dynamic Conditional Feature Screening(DCFS)method tailored for high-dimensional economic forecasting tasks.Our goal is to accurately identify key variables,enhance predictive performance,and provide both theoretical foundations and practical tools for macroeconomic modeling.The DCFS method constructs a comprehensive test statistic by integrating conditional mutual information with conditional regression error differences.By introducing a dynamic weighting mechanism,DCFS adaptively balances the linear and nonlinear contributions of features during the screening process.In addition,a dynamic thresholding mechanism is designed to effectively control the false discovery rate(FDR),thereby improving the stability and reliability of the screening results.On the theoretical front,we rigorously prove that the proposed method satisfies the sure screening property and rank consistency,ensuring accurate identification of the truly important feature set in high-dimensional settings.Simulation results demonstrate that under purely linear,purely nonlinear,and mixed dependency structures,DCFS consistently outperforms classical screening methods such as SIS,CSIS,and IG-SIS in terms of true positive rate(TPR),false discovery rate(FDR),and rank correlation.These results highlight the superior accuracy,robustness,and stability of our method.Furthermore,an empirical analysis based on the U.S.FRED-MD macroeconomic dataset confirms the practical value of DCFS in real-world forecasting tasks.The experimental results show that DCFS achieves lower prediction errors(RMSE and MAE)and higher R2 values in forecasting GDP growth.The selected key variables-including the Industrial Production Index(IP),Federal Funds Rate,Consumer Price Index(CPI),and Money Supply(M2)-possess clear economic interpretability,offering reliable support for economic forecasting and policy formulation.
基金funded by Deanship of Graduate studies and Scientific Research at Jouf University under grant No.(DGSSR-2024-02-01264).
文摘Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic poses distinct challenges due to the language’s complex morphology,diglossia,and the scarcity of annotated datasets.This paper presents a hybrid approach to Arabic AES by combining text-based,vector-based,and embeddingbased similarity measures to improve essay scoring accuracy while minimizing the training data required.Using a large Arabic essay dataset categorized into thematic groups,the study conducted four experiments to evaluate the impact of feature selection,data size,and model performance.Experiment 1 established a baseline using a non-machine learning approach,selecting top-N correlated features to predict essay scores.The subsequent experiments employed 5-fold cross-validation.Experiment 2 showed that combining embedding-based,text-based,and vector-based features in a Random Forest(RF)model achieved an R2 of 88.92%and an accuracy of 83.3%within a 0.5-point tolerance.Experiment 3 further refined the feature selection process,demonstrating that 19 correlated features yielded optimal results,improving R2 to 88.95%.In Experiment 4,an optimal data efficiency training approach was introduced,where training data portions increased from 5%to 50%.The study found that using just 10%of the data achieved near-peak performance,with an R2 of 85.49%,emphasizing an effective trade-off between performance and computational costs.These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems,especially in low-resource environments,addressing linguistic challenges while ensuring efficient data usage.
基金supported by Ho Chi Minh City Open University,Vietnam under grant number E2024.02.1CD and Suan Sunandha Rajabhat University,Thailand.
文摘The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduces significant vulnerabilities,including fraud,money laundering,and market manipulation.Traditional anomaly detection techniques often fail to capture the relational and dynamic characteristics of financial data.Graph Neural Networks(GNNs),capable of modeling intricate interdependencies among entities,have emerged as a powerful framework for detecting subtle and sophisticated anomalies.However,the high-dimensionality and inherent noise of FinTech datasets demand robust feature selection strategies to improve model scalability,performance,and interpretability.This paper presents a comprehensive survey of GNN-based approaches for anomaly detection in FinTech,with an emphasis on the synergistic role of feature selection.We examine the theoretical foundations of GNNs,review state-of-the-art feature selection techniques,analyze their integration with GNNs,and categorize prevalent anomaly types in FinTech applications.In addition,we discuss practical implementation challenges,highlight representative case studies,and propose future research directions to advance the field of graph-based anomaly detection in financial systems.
基金supported by Basic Science Research Program through the National Research Foundation of Korea(NRF)funded by the Ministry of Education(RS-2020-NR049579).
文摘High-dimensional data causes difficulties in machine learning due to high time consumption and large memory requirements.In particular,in amulti-label environment,higher complexity is required asmuch as the number of labels.Moreover,an optimization problem that fully considers all dependencies between features and labels is difficult to solve.In this study,we propose a novel regression-basedmulti-label feature selectionmethod that integrates mutual information to better exploit the underlying data structure.By incorporating mutual information into the regression formulation,the model captures not only linear relationships but also complex non-linear dependencies.The proposed objective function simultaneously considers three types of relationships:(1)feature redundancy,(2)featurelabel relevance,and(3)inter-label dependency.These three quantities are computed usingmutual information,allowing the proposed formulation to capture nonlinear dependencies among variables.These three types of relationships are key factors in multi-label feature selection,and our method expresses them within a unified formulation,enabling efficient optimization while simultaneously accounting for all of them.To efficiently solve the proposed optimization problem under non-negativity constraints,we develop a gradient-based optimization algorithm with fast convergence.Theexperimental results on sevenmulti-label datasets show that the proposed method outperforms existingmulti-label feature selection techniques.
基金supported by the Deanship of Scientific Research,Vice Presidency for Graduate Studies and Scientific Research,King Faisal University,Saudi Arabia Grant No.KFU253765.
文摘Most predictive maintenance studies have emphasized accuracy but provide very little focus on Interpretability or deployment readiness.This study improves on prior methods by developing a small yet robust system that can predict when turbofan engines will fail.It uses the NASA CMAPSS dataset,which has over 200,000 engine cycles from260 engines.The process begins with systematic preprocessing,which includes imputation,outlier removal,scaling,and labelling of the remaining useful life.Dimensionality is reduced using a hybrid selection method that combines variance filtering,recursive elimination,and gradient-boosted importance scores,yielding a stable set of 10 informative sensors.To mitigate class imbalance,minority cases are oversampled,and class-weighted losses are applied during training.Benchmarking is carried out with logistic regression,gradient boosting,and a recurrent design that integrates gated recurrent units with long short-term memory networks.The Long Short-Term Memory–Gated Recurrent Unit(LSTM–GRU)hybrid achieved the strongest performance with an F1 score of 0.92,precision of 0.93,recall of 0.91,ReceiverOperating Characteristic–AreaUnder the Curve(ROC-AUC)of 0.97,andminority recall of 0.75.Interpretability testing using permutation importance and Shapley values indicates that sensors 13,15,and 11 are the most important indicators of engine wear.The proposed system combines imbalance handling,feature reduction,and Interpretability into a practical design suitable for real industrial settings.
基金supported by National Natural Science Foundation of China(62106092)Natural Science Foundation of Fujian Province(2024J01822,2024J01820,2022J01916)Natural Science Foundation of Zhangzhou City(ZZ2024J28).
文摘Feature selection serves as a critical preprocessing step inmachine learning,focusing on identifying and preserving the most relevant features to improve the efficiency and performance of classification algorithms.Particle Swarm Optimization has demonstrated significant potential in addressing feature selection challenges.However,there are inherent limitations in Particle Swarm Optimization,such as the delicate balance between exploration and exploitation,susceptibility to local optima,and suboptimal convergence rates,hinder its performance.To tackle these issues,this study introduces a novel Leveraged Opposition-Based Learning method within Fitness Landscape Particle Swarm Optimization,tailored for wrapper-based feature selection.The proposed approach integrates:(1)a fitness-landscape adaptive strategy to dynamically balance exploration and exploitation,(2)the lever principle within Opposition-Based Learning to improve search efficiency,and(3)a Local Selection and Re-optimization mechanism combined with random perturbation to expedite convergence and enhance the quality of the optimal feature subset.The effectiveness of is rigorously evaluated on 24 benchmark datasets and compared against 13 advancedmetaheuristic algorithms.Experimental results demonstrate that the proposed method outperforms the compared algorithms in classification accuracy on over half of the datasets,whilst also significantly reducing the number of selected features.These findings demonstrate its effectiveness and robustness in feature selection tasks.
文摘Unconfined Compressive Strength(UCS)is a key parameter for the assessment of the stability and performance of stabilized soils,yet traditional laboratory testing is both time and resource intensive.In this study,an interpretable machine learning approach to UCS prediction is presented,pairing five models(Random Forest(RF),Gradient Boosting(GB),Extreme Gradient Boosting(XGB),CatBoost,and K-Nearest Neighbors(KNN))with SHapley Additive exPlanations(SHAP)for enhanced interpretability and to guide feature removal.A complete dataset of 12 geotechnical and chemical parameters,i.e.,Atterberg limits,compaction properties,stabilizer chemistry,dosage,curing time,was used to train and test the models.R2,RMSE,MSE,and MAE were used to assess performance.Initial results with all 12 features indicated that boosting-based models(GB,XGB,CatBoost)exhibited the highest predictive accuracy(R^(2)=0.93)with satisfactory generalization on test data,followed by RF and KNN.SHAP analysis consistently picked CaO content,curing time,stabilizer dosage,and compaction parameters as the most important features,aligning with established soil stabilization mechanisms.Models were then re-trained on the top 8 and top 5 SHAP-ranked features.Interestingly,GB,XGB,and CatBoost maintained comparable accuracy with reduced input sets,while RF was moderately sensitive and KNN was somewhat better owing to reduced dimensionality.The findings confirm that feature reduction through SHAP enables cost-effective UCS prediction through the reduction of laboratory test requirements without significant accuracy loss.The suggested hybrid approach offers an explainable,interpretable,and cost-effective tool for geotechnical engineering practice.
基金supported by the Major Science and Technology Programs in Henan Province(No.241100210100)Henan Provincial Science and Technology Research Project(No.252102211085,No.252102211105)+3 种基金Endogenous Security Cloud Network Convergence R&D Center(No.602431011PQ1)The Special Project for Research and Development in Key Areas of Guangdong Province(No.2021ZDZX1098)The Stabilization Support Program of Science,Technology and Innovation Commission of Shenzhen Municipality(No.20231128083944001)The Key scientific research projects of Henan higher education institutions(No.24A520042).
文摘Existing feature selection methods for intrusion detection systems in the Industrial Internet of Things often suffer from local optimality and high computational complexity.These challenges hinder traditional IDS from effectively extracting features while maintaining detection accuracy.This paper proposes an industrial Internet ofThings intrusion detection feature selection algorithm based on an improved whale optimization algorithm(GSLDWOA).The aim is to address the problems that feature selection algorithms under high-dimensional data are prone to,such as local optimality,long detection time,and reduced accuracy.First,the initial population’s diversity is increased using the Gaussian Mutation mechanism.Then,Non-linear Shrinking Factor balances global exploration and local development,avoiding premature convergence.Lastly,Variable-step Levy Flight operator and Dynamic Differential Evolution strategy are introduced to improve the algorithm’s search efficiency and convergence accuracy in highdimensional feature space.Experiments on the NSL-KDD and WUSTL-IIoT-2021 datasets demonstrate that the feature subset selected by GSLDWOA significantly improves detection performance.Compared to the traditional WOA algorithm,the detection rate and F1-score increased by 3.68%and 4.12%.On the WUSTL-IIoT-2021 dataset,accuracy,recall,and F1-score all exceed 99.9%.
文摘Multi-label feature selection(MFS)is a crucial dimensionality reduction technique aimed at identifying informative features associated with multiple labels.However,traditional centralized methods face significant challenges in privacy-sensitive and distributed settings,often neglecting label dependencies and suffering from low computational efficiency.To address these issues,we introduce a novel framework,Fed-MFSDHBCPSO—federated MFS via dual-layer hybrid breeding cooperative particle swarm optimization algorithm with manifold and sparsity regularization(DHBCPSO-MSR).Leveraging the federated learning paradigm,Fed-MFSDHBCPSO allows clients to perform local feature selection(FS)using DHBCPSO-MSR.Locally selected feature subsets are encrypted with differential privacy(DP)and transmitted to a central server,where they are securely aggregated and refined through secure multi-party computation(SMPC)until global convergence is achieved.Within each client,DHBCPSO-MSR employs a dual-layer FS strategy.The inner layer constructs sample and label similarity graphs,generates Laplacian matrices to capture the manifold structure between samples and labels,and applies L2,1-norm regularization to sparsify the feature subset,yielding an optimized feature weight matrix.The outer layer uses a hybrid breeding cooperative particle swarm optimization algorithm to further refine the feature weight matrix and identify the optimal feature subset.The updated weight matrix is then fed back to the inner layer for further optimization.Comprehensive experiments on multiple real-world multi-label datasets demonstrate that Fed-MFSDHBCPSO consistently outperforms both centralized and federated baseline methods across several key evaluation metrics.
基金supported by National Natural Science Foundation of China(Nos.61034002,61233001 and 61273140)
文摘Tremendous amount of data are being generated and saved in many complex engineering and social systems every day.It is significant and feasible to utilize the big data to make better decisions by machine learning techniques. In this paper, we focus on batch reinforcement learning(RL) algorithms for discounted Markov decision processes(MDPs) with large discrete or continuous state spaces, aiming to learn the best possible policy given a fixed amount of training data. The batch RL algorithms with handcrafted feature representations work well for low-dimensional MDPs. However, for many real-world RL tasks which often involve high-dimensional state spaces, it is difficult and even infeasible to use feature engineering methods to design features for value function approximation. To cope with high-dimensional RL problems, the desire to obtain data-driven features has led to a lot of works in incorporating feature selection and feature learning into traditional batch RL algorithms. In this paper, we provide a comprehensive survey on automatic feature selection and unsupervised feature learning for high-dimensional batch RL. Moreover, we present recent theoretical developments on applying statistical learning to establish finite-sample error bounds for batch RL algorithms based on weighted Lpnorms. Finally, we derive some future directions in the research of RL algorithms, theories and applications.
文摘Speech emotion recognition(SER)uses acoustic analysis to find features for emotion recognition and examines variations in voice that are caused by emotions.The number of features acquired with acoustic analysis is extremely high,so we introduce a hybrid filter-wrapper feature selection algorithm based on an improved equilibrium optimizer for constructing an emotion recognition system.The proposed algorithm implements multi-objective emotion recognition with the minimum number of selected features and maximum accuracy.First,we use the information gain and Fisher Score to sort the features extracted from signals.Then,we employ a multi-objective ranking method to evaluate these features and assign different importance to them.Features with high rankings have a large probability of being selected.Finally,we propose a repair strategy to address the problem of duplicate solutions in multi-objective feature selection,which can improve the diversity of solutions and avoid falling into local traps.Using random forest and K-nearest neighbor classifiers,four English speech emotion datasets are employed to test the proposed algorithm(MBEO)as well as other multi-objective emotion identification techniques.The results illustrate that it performs well in inverted generational distance,hypervolume,Pareto solutions,and execution time,and MBEO is appropriate for high-dimensional English SER.
文摘The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more efficient and reliable intrusion detection systems(IDSs).However,the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs.Many researchers used data preprocessing techniques such as feature selection and normalization to overcome such issues.While most of these researchers reported the success of these preprocessing techniques on a shallow level,very few studies have been performed on their effects on a wider scale.Furthermore,the performance of an IDS model is subject to not only the utilized preprocessing techniques but also the dataset and the ML/DL algorithm used,which most of the existing studies give little emphasis on.Thus,this study provides an in-depth analysis of feature selection and normalization effects on IDS models built using three IDS datasets:NSL-KDD,UNSW-NB15,and CSE–CIC–IDS2018,and various AI algorithms.A wrapper-based approach,which tends to give superior performance,and min-max normalization methods were used for feature selection and normalization,respectively.Numerous IDS models were implemented using the full and feature-selected copies of the datasets with and without normalization.The models were evaluated using popular evaluation metrics in IDS modeling,intra-and inter-model comparisons were performed between models and with state-of-the-art works.Random forest(RF)models performed better on NSL-KDD and UNSW-NB15 datasets with accuracies of 99.86%and 96.01%,respectively,whereas artificial neural network(ANN)achieved the best accuracy of 95.43%on the CSE–CIC–IDS2018 dataset.The RF models also achieved an excellent performance compared to recent works.The results show that normalization and feature selection positively affect IDS modeling.Furthermore,while feature selection benefits simpler algorithms(such as RF),normalization is more useful for complex algorithms like ANNs and deep neural networks(DNNs),and algorithms such as Naive Bayes are unsuitable for IDS modeling.The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDS than the NSL-KDD dataset.Our findings suggest that prioritizing robust algorithms like RF,alongside complex models such as ANN and DNN,can significantly enhance IDS performance.These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.
文摘Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software defect prediction can be effectively performed using traditional features,but there are some redundant or irrelevant features in them(the presence or absence of this feature has little effect on the prediction results).These problems can be solved using feature selection.However,existing feature selection methods have shortcomings such as insignificant dimensionality reduction effect and low classification accuracy of the selected optimal feature subset.In order to reduce the impact of these shortcomings,this paper proposes a new feature selection method Cubic TraverseMa Beluga whale optimization algorithm(CTMBWO)based on the improved Beluga whale optimization algorithm(BWO).The goal of this study is to determine how well the CTMBWO can extract the features that are most important for correctly predicting software defects,improve the accuracy of fault prediction,reduce the number of the selected feature and mitigate the risk of overfitting,thereby achieving more efficient resource utilization and better distribution of test workload.The CTMBWO comprises three main stages:preprocessing the dataset,selecting relevant features,and evaluating the classification performance of the model.The novel feature selection method can effectively improve the performance of SDP.This study performs experiments on two software defect datasets(PROMISE,NASA)and shows the method’s classification performance using four detailed evaluation metrics,Accuracy,F1-score,MCC,AUC and Recall.The results indicate that the approach presented in this paper achieves outstanding classification performance on both datasets and has significant improvement over the baseline models.
基金the Deanship of Scientifc Research at King Khalid University for funding this work through large group Research Project under grant number RGP2/421/45supported via funding from Prince Sattam bin Abdulaziz University project number(PSAU/2024/R/1446)+1 种基金supported by theResearchers Supporting Project Number(UM-DSR-IG-2023-07)Almaarefa University,Riyadh,Saudi Arabia.supported by the Basic Science Research Program through the National Research Foundation of Korea(NRF)funded by the Ministry of Education(No.2021R1F1A1055408).
文摘Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify specific flaws/diseases for diagnosis.The primary concern of ML applications is the precise selection of flexible image features for pattern detection and region classification.Most of the extracted image features are irrelevant and lead to an increase in computation time.Therefore,this article uses an analytical learning paradigm to design a Congruent Feature Selection Method to select the most relevant image features.This process trains the learning paradigm using similarity and correlation-based features over different textural intensities and pixel distributions.The similarity between the pixels over the various distribution patterns with high indexes is recommended for disease diagnosis.Later,the correlation based on intensity and distribution is analyzed to improve the feature selection congruency.Therefore,the more congruent pixels are sorted in the descending order of the selection,which identifies better regions than the distribution.Now,the learning paradigm is trained using intensity and region-based similarity to maximize the chances of selection.Therefore,the probability of feature selection,regardless of the textures and medical image patterns,is improved.This process enhances the performance of ML applications for different medical image processing.The proposed method improves the accuracy,precision,and training rate by 13.19%,10.69%,and 11.06%,respectively,compared to other models for the selected dataset.The mean error and selection time is also reduced by 12.56%and 13.56%,respectively,compared to the same models and dataset.
文摘This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—covering momentum,volatility,volume,and trend-related technical indicators—are subjected to three distinct feature selection approaches.Specifically,mutual information(MI),recursive feature elimination(RFE),and random forest importance(RFI).By extracting an optimal set of 20 predictors,the proposed framework aims to mitigate redundancy and overfitting while enhancing interpretability.These feature subsets are integrated into support vector regression(SVR),Huber regressors,and k-nearest neighbors(KNN)models to forecast the prices of three leading cryptocurrencies—Bitcoin(BTC/USDT),Ethereum(ETH/USDT),and Binance Coin(BNB/USDT)—across horizons ranging from 1 to 20 days.Model evaluation employs the coefficient of determination(R2)and the root mean squared logarithmic error(RMSLE),alongside a walk-forward validation scheme to approximate real-world trading contexts.Empirical results indicate that incorporating momentum and volatility measures substantially improves predictive accuracy,with particularly pronounced effects observed at longer forecast windows.Moreover,indicators related to volume and trend provide incremental benefits in select market conditions.Notably,an 80%–85% reduction in the original feature set frequently maintains or enhances model performance relative to the complete indicator set.These findings highlight the critical role of targeted feature selection in addressing high-dimensional financial data challenges while preserving model robustness.This research advances the field of cryptocurrency forecasting by offering a rigorous comparison of feature selection methods and their effects on multiple digital assets and prediction horizons.The outcomes highlight the importance of dimension-reduction strategies in developing more efficient and resilient forecasting algorithms.Future efforts should incorporate high-frequency data and explore alternative selection techniques to further refine predictive accuracy in this highly volatile domain.
文摘The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure to sophisticated cyber threats.The diversity of devices,high data volumes,and real-time operational demands complicate security,requiring not just robust intrusion detection but also effective feature selection for relevance and scalability.Traditional Machine Learning(ML)based Intrusion Detection System(IDS)improves detection but often lacks interpretability,limiting stakeholder trust and timely responses.Moreover,centralized feature selection in conventional IDS compromises data privacy and fails to accommodate the decentralized nature of smart city infrastructures.To address these limitations,this research introduces an Interpretable Federated Learning(FL)based Cyber Intrusion Detection model tailored for smart city applications.The proposed system leverages privacy-preserving feature selection,where each client node independently identifies top-ranked features using ML models integrated with SHAP-based explainability.These local feature subsets are then aggregated at a central server to construct a global model without compromising sensitive data.Furthermore,the global model is enhanced with Explainable AI(XAI)techniques such as SHAP and LIME,offering both global interpretability and instance-level transparency for cyber threat decisions.Experimental results demonstrate that the proposed global model achieves a high detection accuracy of 98.51%,with a significantly low miss rate of 1.49%,outperforming existing models while ensuring explainability,privacy,and scalability across smart city infrastructures.