The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more e...The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more efficient and reliable intrusion detection systems(IDSs).However,the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs.Many researchers used data preprocessing techniques such as feature selection and normalization to overcome such issues.While most of these researchers reported the success of these preprocessing techniques on a shallow level,very few studies have been performed on their effects on a wider scale.Furthermore,the performance of an IDS model is subject to not only the utilized preprocessing techniques but also the dataset and the ML/DL algorithm used,which most of the existing studies give little emphasis on.Thus,this study provides an in-depth analysis of feature selection and normalization effects on IDS models built using three IDS datasets:NSL-KDD,UNSW-NB15,and CSE–CIC–IDS2018,and various AI algorithms.A wrapper-based approach,which tends to give superior performance,and min-max normalization methods were used for feature selection and normalization,respectively.Numerous IDS models were implemented using the full and feature-selected copies of the datasets with and without normalization.The models were evaluated using popular evaluation metrics in IDS modeling,intra-and inter-model comparisons were performed between models and with state-of-the-art works.Random forest(RF)models performed better on NSL-KDD and UNSW-NB15 datasets with accuracies of 99.86%and 96.01%,respectively,whereas artificial neural network(ANN)achieved the best accuracy of 95.43%on the CSE–CIC–IDS2018 dataset.The RF models also achieved an excellent performance compared to recent works.The results show that normalization and feature selection positively affect IDS modeling.Furthermore,while feature selection benefits simpler algorithms(such as RF),normalization is more useful for complex algorithms like ANNs and deep neural networks(DNNs),and algorithms such as Naive Bayes are unsuitable for IDS modeling.The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDS than the NSL-KDD dataset.Our findings suggest that prioritizing robust algorithms like RF,alongside complex models such as ANN and DNN,can significantly enhance IDS performance.These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.展开更多
Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify sp...Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify specific flaws/diseases for diagnosis.The primary concern of ML applications is the precise selection of flexible image features for pattern detection and region classification.Most of the extracted image features are irrelevant and lead to an increase in computation time.Therefore,this article uses an analytical learning paradigm to design a Congruent Feature Selection Method to select the most relevant image features.This process trains the learning paradigm using similarity and correlation-based features over different textural intensities and pixel distributions.The similarity between the pixels over the various distribution patterns with high indexes is recommended for disease diagnosis.Later,the correlation based on intensity and distribution is analyzed to improve the feature selection congruency.Therefore,the more congruent pixels are sorted in the descending order of the selection,which identifies better regions than the distribution.Now,the learning paradigm is trained using intensity and region-based similarity to maximize the chances of selection.Therefore,the probability of feature selection,regardless of the textures and medical image patterns,is improved.This process enhances the performance of ML applications for different medical image processing.The proposed method improves the accuracy,precision,and training rate by 13.19%,10.69%,and 11.06%,respectively,compared to other models for the selected dataset.The mean error and selection time is also reduced by 12.56%and 13.56%,respectively,compared to the same models and dataset.展开更多
The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduce...The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduces significant vulnerabilities,including fraud,money laundering,and market manipulation.Traditional anomaly detection techniques often fail to capture the relational and dynamic characteristics of financial data.Graph Neural Networks(GNNs),capable of modeling intricate interdependencies among entities,have emerged as a powerful framework for detecting subtle and sophisticated anomalies.However,the high-dimensionality and inherent noise of FinTech datasets demand robust feature selection strategies to improve model scalability,performance,and interpretability.This paper presents a comprehensive survey of GNN-based approaches for anomaly detection in FinTech,with an emphasis on the synergistic role of feature selection.We examine the theoretical foundations of GNNs,review state-of-the-art feature selection techniques,analyze their integration with GNNs,and categorize prevalent anomaly types in FinTech applications.In addition,we discuss practical implementation challenges,highlight representative case studies,and propose future research directions to advance the field of graph-based anomaly detection in financial systems.展开更多
This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—cove...This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—covering momentum,volatility,volume,and trend-related technical indicators—are subjected to three distinct feature selection approaches.Specifically,mutual information(MI),recursive feature elimination(RFE),and random forest importance(RFI).By extracting an optimal set of 20 predictors,the proposed framework aims to mitigate redundancy and overfitting while enhancing interpretability.These feature subsets are integrated into support vector regression(SVR),Huber regressors,and k-nearest neighbors(KNN)models to forecast the prices of three leading cryptocurrencies—Bitcoin(BTC/USDT),Ethereum(ETH/USDT),and Binance Coin(BNB/USDT)—across horizons ranging from 1 to 20 days.Model evaluation employs the coefficient of determination(R2)and the root mean squared logarithmic error(RMSLE),alongside a walk-forward validation scheme to approximate real-world trading contexts.Empirical results indicate that incorporating momentum and volatility measures substantially improves predictive accuracy,with particularly pronounced effects observed at longer forecast windows.Moreover,indicators related to volume and trend provide incremental benefits in select market conditions.Notably,an 80%–85% reduction in the original feature set frequently maintains or enhances model performance relative to the complete indicator set.These findings highlight the critical role of targeted feature selection in addressing high-dimensional financial data challenges while preserving model robustness.This research advances the field of cryptocurrency forecasting by offering a rigorous comparison of feature selection methods and their effects on multiple digital assets and prediction horizons.The outcomes highlight the importance of dimension-reduction strategies in developing more efficient and resilient forecasting algorithms.Future efforts should incorporate high-frequency data and explore alternative selection techniques to further refine predictive accuracy in this highly volatile domain.展开更多
Heart disease prediction is a critical issue in healthcare,where accurate early diagnosis can save lives and reduce healthcare costs.The problem is inherently complex due to the high dimensionality of medical data,irr...Heart disease prediction is a critical issue in healthcare,where accurate early diagnosis can save lives and reduce healthcare costs.The problem is inherently complex due to the high dimensionality of medical data,irrelevant or redundant features,and the variability in risk factors such as age,lifestyle,andmedical history.These challenges often lead to inefficient and less accuratemodels.Traditional predictionmethodologies face limitations in effectively handling large feature sets and optimizing classification performance,which can result in overfitting poor generalization,and high computational cost.This work proposes a novel classification model for heart disease prediction that addresses these challenges by integrating feature selection through a Genetic Algorithm(GA)with an ensemble deep learning approach optimized using the Tunicate Swarm Algorithm(TSA).GA selects the most relevant features,reducing dimensionality and improvingmodel efficiency.Theselected features are then used to train an ensemble of deep learning models,where the TSA optimizes the weight of each model in the ensemble to enhance prediction accuracy.This hybrid approach addresses key challenges in the field,such as high dimensionality,redundant features,and classification performance,by introducing an efficient feature selection mechanism and optimizing the weighting of deep learning models in the ensemble.These enhancements result in a model that achieves superior accuracy,generalization,and efficiency compared to traditional methods.The proposed model demonstrated notable advancements in both prediction accuracy and computational efficiency over traditionalmodels.Specifically,it achieved an accuracy of 97.5%,a sensitivity of 97.2%,and a specificity of 97.8%.Additionally,with a 60-40 data split and 5-fold cross-validation,the model showed a significant reduction in training time(90 s),memory consumption(950 MB),and CPU usage(80%),highlighting its effectiveness in processing large,complex medical datasets for heart disease prediction.展开更多
The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure t...The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure to sophisticated cyber threats.The diversity of devices,high data volumes,and real-time operational demands complicate security,requiring not just robust intrusion detection but also effective feature selection for relevance and scalability.Traditional Machine Learning(ML)based Intrusion Detection System(IDS)improves detection but often lacks interpretability,limiting stakeholder trust and timely responses.Moreover,centralized feature selection in conventional IDS compromises data privacy and fails to accommodate the decentralized nature of smart city infrastructures.To address these limitations,this research introduces an Interpretable Federated Learning(FL)based Cyber Intrusion Detection model tailored for smart city applications.The proposed system leverages privacy-preserving feature selection,where each client node independently identifies top-ranked features using ML models integrated with SHAP-based explainability.These local feature subsets are then aggregated at a central server to construct a global model without compromising sensitive data.Furthermore,the global model is enhanced with Explainable AI(XAI)techniques such as SHAP and LIME,offering both global interpretability and instance-level transparency for cyber threat decisions.Experimental results demonstrate that the proposed global model achieves a high detection accuracy of 98.51%,with a significantly low miss rate of 1.49%,outperforming existing models while ensuring explainability,privacy,and scalability across smart city infrastructures.展开更多
This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection,unsupervised clustering,and ensemble learning to improve classification performance in financial transaction m...This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection,unsupervised clustering,and ensemble learning to improve classification performance in financial transaction monitoring systems.The framework is structured into three core layers:(1)feature selection using Recursive Feature Elimination(RFE),Principal Component Analysis(PCA),and Mutual Information(MI)to reduce dimensionality and enhance input relevance;(2)anomaly detection through unsupervised clustering using K-Means,Density-Based Spatial Clustering(DBSCAN),and Hierarchical Clustering to flag suspicious patterns in unlabeled data;and(3)final classification using a voting-based hybrid ensemble of Support Vector Machine(SVM),Random Forest(RF),and Gradient Boosting Classifier(GBC).The experimental evaluation is conducted on a synthetically generated dataset comprising one million financial transactions,with 5% labelled as fraudulent,simulating realistic fraud rates and behavioural features,including transaction time,origin,amount,and geo-location.The proposed model demonstrated a significant improvement over baseline classifiers,achieving an accuracy of 99%,a precision of 99%,a recall of 97%,and an F1-score of 99%.Compared to individual models,it yielded a 9% gain in overall detection accuracy.It reduced the false positive rate to below 3.5%,thereby minimising the operational costs associated with manually reviewing false alerts.The model’s interpretability is enhanced by the integration of Shapley Additive Explanations(SHAP)values for feature importance,supporting transparency and regulatory auditability.These results affirm the practical relevance of the proposed system for deployment in real-time fraud detection scenarios such as credit card transactions,mobile banking,and cross-border payments.The study also highlights future directions,including the deployment of lightweight models and the integration of multimodal data for scalable fraud analytics.展开更多
Recent advancements in computational and database technologies have led to the exponential growth of large-scale medical datasets,significantly increasing data complexity and dimensionality in medical diagnostics.Effi...Recent advancements in computational and database technologies have led to the exponential growth of large-scale medical datasets,significantly increasing data complexity and dimensionality in medical diagnostics.Efficient feature selection methods are critical for improving diagnostic accuracy,reducing computational costs,and enhancing the interpretability of predictive models.Particle Swarm Optimization(PSO),a widely used metaheuristic inspired by swarm intelligence,has shown considerable promise in feature selection tasks.However,conventional PSO often suffers from premature convergence and limited exploration capabilities,particularly in high-dimensional spaces.To overcome these limitations,this study proposes an enhanced PSO framework incorporating Orthogonal Initializa-tion and a Crossover Operator(OrPSOC).Orthogonal Initialization ensures a diverse and uniformly distributed initial particle population,substantially improving the algorithm’s exploration capability.The Crossover Operator,inspired by genetic algorithms,introduces additional diversity during the search process,effectively mitigating premature convergence and enhancing global search performance.The effectiveness of OrPSOC was rigorously evaluated on three benchmark medical datasets—Colon,Leukemia,and Prostate Tumor.Comparative analyses were conducted against traditional filter-based methods,including Fast Clustering-Based Feature Selection Technique(Fast-C),Minimum Redundancy Maximum Relevance(MinRedMaxRel),and Five-Way Joint Mutual Information(FJMI),as well as prominent metaheuristic algorithms such as standard PSO,Ant Colony Optimization(ACO),Comprehensive Learning Gravitational Search Algorithm(CLGSA),and Fuzzy-Based CLGSA(FCLGSA).Experimental results demonstrated that OrPSOC consistently outperformed these existing methods in terms of classification accuracy,computational efficiency,and result stability,achieving significant improvements even with fewer selected features.Additionally,a sensitivity analysis of the crossover parameter provided valuable insights into parameter tuning and its impact on model performance.These findings highlight the superiority and robustness of the proposed OrPSOC approach for feature selection in medical diagnostic applications and underscore its potential for broader adoption in various high-dimensional,data-driven fields.展开更多
Dementias such as Alzheimer disease(AD)and mild cognitive impairment(MCI)lead to problems with memory,language,and daily activities resulting from damage to neurons in the brain.Given the irreversibility of this neuro...Dementias such as Alzheimer disease(AD)and mild cognitive impairment(MCI)lead to problems with memory,language,and daily activities resulting from damage to neurons in the brain.Given the irreversibility of this neuronal damage,it is crucial to find a biomarker to distinguish individuals with these diseases from healthy people.In this study,we construct a brain function network based on electroencephalography data to study changes in AD and MCI patients.Using a graph-theoretical approach,we examine connectivity features and explore their contributions to dementia recognition at edge,node,and network levels.We find that connectivity is reduced in AD and MCI patients compared with healthy controls.We also find that the edge-level features give the best performance when machine learning models are used to recognize dementia.The results of feature selection identify the top 50 ranked edge-level features constituting an optimal subset,which is mainly connected with the frontal nodes.A threshold analysis reveals that the performance of edge-level features is more sensitive to the threshold for the connection strength than that of node-and network-level features.In addition,edge-level features with a threshold of 0 provide the most effective dementia recognition.The K-nearest neighbors(KNN)machine learning model achieves the highest accuracy of 0.978 with the optimal subset when the threshold is 0.Visualization of edge-level features suggests that there are more long connections linking the frontal region with the occipital and parietal regions in AD and MCI patients compared with healthy controls.Our codes are publicly available at https://github.com/Debbie-85/eeg-connectivity.展开更多
The complex compositions of high-entropy alloys(HEAs)enable a variety of phase structures like FCC single phase,BCC single phase,or duplex FCC+BCC phase.Accurate and efficient prediction of phase structure is crucial ...The complex compositions of high-entropy alloys(HEAs)enable a variety of phase structures like FCC single phase,BCC single phase,or duplex FCC+BCC phase.Accurate and efficient prediction of phase structure is crucial for accelerating the discovery of new components and designing HEAs with desired phase structure.In this work,five machine learning strategies were utilized to predict the phase structures of HEAs with a dataset of 296.Specifically,a two-step feature selection strategy was proposed,enabling pronounced improvement in the computational efficiency from 2047 to 12 iterations for each model while ensuring fewer input features and higher prediction accuracy.Compared with traditional valence electron concentration criterion,the prediction accuracy of collected dataset was highly improved from 0.79 to 0.98 for random forest.Furthermore,HEAs with compositions of Al_(x)CoCu_(6)Ni_(6)Fe_(6)(x=1,3,6)were developed to validate the prediction results of machine learning models,and the mechanical properties as well as corrosion resistance were investigated.It is found that the higher Al content enhances the yield strength but deteriorates corrosion resistance.The present two-step feature selection strategy provides an alternative method that is feasible for predicting the phase structure of HEAs with high efficiency and accuracy.展开更多
Metaheuristic optimization methods are iterative search processes that aim to efficiently solve complexoptimization problems. These basically find the solution space very efficiently, often without utilizing the gradi...Metaheuristic optimization methods are iterative search processes that aim to efficiently solve complexoptimization problems. These basically find the solution space very efficiently, often without utilizing the gradientinformation, and are inspired by the bio-inspired and socially motivated heuristics. Metaheuristic optimizationalgorithms are increasingly applied to complex feature selection problems in high-dimensional medical datasets.Among these, Teaching-Learning-Based optimization (TLBO) has proven effective for continuous design tasks bybalancing exploration and exploitation phases. However, its binary version (BTLBO) suffers from limited exploitationability, often converging prematurely or getting trapped in local optima, particularly when applied to discrete featureselection tasks. Previous studies reported that BTLBO yields lower classification accuracy and higher feature subsetvariance compared to other hybrid methods in benchmark tests, motivating the development of hybrid approaches.This study proposes a novel hybrid algorithm, BTLBO-Cheetah Optimizer (BTLBO-CO), which integrates the globalexploration strength of BTLBO with the local exploitation efficiency of the Cheetah Optimization (CO) algorithm. Theobjective is to enhance the feature selection process for cancer classification tasks involving high-dimensional data. Theproposed BTLBO-CO algorithm was evaluated on six benchmark cancer datasets: 11 tumors (T), Lung Cancer (LUC),Leukemia (LEU), Small Round Blue Cell Tumor or SRBCT (SR), Diffuse Large B-cell Lymphoma or DLBCL (DL), andProstate Tumor (PT).The results demonstrate superior classification accuracy across all six datasets, achieving 93.71%,96.12%, 98.13%, 97.11%, 98.44%, and 98.84%, respectively.These results validate the effectiveness of the hybrid approachin addressing diverse feature selection challenges using a Support Vector Machine (SVM) classifier.展开更多
Feature selection methods rooted in rough sets confront two notable limitations:their high computa-tional complexity and sensitivity to noise,rendering them impractical for managing large-scale and noisy datasets.The ...Feature selection methods rooted in rough sets confront two notable limitations:their high computa-tional complexity and sensitivity to noise,rendering them impractical for managing large-scale and noisy datasets.The primary issue stems from these methods’undue reliance on all samples.To overcome these challenges,we introduce the concept of cross-similarity grounded in a robust fuzzy relation and design a rapid and robust feature selection algorithm.Firstly,we construct a robust fuzzy relation by introducing a truncation parameter.Then,based on this fuzzy relation,we propose the concept of cross-similarity,which emphasizes the sample-to-sample similarity relations that uniquely determine feature importance,rather than considering all such relations equally.After studying the manifestations and properties of cross-similarity across different fuzzy granularities,we propose a forward greedy feature selection algorithm that leverages cross-similarity as the foundation for information measurement.This algorithm significantly reduces the time complexity from O(m2n2)to O(mn2).Experimental findings reveal that the average runtime of five state-of-the-art comparison algorithms is roughly 3.7 times longer than our algorithm,while our algorithm achieves an average accuracy that surpasses those of the five comparison algorithms by approximately 3.52%.This underscores the effectiveness of our approach.This paper paves the way for applying feature selection algorithms grounded in fuzzy rough sets to large-scale gene datasets.展开更多
In the evolving landscape of cyber threats,phishing attacks pose significant challenges,particularly through deceptive webpages designed to extract sensitive information under the guise of legitimacy.Conventional and ...In the evolving landscape of cyber threats,phishing attacks pose significant challenges,particularly through deceptive webpages designed to extract sensitive information under the guise of legitimacy.Conventional and machine learning(ML)-based detection systems struggle to detect phishing websites owing to their constantly changing tactics.Furthermore,newer phishing websites exhibit subtle and expertly concealed indicators that are not readily detectable.Hence,effective detection depends on identifying the most critical features.Traditional feature selection(FS)methods often struggle to enhance ML model performance and instead decrease it.To combat these issues,we propose an innovative method using explainable AI(XAI)to enhance FS in ML models and improve the identification of phishing websites.Specifically,we employ SHapley Additive exPlanations(SHAP)for global perspective and aggregated local interpretable model-agnostic explanations(LIME)to deter-mine specific localized patterns.The proposed SHAP and LIME-aggregated FS(SLA-FS)framework pinpoints the most informative features,enabling more precise,swift,and adaptable phishing detection.Applying this approach to an up-to-date web phishing dataset,we evaluate the performance of three ML models before and after FS to assess their effectiveness.Our findings reveal that random forest(RF),with an accuracy of 97.41%and XGBoost(XGB)at 97.21%significantly benefit from the SLA-FS framework,while k-nearest neighbors lags.Our framework increases the accuracy of RF and XGB by 0.65%and 0.41%,respectively,outperforming traditional filter or wrapper methods and any prior methods evaluated on this dataset,showcasing its potential.展开更多
Today,phishing is an online attack designed to obtain sensitive information such as credit card and bank account numbers,passwords,and usernames.We can find several anti-phishing solutions,such as heuristic detection,...Today,phishing is an online attack designed to obtain sensitive information such as credit card and bank account numbers,passwords,and usernames.We can find several anti-phishing solutions,such as heuristic detection,virtual similarity detection,black and white lists,and machine learning(ML).However,phishing attempts remain a problem,and establishing an effective anti-phishing strategy is a work in progress.Furthermore,while most antiphishing solutions achieve the highest levels of accuracy on a given dataset,their methods suffer from an increased number of false positives.These methods are ineffective against zero-hour attacks.Phishing sites with a high False Positive Rate(FPR)are considered genuine because they can cause people to lose a lot ofmoney by visiting them.Feature selection is critical when developing phishing detection strategies.Good feature selection helps improve accuracy;however,duplicate features can also increase noise in the dataset and reduce the accuracy of the algorithm.Therefore,a combination of filter-based feature selection methods is proposed to detect phishing attacks,including constant feature removal,duplicate feature removal,quasi-feature removal,correlated feature removal,mutual information extraction,and Analysis of Variance(ANOVA)testing.The technique has been tested with differentMachine Learning classifiers:Random Forest,Artificial Neural Network(ANN),Ada-Boost,Extreme Gradient Boosting(XGBoost),Logistic Regression,Decision Trees,Gradient Boosting Classifiers,Support Vector Machine(SVM),and two types of ensemble models,stacking and majority voting to gain A low false positive rate is achieved.Stacked ensemble classifiers(gradient boosting,randomforest,support vector machine)achieve 1.31%FPR and 98.17%accuracy on Dataset 1,2.81%FPR and Dataset 3 shows 2.81%FPR and 97.61%accuracy,while Dataset 2 shows 3.47%FPR and 96.47%accuracy.展开更多
Lithium-ion batteries are essential for renewable energy storage,necessitating efficient battery management systems(BMS)for optimal performance and longevity.Accurate estimation of the state of health(SOH)is crucial f...Lithium-ion batteries are essential for renewable energy storage,necessitating efficient battery management systems(BMS)for optimal performance and longevity.Accurate estimation of the state of health(SOH)is crucial for BMS safety,yet current machine learning-based SOH estimation relying on global aging features often overlooks localized degradation patterns.In this study,we introduce a novel SOH estimation pipeline that integrates voltage-range-specific segmentation with a multi-stage,crossvalidation-driven localized feature-selection framework and a feature-augmented dual-stream fusion network.Our methodology partitions full-range voltage into localized intervals to construct a degradation-sensitive feature library,from which 4 optimal features are identified from a set of 336 candidates.These selected features are combined with raw voltage signals via a dual-stream architecture that employs a dynamic gating mechanism to recalibrate feature contributions during training.Crossvalidation-based evaluation on datasets encompassing different chemistries and charge/discharge protocols demonstrate that our approach can achieve lower average root-mean-square-error(Oxford dataset:0.7201%,Massachusetts Institute of Technology(MIT)dataset:0.7184%)compared to baseline models.An in-depth analysis of the physical significance of the screened features improves the interpretability of the features.This work underscores the significant potential of leveraging localized feature enhancement in SOH estimation by systematically integrating degradation-sensitive features,thereby offering precise estimation.展开更多
Selecting proper descriptors(also known feature selection,FS)is key in the process of establishing mechanical properties prediction model of hot-rolled microalloyed steels by using machine learning(ML)algorithm.FS met...Selecting proper descriptors(also known feature selection,FS)is key in the process of establishing mechanical properties prediction model of hot-rolled microalloyed steels by using machine learning(ML)algorithm.FS methods based on data-driving can reduce the redundancy of data features and improve the prediction accuracy of mechanical properties.Based on the collected data of hot-rolled microalloyed steels,the association rules are used to mine the correlation information between the data.High-quality feature subsets are selected by the proposed FS method(FS method based on genetic algorithm embedding,GAMIC).Compared with the common FS method,it is shown on dataset that GAMIC selects feature subsets more appropriately.Six different ML algorithms are trained and tested for mechanical properties prediction.The result shows that the root-mean-square error of yield strength,tensile strength and elongation based on limit gradient enhancement(XGBoost)algorithm is 21.95 MPa,20.85 MPa and 1.96%,the correlation coefficient(R^(2))is 0.969,0.968 and 0.830,and the mean absolute error is 16.84 MPa,15.83 MPa and 1.48%,respectively,showing the best prediction performance.Finally,SHapley Additive exPlanation is used to further explore the influence of feature variables on mechanical properties.GAMIC feature selection method proposed is universal,which provides a basis for the development of high-precision mechanical property prediction model.展开更多
In recent years, particle swarm optimization (PSO) has received widespread attention in feature selection due to its simplicity and potential for global search. However, in traditional PSO, particles primarily update ...In recent years, particle swarm optimization (PSO) has received widespread attention in feature selection due to its simplicity and potential for global search. However, in traditional PSO, particles primarily update based on two extreme values: personal best and global best, which limits the diversity of information. Ideally, particles should learn from multiple advantageous particles to enhance interactivity and optimization efficiency. Accordingly, this paper proposes a PSO that simulates the evolutionary dynamics of species survival in mountain peak ecology (PEPSO) for feature selection. Based on the pyramid topology, the algorithm simulates the features of mountain peak ecology in nature and the competitive-cooperative strategies among species. According to the principles of the algorithm, the population is first adaptively divided into many subgroups based on the fitness level of particles. Then, particles within each subgroup are divided into three different types based on their evolutionary levels, employing different adaptive inertia weight rules and dynamic learning mechanisms to define distinct learning modes. Consequently, all particles play their respective roles in promoting the global optimization performance of the algorithm, similar to different species in the ecological pattern of mountain peaks. Experimental validation of the PEPSO performance was conducted on 18 public datasets. The experimental results demonstrate that the PEPSO outperforms other PSO variant-based feature selection methods and mainstream feature selection methods based on intelligent optimization algorithms in terms of overall performance in global search capability, classification accuracy, and reduction of feature space dimensions. Wilcoxon signed-rank test also confirms the excellent performance of the PEPSO.展开更多
Object detection plays a critical role in drone imagery analysis,especially in remote sensing applications where accurate and efficient detection of small objects is essential.Despite significant advancements in drone...Object detection plays a critical role in drone imagery analysis,especially in remote sensing applications where accurate and efficient detection of small objects is essential.Despite significant advancements in drone imagery detection,most models still struggle with small object detection due to challenges such as object size,complex backgrounds.To address these issues,we propose a robust detection model based on You Only Look Once(YOLO)that balances accuracy and efficiency.The model mainly contains several major innovation:feature selection pyramid network,Inner-Shape Intersection over Union(ISIoU)loss function and small object detection head.To overcome the limitations of traditional fusion methods in handling multi-level features,we introduce a Feature Selection Pyramid Network integrated into the Neck component,which preserves shallow feature details critical for detecting small objects.Additionally,recognizing that deep network structures often neglect or degrade small object features,we design a specialized small object detection head in the shallow layers to enhance detection accuracy for these challenging targets.To effectively model both local and global dependencies,we introduce a Conv-Former module that simulates Transformer mechanisms using a convolutional structure,thereby improving feature enhancement.Furthermore,we employ ISIoU to address object imbalance and scale variation This approach accelerates model conver-gence and improves regression accuracy.Experimental results show that,compared to the baseline model,the proposed method significantly improves small object detection performance on the VisDrone2019 dataset,with mAP@50 increasing by 4.9%and mAP@50-95 rising by 6.7%.This model also outperforms other state-of-the-art algorithms,demonstrating its reliability and effectiveness in both small object detection and remote sensing image fusion tasks.展开更多
Feature selection(FS)is essential in machine learning(ML)and data mapping by its ability to preprocess high-dimensional data.By selecting a subset of relevant features,feature selection cuts down on the dimension of t...Feature selection(FS)is essential in machine learning(ML)and data mapping by its ability to preprocess high-dimensional data.By selecting a subset of relevant features,feature selection cuts down on the dimension of the data.It excludes irrelevant or surplus features,thus boosting the performance and efficiency of the model.Particle Swarm Optimization(PSO)boasts a streamlined algorithmic framework and exhibits rapid convergence traits.Compared with other algorithms,it incurs reduced computational expenses when tackling high-dimensional datasets.However,PSO faces challenges like inadequate convergence precision.Therefore,regarding FS problems,this paper presents a binary version enhanced PSO based on the Support Vector Machines(SVM)classifier.First,the Sand Cat Swarm Optimization(SCSO)is added to enhance the global search capability of PSO and improve the accuracy of the solution.Secondly,the Latin hypercube sampling strategy initializes populations more uniformly and helps to increase population diversity.The last is the roundup search strategy introducing the grey wolf hierarchy idea to help improve convergence speed.To verify the capability of Self-adaptive Cooperative Particle Swarm Optimization(SCPSO),the CEC2020 test suite and CEC2022 test suite are selected for experiments and applied to three engineering problems.Compared with the standard PSO algorithm,SCPSO converges faster,and the convergence accuracy is significantly improved.Moreover,SCPSO’s comprehensive performance far exceeds that of other algorithms.Six datasets from the University of California,Irvine(UCI)database were selected to evaluate SCPSO’s effectiveness in solving feature selection problems.The results indicate that SCPSO has significant potential for addressing these problems.展开更多
Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic...Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic poses distinct challenges due to the language’s complex morphology,diglossia,and the scarcity of annotated datasets.This paper presents a hybrid approach to Arabic AES by combining text-based,vector-based,and embeddingbased similarity measures to improve essay scoring accuracy while minimizing the training data required.Using a large Arabic essay dataset categorized into thematic groups,the study conducted four experiments to evaluate the impact of feature selection,data size,and model performance.Experiment 1 established a baseline using a non-machine learning approach,selecting top-N correlated features to predict essay scores.The subsequent experiments employed 5-fold cross-validation.Experiment 2 showed that combining embedding-based,text-based,and vector-based features in a Random Forest(RF)model achieved an R2 of 88.92%and an accuracy of 83.3%within a 0.5-point tolerance.Experiment 3 further refined the feature selection process,demonstrating that 19 correlated features yielded optimal results,improving R2 to 88.95%.In Experiment 4,an optimal data efficiency training approach was introduced,where training data portions increased from 5%to 50%.The study found that using just 10%of the data achieved near-peak performance,with an R2 of 85.49%,emphasizing an effective trade-off between performance and computational costs.These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems,especially in low-resource environments,addressing linguistic challenges while ensuring efficient data usage.展开更多
文摘The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more efficient and reliable intrusion detection systems(IDSs).However,the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs.Many researchers used data preprocessing techniques such as feature selection and normalization to overcome such issues.While most of these researchers reported the success of these preprocessing techniques on a shallow level,very few studies have been performed on their effects on a wider scale.Furthermore,the performance of an IDS model is subject to not only the utilized preprocessing techniques but also the dataset and the ML/DL algorithm used,which most of the existing studies give little emphasis on.Thus,this study provides an in-depth analysis of feature selection and normalization effects on IDS models built using three IDS datasets:NSL-KDD,UNSW-NB15,and CSE–CIC–IDS2018,and various AI algorithms.A wrapper-based approach,which tends to give superior performance,and min-max normalization methods were used for feature selection and normalization,respectively.Numerous IDS models were implemented using the full and feature-selected copies of the datasets with and without normalization.The models were evaluated using popular evaluation metrics in IDS modeling,intra-and inter-model comparisons were performed between models and with state-of-the-art works.Random forest(RF)models performed better on NSL-KDD and UNSW-NB15 datasets with accuracies of 99.86%and 96.01%,respectively,whereas artificial neural network(ANN)achieved the best accuracy of 95.43%on the CSE–CIC–IDS2018 dataset.The RF models also achieved an excellent performance compared to recent works.The results show that normalization and feature selection positively affect IDS modeling.Furthermore,while feature selection benefits simpler algorithms(such as RF),normalization is more useful for complex algorithms like ANNs and deep neural networks(DNNs),and algorithms such as Naive Bayes are unsuitable for IDS modeling.The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDS than the NSL-KDD dataset.Our findings suggest that prioritizing robust algorithms like RF,alongside complex models such as ANN and DNN,can significantly enhance IDS performance.These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.
基金the Deanship of Scientifc Research at King Khalid University for funding this work through large group Research Project under grant number RGP2/421/45supported via funding from Prince Sattam bin Abdulaziz University project number(PSAU/2024/R/1446)+1 种基金supported by theResearchers Supporting Project Number(UM-DSR-IG-2023-07)Almaarefa University,Riyadh,Saudi Arabia.supported by the Basic Science Research Program through the National Research Foundation of Korea(NRF)funded by the Ministry of Education(No.2021R1F1A1055408).
文摘Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify specific flaws/diseases for diagnosis.The primary concern of ML applications is the precise selection of flexible image features for pattern detection and region classification.Most of the extracted image features are irrelevant and lead to an increase in computation time.Therefore,this article uses an analytical learning paradigm to design a Congruent Feature Selection Method to select the most relevant image features.This process trains the learning paradigm using similarity and correlation-based features over different textural intensities and pixel distributions.The similarity between the pixels over the various distribution patterns with high indexes is recommended for disease diagnosis.Later,the correlation based on intensity and distribution is analyzed to improve the feature selection congruency.Therefore,the more congruent pixels are sorted in the descending order of the selection,which identifies better regions than the distribution.Now,the learning paradigm is trained using intensity and region-based similarity to maximize the chances of selection.Therefore,the probability of feature selection,regardless of the textures and medical image patterns,is improved.This process enhances the performance of ML applications for different medical image processing.The proposed method improves the accuracy,precision,and training rate by 13.19%,10.69%,and 11.06%,respectively,compared to other models for the selected dataset.The mean error and selection time is also reduced by 12.56%and 13.56%,respectively,compared to the same models and dataset.
基金supported by Ho Chi Minh City Open University,Vietnam under grant number E2024.02.1CD and Suan Sunandha Rajabhat University,Thailand.
文摘The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduces significant vulnerabilities,including fraud,money laundering,and market manipulation.Traditional anomaly detection techniques often fail to capture the relational and dynamic characteristics of financial data.Graph Neural Networks(GNNs),capable of modeling intricate interdependencies among entities,have emerged as a powerful framework for detecting subtle and sophisticated anomalies.However,the high-dimensionality and inherent noise of FinTech datasets demand robust feature selection strategies to improve model scalability,performance,and interpretability.This paper presents a comprehensive survey of GNN-based approaches for anomaly detection in FinTech,with an emphasis on the synergistic role of feature selection.We examine the theoretical foundations of GNNs,review state-of-the-art feature selection techniques,analyze their integration with GNNs,and categorize prevalent anomaly types in FinTech applications.In addition,we discuss practical implementation challenges,highlight representative case studies,and propose future research directions to advance the field of graph-based anomaly detection in financial systems.
文摘This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—covering momentum,volatility,volume,and trend-related technical indicators—are subjected to three distinct feature selection approaches.Specifically,mutual information(MI),recursive feature elimination(RFE),and random forest importance(RFI).By extracting an optimal set of 20 predictors,the proposed framework aims to mitigate redundancy and overfitting while enhancing interpretability.These feature subsets are integrated into support vector regression(SVR),Huber regressors,and k-nearest neighbors(KNN)models to forecast the prices of three leading cryptocurrencies—Bitcoin(BTC/USDT),Ethereum(ETH/USDT),and Binance Coin(BNB/USDT)—across horizons ranging from 1 to 20 days.Model evaluation employs the coefficient of determination(R2)and the root mean squared logarithmic error(RMSLE),alongside a walk-forward validation scheme to approximate real-world trading contexts.Empirical results indicate that incorporating momentum and volatility measures substantially improves predictive accuracy,with particularly pronounced effects observed at longer forecast windows.Moreover,indicators related to volume and trend provide incremental benefits in select market conditions.Notably,an 80%–85% reduction in the original feature set frequently maintains or enhances model performance relative to the complete indicator set.These findings highlight the critical role of targeted feature selection in addressing high-dimensional financial data challenges while preserving model robustness.This research advances the field of cryptocurrency forecasting by offering a rigorous comparison of feature selection methods and their effects on multiple digital assets and prediction horizons.The outcomes highlight the importance of dimension-reduction strategies in developing more efficient and resilient forecasting algorithms.Future efforts should incorporate high-frequency data and explore alternative selection techniques to further refine predictive accuracy in this highly volatile domain.
文摘Heart disease prediction is a critical issue in healthcare,where accurate early diagnosis can save lives and reduce healthcare costs.The problem is inherently complex due to the high dimensionality of medical data,irrelevant or redundant features,and the variability in risk factors such as age,lifestyle,andmedical history.These challenges often lead to inefficient and less accuratemodels.Traditional predictionmethodologies face limitations in effectively handling large feature sets and optimizing classification performance,which can result in overfitting poor generalization,and high computational cost.This work proposes a novel classification model for heart disease prediction that addresses these challenges by integrating feature selection through a Genetic Algorithm(GA)with an ensemble deep learning approach optimized using the Tunicate Swarm Algorithm(TSA).GA selects the most relevant features,reducing dimensionality and improvingmodel efficiency.Theselected features are then used to train an ensemble of deep learning models,where the TSA optimizes the weight of each model in the ensemble to enhance prediction accuracy.This hybrid approach addresses key challenges in the field,such as high dimensionality,redundant features,and classification performance,by introducing an efficient feature selection mechanism and optimizing the weighting of deep learning models in the ensemble.These enhancements result in a model that achieves superior accuracy,generalization,and efficiency compared to traditional methods.The proposed model demonstrated notable advancements in both prediction accuracy and computational efficiency over traditionalmodels.Specifically,it achieved an accuracy of 97.5%,a sensitivity of 97.2%,and a specificity of 97.8%.Additionally,with a 60-40 data split and 5-fold cross-validation,the model showed a significant reduction in training time(90 s),memory consumption(950 MB),and CPU usage(80%),highlighting its effectiveness in processing large,complex medical datasets for heart disease prediction.
文摘The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure to sophisticated cyber threats.The diversity of devices,high data volumes,and real-time operational demands complicate security,requiring not just robust intrusion detection but also effective feature selection for relevance and scalability.Traditional Machine Learning(ML)based Intrusion Detection System(IDS)improves detection but often lacks interpretability,limiting stakeholder trust and timely responses.Moreover,centralized feature selection in conventional IDS compromises data privacy and fails to accommodate the decentralized nature of smart city infrastructures.To address these limitations,this research introduces an Interpretable Federated Learning(FL)based Cyber Intrusion Detection model tailored for smart city applications.The proposed system leverages privacy-preserving feature selection,where each client node independently identifies top-ranked features using ML models integrated with SHAP-based explainability.These local feature subsets are then aggregated at a central server to construct a global model without compromising sensitive data.Furthermore,the global model is enhanced with Explainable AI(XAI)techniques such as SHAP and LIME,offering both global interpretability and instance-level transparency for cyber threat decisions.Experimental results demonstrate that the proposed global model achieves a high detection accuracy of 98.51%,with a significantly low miss rate of 1.49%,outperforming existing models while ensuring explainability,privacy,and scalability across smart city infrastructures.
基金funded by the Deanship of Scientific Research,Vice Presidency for Graduate Studies and Scientific Research,King Faisal University,Saudi Arabia[Grant No.KFU241683].
文摘This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection,unsupervised clustering,and ensemble learning to improve classification performance in financial transaction monitoring systems.The framework is structured into three core layers:(1)feature selection using Recursive Feature Elimination(RFE),Principal Component Analysis(PCA),and Mutual Information(MI)to reduce dimensionality and enhance input relevance;(2)anomaly detection through unsupervised clustering using K-Means,Density-Based Spatial Clustering(DBSCAN),and Hierarchical Clustering to flag suspicious patterns in unlabeled data;and(3)final classification using a voting-based hybrid ensemble of Support Vector Machine(SVM),Random Forest(RF),and Gradient Boosting Classifier(GBC).The experimental evaluation is conducted on a synthetically generated dataset comprising one million financial transactions,with 5% labelled as fraudulent,simulating realistic fraud rates and behavioural features,including transaction time,origin,amount,and geo-location.The proposed model demonstrated a significant improvement over baseline classifiers,achieving an accuracy of 99%,a precision of 99%,a recall of 97%,and an F1-score of 99%.Compared to individual models,it yielded a 9% gain in overall detection accuracy.It reduced the false positive rate to below 3.5%,thereby minimising the operational costs associated with manually reviewing false alerts.The model’s interpretability is enhanced by the integration of Shapley Additive Explanations(SHAP)values for feature importance,supporting transparency and regulatory auditability.These results affirm the practical relevance of the proposed system for deployment in real-time fraud detection scenarios such as credit card transactions,mobile banking,and cross-border payments.The study also highlights future directions,including the deployment of lightweight models and the integration of multimodal data for scalable fraud analytics.
文摘Recent advancements in computational and database technologies have led to the exponential growth of large-scale medical datasets,significantly increasing data complexity and dimensionality in medical diagnostics.Efficient feature selection methods are critical for improving diagnostic accuracy,reducing computational costs,and enhancing the interpretability of predictive models.Particle Swarm Optimization(PSO),a widely used metaheuristic inspired by swarm intelligence,has shown considerable promise in feature selection tasks.However,conventional PSO often suffers from premature convergence and limited exploration capabilities,particularly in high-dimensional spaces.To overcome these limitations,this study proposes an enhanced PSO framework incorporating Orthogonal Initializa-tion and a Crossover Operator(OrPSOC).Orthogonal Initialization ensures a diverse and uniformly distributed initial particle population,substantially improving the algorithm’s exploration capability.The Crossover Operator,inspired by genetic algorithms,introduces additional diversity during the search process,effectively mitigating premature convergence and enhancing global search performance.The effectiveness of OrPSOC was rigorously evaluated on three benchmark medical datasets—Colon,Leukemia,and Prostate Tumor.Comparative analyses were conducted against traditional filter-based methods,including Fast Clustering-Based Feature Selection Technique(Fast-C),Minimum Redundancy Maximum Relevance(MinRedMaxRel),and Five-Way Joint Mutual Information(FJMI),as well as prominent metaheuristic algorithms such as standard PSO,Ant Colony Optimization(ACO),Comprehensive Learning Gravitational Search Algorithm(CLGSA),and Fuzzy-Based CLGSA(FCLGSA).Experimental results demonstrated that OrPSOC consistently outperformed these existing methods in terms of classification accuracy,computational efficiency,and result stability,achieving significant improvements even with fewer selected features.Additionally,a sensitivity analysis of the crossover parameter provided valuable insights into parameter tuning and its impact on model performance.These findings highlight the superiority and robustness of the proposed OrPSOC approach for feature selection in medical diagnostic applications and underscore its potential for broader adoption in various high-dimensional,data-driven fields.
基金supported by the National Natural Science Foundation of China(Grant Nos.62071451,62331025,and U21A20447)the National Key Research and Development Project(Grant No.2021YFC3002204)the CAMS Innovation Fund for Medical Sciences(Grant No.2019-I2M-5-019).
文摘Dementias such as Alzheimer disease(AD)and mild cognitive impairment(MCI)lead to problems with memory,language,and daily activities resulting from damage to neurons in the brain.Given the irreversibility of this neuronal damage,it is crucial to find a biomarker to distinguish individuals with these diseases from healthy people.In this study,we construct a brain function network based on electroencephalography data to study changes in AD and MCI patients.Using a graph-theoretical approach,we examine connectivity features and explore their contributions to dementia recognition at edge,node,and network levels.We find that connectivity is reduced in AD and MCI patients compared with healthy controls.We also find that the edge-level features give the best performance when machine learning models are used to recognize dementia.The results of feature selection identify the top 50 ranked edge-level features constituting an optimal subset,which is mainly connected with the frontal nodes.A threshold analysis reveals that the performance of edge-level features is more sensitive to the threshold for the connection strength than that of node-and network-level features.In addition,edge-level features with a threshold of 0 provide the most effective dementia recognition.The K-nearest neighbors(KNN)machine learning model achieves the highest accuracy of 0.978 with the optimal subset when the threshold is 0.Visualization of edge-level features suggests that there are more long connections linking the frontal region with the occipital and parietal regions in AD and MCI patients compared with healthy controls.Our codes are publicly available at https://github.com/Debbie-85/eeg-connectivity.
基金the Shenzhen Fundamental Research Fund(No.JCYJ20210324122801005)the Fundamental Research Funds for the Central Universities(No.HIT.OCEF.2023022).
文摘The complex compositions of high-entropy alloys(HEAs)enable a variety of phase structures like FCC single phase,BCC single phase,or duplex FCC+BCC phase.Accurate and efficient prediction of phase structure is crucial for accelerating the discovery of new components and designing HEAs with desired phase structure.In this work,five machine learning strategies were utilized to predict the phase structures of HEAs with a dataset of 296.Specifically,a two-step feature selection strategy was proposed,enabling pronounced improvement in the computational efficiency from 2047 to 12 iterations for each model while ensuring fewer input features and higher prediction accuracy.Compared with traditional valence electron concentration criterion,the prediction accuracy of collected dataset was highly improved from 0.79 to 0.98 for random forest.Furthermore,HEAs with compositions of Al_(x)CoCu_(6)Ni_(6)Fe_(6)(x=1,3,6)were developed to validate the prediction results of machine learning models,and the mechanical properties as well as corrosion resistance were investigated.It is found that the higher Al content enhances the yield strength but deteriorates corrosion resistance.The present two-step feature selection strategy provides an alternative method that is feasible for predicting the phase structure of HEAs with high efficiency and accuracy.
基金funded by the Deanship of Research andGraduate Studies at King Khalid University through the Large Research Project under grant number RGP2/417/46.
文摘Metaheuristic optimization methods are iterative search processes that aim to efficiently solve complexoptimization problems. These basically find the solution space very efficiently, often without utilizing the gradientinformation, and are inspired by the bio-inspired and socially motivated heuristics. Metaheuristic optimizationalgorithms are increasingly applied to complex feature selection problems in high-dimensional medical datasets.Among these, Teaching-Learning-Based optimization (TLBO) has proven effective for continuous design tasks bybalancing exploration and exploitation phases. However, its binary version (BTLBO) suffers from limited exploitationability, often converging prematurely or getting trapped in local optima, particularly when applied to discrete featureselection tasks. Previous studies reported that BTLBO yields lower classification accuracy and higher feature subsetvariance compared to other hybrid methods in benchmark tests, motivating the development of hybrid approaches.This study proposes a novel hybrid algorithm, BTLBO-Cheetah Optimizer (BTLBO-CO), which integrates the globalexploration strength of BTLBO with the local exploitation efficiency of the Cheetah Optimization (CO) algorithm. Theobjective is to enhance the feature selection process for cancer classification tasks involving high-dimensional data. Theproposed BTLBO-CO algorithm was evaluated on six benchmark cancer datasets: 11 tumors (T), Lung Cancer (LUC),Leukemia (LEU), Small Round Blue Cell Tumor or SRBCT (SR), Diffuse Large B-cell Lymphoma or DLBCL (DL), andProstate Tumor (PT).The results demonstrate superior classification accuracy across all six datasets, achieving 93.71%,96.12%, 98.13%, 97.11%, 98.44%, and 98.84%, respectively.These results validate the effectiveness of the hybrid approachin addressing diverse feature selection challenges using a Support Vector Machine (SVM) classifier.
基金supported by the Anhui Provincial Department of Education University Research Project(2024AH051375)Research Project of Chizhou University(CZ2022ZRZ06)+1 种基金Anhui Province Natural Science Research Project of Colleges and Universities(2024AH051368)Excellent Scientific Research and Innovation Team of Anhui Colleges(2022AH010098).
文摘Feature selection methods rooted in rough sets confront two notable limitations:their high computa-tional complexity and sensitivity to noise,rendering them impractical for managing large-scale and noisy datasets.The primary issue stems from these methods’undue reliance on all samples.To overcome these challenges,we introduce the concept of cross-similarity grounded in a robust fuzzy relation and design a rapid and robust feature selection algorithm.Firstly,we construct a robust fuzzy relation by introducing a truncation parameter.Then,based on this fuzzy relation,we propose the concept of cross-similarity,which emphasizes the sample-to-sample similarity relations that uniquely determine feature importance,rather than considering all such relations equally.After studying the manifestations and properties of cross-similarity across different fuzzy granularities,we propose a forward greedy feature selection algorithm that leverages cross-similarity as the foundation for information measurement.This algorithm significantly reduces the time complexity from O(m2n2)to O(mn2).Experimental findings reveal that the average runtime of five state-of-the-art comparison algorithms is roughly 3.7 times longer than our algorithm,while our algorithm achieves an average accuracy that surpasses those of the five comparison algorithms by approximately 3.52%.This underscores the effectiveness of our approach.This paper paves the way for applying feature selection algorithms grounded in fuzzy rough sets to large-scale gene datasets.
文摘In the evolving landscape of cyber threats,phishing attacks pose significant challenges,particularly through deceptive webpages designed to extract sensitive information under the guise of legitimacy.Conventional and machine learning(ML)-based detection systems struggle to detect phishing websites owing to their constantly changing tactics.Furthermore,newer phishing websites exhibit subtle and expertly concealed indicators that are not readily detectable.Hence,effective detection depends on identifying the most critical features.Traditional feature selection(FS)methods often struggle to enhance ML model performance and instead decrease it.To combat these issues,we propose an innovative method using explainable AI(XAI)to enhance FS in ML models and improve the identification of phishing websites.Specifically,we employ SHapley Additive exPlanations(SHAP)for global perspective and aggregated local interpretable model-agnostic explanations(LIME)to deter-mine specific localized patterns.The proposed SHAP and LIME-aggregated FS(SLA-FS)framework pinpoints the most informative features,enabling more precise,swift,and adaptable phishing detection.Applying this approach to an up-to-date web phishing dataset,we evaluate the performance of three ML models before and after FS to assess their effectiveness.Our findings reveal that random forest(RF),with an accuracy of 97.41%and XGBoost(XGB)at 97.21%significantly benefit from the SLA-FS framework,while k-nearest neighbors lags.Our framework increases the accuracy of RF and XGB by 0.65%and 0.41%,respectively,outperforming traditional filter or wrapper methods and any prior methods evaluated on this dataset,showcasing its potential.
基金financially supported by the Deanship of Scientific Research and Graduate Studies at King Khalid University under research grant number(R.G.P.2/21/46)in part by the Deanship of Scientific Research,Vice Presidency for Graduate Studies and Scientific Research,King Faisal University,Saudi Arabia,under Grant KFU253116.
文摘Today,phishing is an online attack designed to obtain sensitive information such as credit card and bank account numbers,passwords,and usernames.We can find several anti-phishing solutions,such as heuristic detection,virtual similarity detection,black and white lists,and machine learning(ML).However,phishing attempts remain a problem,and establishing an effective anti-phishing strategy is a work in progress.Furthermore,while most antiphishing solutions achieve the highest levels of accuracy on a given dataset,their methods suffer from an increased number of false positives.These methods are ineffective against zero-hour attacks.Phishing sites with a high False Positive Rate(FPR)are considered genuine because they can cause people to lose a lot ofmoney by visiting them.Feature selection is critical when developing phishing detection strategies.Good feature selection helps improve accuracy;however,duplicate features can also increase noise in the dataset and reduce the accuracy of the algorithm.Therefore,a combination of filter-based feature selection methods is proposed to detect phishing attacks,including constant feature removal,duplicate feature removal,quasi-feature removal,correlated feature removal,mutual information extraction,and Analysis of Variance(ANOVA)testing.The technique has been tested with differentMachine Learning classifiers:Random Forest,Artificial Neural Network(ANN),Ada-Boost,Extreme Gradient Boosting(XGBoost),Logistic Regression,Decision Trees,Gradient Boosting Classifiers,Support Vector Machine(SVM),and two types of ensemble models,stacking and majority voting to gain A low false positive rate is achieved.Stacked ensemble classifiers(gradient boosting,randomforest,support vector machine)achieve 1.31%FPR and 98.17%accuracy on Dataset 1,2.81%FPR and Dataset 3 shows 2.81%FPR and 97.61%accuracy,while Dataset 2 shows 3.47%FPR and 96.47%accuracy.
基金financially supported by the National Natural Science Foundation of China(22273096)the International Postdoctoral Exchange Fellowship Program between Helmholtz and OCPC(ZD2023019)+1 种基金the Young Scientists Fund of the National Natural Science Foundation of China(Grant No.22409139)the Sichuan Provincial Natural Science Foundation for Young Scientists(24NSFSC607)。
文摘Lithium-ion batteries are essential for renewable energy storage,necessitating efficient battery management systems(BMS)for optimal performance and longevity.Accurate estimation of the state of health(SOH)is crucial for BMS safety,yet current machine learning-based SOH estimation relying on global aging features often overlooks localized degradation patterns.In this study,we introduce a novel SOH estimation pipeline that integrates voltage-range-specific segmentation with a multi-stage,crossvalidation-driven localized feature-selection framework and a feature-augmented dual-stream fusion network.Our methodology partitions full-range voltage into localized intervals to construct a degradation-sensitive feature library,from which 4 optimal features are identified from a set of 336 candidates.These selected features are combined with raw voltage signals via a dual-stream architecture that employs a dynamic gating mechanism to recalibrate feature contributions during training.Crossvalidation-based evaluation on datasets encompassing different chemistries and charge/discharge protocols demonstrate that our approach can achieve lower average root-mean-square-error(Oxford dataset:0.7201%,Massachusetts Institute of Technology(MIT)dataset:0.7184%)compared to baseline models.An in-depth analysis of the physical significance of the screened features improves the interpretability of the features.This work underscores the significant potential of leveraging localized feature enhancement in SOH estimation by systematically integrating degradation-sensitive features,thereby offering precise estimation.
基金supported by the National Key Research and Development Program of China(Grant No.2021YFB3702404)the National Natural Science Foundation of China(Grant No.52104370)+4 种基金the Reviving-Liaoning Excellence Plan(XLYC2203186)Science and Technology Special Projects of Liaoning Province(Grant No.2022JH25/10200001)the Postdoctoral Research Fund for Northeastern(Grant No.20210203)Independent Projects of Basic Scientific Research(ZZ2021005)CITIC Niobium Steel Development Award Fund(2022-M1824).
文摘Selecting proper descriptors(also known feature selection,FS)is key in the process of establishing mechanical properties prediction model of hot-rolled microalloyed steels by using machine learning(ML)algorithm.FS methods based on data-driving can reduce the redundancy of data features and improve the prediction accuracy of mechanical properties.Based on the collected data of hot-rolled microalloyed steels,the association rules are used to mine the correlation information between the data.High-quality feature subsets are selected by the proposed FS method(FS method based on genetic algorithm embedding,GAMIC).Compared with the common FS method,it is shown on dataset that GAMIC selects feature subsets more appropriately.Six different ML algorithms are trained and tested for mechanical properties prediction.The result shows that the root-mean-square error of yield strength,tensile strength and elongation based on limit gradient enhancement(XGBoost)algorithm is 21.95 MPa,20.85 MPa and 1.96%,the correlation coefficient(R^(2))is 0.969,0.968 and 0.830,and the mean absolute error is 16.84 MPa,15.83 MPa and 1.48%,respectively,showing the best prediction performance.Finally,SHapley Additive exPlanation is used to further explore the influence of feature variables on mechanical properties.GAMIC feature selection method proposed is universal,which provides a basis for the development of high-precision mechanical property prediction model.
文摘In recent years, particle swarm optimization (PSO) has received widespread attention in feature selection due to its simplicity and potential for global search. However, in traditional PSO, particles primarily update based on two extreme values: personal best and global best, which limits the diversity of information. Ideally, particles should learn from multiple advantageous particles to enhance interactivity and optimization efficiency. Accordingly, this paper proposes a PSO that simulates the evolutionary dynamics of species survival in mountain peak ecology (PEPSO) for feature selection. Based on the pyramid topology, the algorithm simulates the features of mountain peak ecology in nature and the competitive-cooperative strategies among species. According to the principles of the algorithm, the population is first adaptively divided into many subgroups based on the fitness level of particles. Then, particles within each subgroup are divided into three different types based on their evolutionary levels, employing different adaptive inertia weight rules and dynamic learning mechanisms to define distinct learning modes. Consequently, all particles play their respective roles in promoting the global optimization performance of the algorithm, similar to different species in the ecological pattern of mountain peaks. Experimental validation of the PEPSO performance was conducted on 18 public datasets. The experimental results demonstrate that the PEPSO outperforms other PSO variant-based feature selection methods and mainstream feature selection methods based on intelligent optimization algorithms in terms of overall performance in global search capability, classification accuracy, and reduction of feature space dimensions. Wilcoxon signed-rank test also confirms the excellent performance of the PEPSO.
文摘Object detection plays a critical role in drone imagery analysis,especially in remote sensing applications where accurate and efficient detection of small objects is essential.Despite significant advancements in drone imagery detection,most models still struggle with small object detection due to challenges such as object size,complex backgrounds.To address these issues,we propose a robust detection model based on You Only Look Once(YOLO)that balances accuracy and efficiency.The model mainly contains several major innovation:feature selection pyramid network,Inner-Shape Intersection over Union(ISIoU)loss function and small object detection head.To overcome the limitations of traditional fusion methods in handling multi-level features,we introduce a Feature Selection Pyramid Network integrated into the Neck component,which preserves shallow feature details critical for detecting small objects.Additionally,recognizing that deep network structures often neglect or degrade small object features,we design a specialized small object detection head in the shallow layers to enhance detection accuracy for these challenging targets.To effectively model both local and global dependencies,we introduce a Conv-Former module that simulates Transformer mechanisms using a convolutional structure,thereby improving feature enhancement.Furthermore,we employ ISIoU to address object imbalance and scale variation This approach accelerates model conver-gence and improves regression accuracy.Experimental results show that,compared to the baseline model,the proposed method significantly improves small object detection performance on the VisDrone2019 dataset,with mAP@50 increasing by 4.9%and mAP@50-95 rising by 6.7%.This model also outperforms other state-of-the-art algorithms,demonstrating its reliability and effectiveness in both small object detection and remote sensing image fusion tasks.
基金supported by the Fundamental Research Funds for the Central Universities of China(No.300102122105)the Natural Science Basic Research Plan in Shaanxi Province of China(2023-JC-YB-023).
文摘Feature selection(FS)is essential in machine learning(ML)and data mapping by its ability to preprocess high-dimensional data.By selecting a subset of relevant features,feature selection cuts down on the dimension of the data.It excludes irrelevant or surplus features,thus boosting the performance and efficiency of the model.Particle Swarm Optimization(PSO)boasts a streamlined algorithmic framework and exhibits rapid convergence traits.Compared with other algorithms,it incurs reduced computational expenses when tackling high-dimensional datasets.However,PSO faces challenges like inadequate convergence precision.Therefore,regarding FS problems,this paper presents a binary version enhanced PSO based on the Support Vector Machines(SVM)classifier.First,the Sand Cat Swarm Optimization(SCSO)is added to enhance the global search capability of PSO and improve the accuracy of the solution.Secondly,the Latin hypercube sampling strategy initializes populations more uniformly and helps to increase population diversity.The last is the roundup search strategy introducing the grey wolf hierarchy idea to help improve convergence speed.To verify the capability of Self-adaptive Cooperative Particle Swarm Optimization(SCPSO),the CEC2020 test suite and CEC2022 test suite are selected for experiments and applied to three engineering problems.Compared with the standard PSO algorithm,SCPSO converges faster,and the convergence accuracy is significantly improved.Moreover,SCPSO’s comprehensive performance far exceeds that of other algorithms.Six datasets from the University of California,Irvine(UCI)database were selected to evaluate SCPSO’s effectiveness in solving feature selection problems.The results indicate that SCPSO has significant potential for addressing these problems.
基金funded by Deanship of Graduate studies and Scientific Research at Jouf University under grant No.(DGSSR-2024-02-01264).
文摘Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic poses distinct challenges due to the language’s complex morphology,diglossia,and the scarcity of annotated datasets.This paper presents a hybrid approach to Arabic AES by combining text-based,vector-based,and embeddingbased similarity measures to improve essay scoring accuracy while minimizing the training data required.Using a large Arabic essay dataset categorized into thematic groups,the study conducted four experiments to evaluate the impact of feature selection,data size,and model performance.Experiment 1 established a baseline using a non-machine learning approach,selecting top-N correlated features to predict essay scores.The subsequent experiments employed 5-fold cross-validation.Experiment 2 showed that combining embedding-based,text-based,and vector-based features in a Random Forest(RF)model achieved an R2 of 88.92%and an accuracy of 83.3%within a 0.5-point tolerance.Experiment 3 further refined the feature selection process,demonstrating that 19 correlated features yielded optimal results,improving R2 to 88.95%.In Experiment 4,an optimal data efficiency training approach was introduced,where training data portions increased from 5%to 50%.The study found that using just 10%of the data achieved near-peak performance,with an R2 of 85.49%,emphasizing an effective trade-off between performance and computational costs.These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems,especially in low-resource environments,addressing linguistic challenges while ensuring efficient data usage.