In the area of pattern recognition and machine learning,features play a key role in prediction.The famous applications of features are medical imaging,image classification,and name a few more.With the exponential grow...In the area of pattern recognition and machine learning,features play a key role in prediction.The famous applications of features are medical imaging,image classification,and name a few more.With the exponential growth of information investments in medical data repositories and health service provision,medical institutions are collecting large volumes of data.These data repositories contain details information essential to support medical diagnostic decisions and also improve patient care quality.On the other hand,this growth also made it difficult to comprehend and utilize data for various purposes.The results of imaging data can become biased because of extraneous features present in larger datasets.Feature selection gives a chance to decrease the number of components in such large datasets.Through selection techniques,ousting the unimportant features and selecting a subset of components that produces prevalent characterization precision.The correct decision to find a good attribute produces a precise grouping model,which enhances learning pace and forecast control.This paper presents a review of feature selection techniques and attributes selection measures for medical imaging.This review is meant to describe feature selection techniques in a medical domainwith their pros and cons and to signify its application in imaging data and data mining algorithms.The review reveals the shortcomings of the existing feature and attributes selection techniques to multi-sourced data.Moreover,this review provides the importance of feature selection for correct classification of medical infections.In the end,critical analysis and future directions are provided.展开更多
Fruit diseases seriously affect the production of the agricultural sector,which builds financial pressure on the country’s economy.The manual inspection of fruit diseases is a chaotic process that is both time and co...Fruit diseases seriously affect the production of the agricultural sector,which builds financial pressure on the country’s economy.The manual inspection of fruit diseases is a chaotic process that is both time and cost-consuming since it involves an accurate manual inspection by an expert.Hence,it is essential that an automated computerised approach is developed to recognise fruit diseases based on leaf images.According to the literature,many automated methods have been developed for the recognition of fruit diseases at the early stage.However,these techniques still face some challenges,such as the similar symptoms of different fruit diseases and the selection of irrelevant features.Image processing and deep learning techniques have been extremely successful in the last decade,but there is still room for improvement due to these challenges.Therefore,we propose a novel computerised approach in this work using deep learning and featuring an ant colony optimisation(ACO)based selection.The proposed method consists of four fundamental steps:data augmentation to solve the imbalanced dataset,fine-tuned pretrained deep learning models(NasNetMobile andMobileNet-V2),the fusion of extracted deep features using matrix length,and finally,a selection of the best features using a hybrid ACO and a Neighbourhood Component Analysis(NCA).The best-selected features were eventually passed to many classifiers for final recognition.The experimental process involved an augmented dataset and achieved an average accuracy of 99.7%.Comparison with existing techniques showed that the proposed method was effective.展开更多
Gait recognition is an active research area that uses a walking theme to identify the subject correctly.Human Gait Recognition(HGR)is performed without any cooperation from the individual.However,in practice,it remain...Gait recognition is an active research area that uses a walking theme to identify the subject correctly.Human Gait Recognition(HGR)is performed without any cooperation from the individual.However,in practice,it remains a challenging task under diverse walking sequences due to the covariant factors such as normal walking and walking with wearing a coat.Researchers,over the years,have worked on successfully identifying subjects using different techniques,but there is still room for improvement in accuracy due to these covariant factors.This paper proposes an automated model-free framework for human gait recognition in this article.There are a few critical steps in the proposed method.Firstly,optical flow-based motion region esti-mation and dynamic coordinates-based cropping are performed.The second step involves training a fine-tuned pre-trained MobileNetV2 model on both original and optical flow cropped frames;the training has been conducted using static hyperparameters.The third step proposed a fusion technique known as normal distribution serially fusion.In the fourth step,a better optimization algorithm is applied to select the best features,which are then classified using a Bi-Layered neural network.Three publicly available datasets,CASIA A,CASIA B,and CASIA C,were used in the experimental process and obtained average accuracies of 99.6%,91.6%,and 95.02%,respectively.The proposed framework has achieved improved accuracy compared to the other methods.展开更多
The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more e...The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more efficient and reliable intrusion detection systems(IDSs).However,the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs.Many researchers used data preprocessing techniques such as feature selection and normalization to overcome such issues.While most of these researchers reported the success of these preprocessing techniques on a shallow level,very few studies have been performed on their effects on a wider scale.Furthermore,the performance of an IDS model is subject to not only the utilized preprocessing techniques but also the dataset and the ML/DL algorithm used,which most of the existing studies give little emphasis on.Thus,this study provides an in-depth analysis of feature selection and normalization effects on IDS models built using three IDS datasets:NSL-KDD,UNSW-NB15,and CSE–CIC–IDS2018,and various AI algorithms.A wrapper-based approach,which tends to give superior performance,and min-max normalization methods were used for feature selection and normalization,respectively.Numerous IDS models were implemented using the full and feature-selected copies of the datasets with and without normalization.The models were evaluated using popular evaluation metrics in IDS modeling,intra-and inter-model comparisons were performed between models and with state-of-the-art works.Random forest(RF)models performed better on NSL-KDD and UNSW-NB15 datasets with accuracies of 99.86%and 96.01%,respectively,whereas artificial neural network(ANN)achieved the best accuracy of 95.43%on the CSE–CIC–IDS2018 dataset.The RF models also achieved an excellent performance compared to recent works.The results show that normalization and feature selection positively affect IDS modeling.Furthermore,while feature selection benefits simpler algorithms(such as RF),normalization is more useful for complex algorithms like ANNs and deep neural networks(DNNs),and algorithms such as Naive Bayes are unsuitable for IDS modeling.The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDS than the NSL-KDD dataset.Our findings suggest that prioritizing robust algorithms like RF,alongside complex models such as ANN and DNN,can significantly enhance IDS performance.These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.展开更多
Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify sp...Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify specific flaws/diseases for diagnosis.The primary concern of ML applications is the precise selection of flexible image features for pattern detection and region classification.Most of the extracted image features are irrelevant and lead to an increase in computation time.Therefore,this article uses an analytical learning paradigm to design a Congruent Feature Selection Method to select the most relevant image features.This process trains the learning paradigm using similarity and correlation-based features over different textural intensities and pixel distributions.The similarity between the pixels over the various distribution patterns with high indexes is recommended for disease diagnosis.Later,the correlation based on intensity and distribution is analyzed to improve the feature selection congruency.Therefore,the more congruent pixels are sorted in the descending order of the selection,which identifies better regions than the distribution.Now,the learning paradigm is trained using intensity and region-based similarity to maximize the chances of selection.Therefore,the probability of feature selection,regardless of the textures and medical image patterns,is improved.This process enhances the performance of ML applications for different medical image processing.The proposed method improves the accuracy,precision,and training rate by 13.19%,10.69%,and 11.06%,respectively,compared to other models for the selected dataset.The mean error and selection time is also reduced by 12.56%and 13.56%,respectively,compared to the same models and dataset.展开更多
Purpose-Single-shot multi-category clothing recognition and retrieval play a crucial role in online searching and offline settlement scenarios.Existing clothing recognition methods based on RGBD clothing images often ...Purpose-Single-shot multi-category clothing recognition and retrieval play a crucial role in online searching and offline settlement scenarios.Existing clothing recognition methods based on RGBD clothing images often suffer from high-dimensional feature representations,leading to compromised performance and efficiency.Design/methodology/approach-To address this issue,this paper proposes a novel method called Manifold Embedded Discriminative Feature Selection(MEDFS)to select global and local features,thereby reducing the dimensionality of the feature representation and improving performance.Specifically,by combining three global features and three local features,a low-dimensional embedding is constructed to capture the correlations between features and categories.The MEDFS method designs an optimization framework utilizing manifold mapping and sparse regularization to achieve feature selection.The optimization objective is solved using an alternating iterative strategy,ensuring convergence.Findings-Empirical studies conducted on a publicly available RGBD clothing image dataset demonstrate that the proposed MEDFS method achieves highly competitive clothing classification performance while maintaining efficiency in clothing recognition and retrieval.Originality/value-This paper introduces a novel approach for multi-category clothing recognition and retrieval,incorporating the selection of global and local features.The proposed method holds potential for practical applications in real-world clothing scenarios.展开更多
This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—cove...This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—covering momentum,volatility,volume,and trend-related technical indicators—are subjected to three distinct feature selection approaches.Specifically,mutual information(MI),recursive feature elimination(RFE),and random forest importance(RFI).By extracting an optimal set of 20 predictors,the proposed framework aims to mitigate redundancy and overfitting while enhancing interpretability.These feature subsets are integrated into support vector regression(SVR),Huber regressors,and k-nearest neighbors(KNN)models to forecast the prices of three leading cryptocurrencies—Bitcoin(BTC/USDT),Ethereum(ETH/USDT),and Binance Coin(BNB/USDT)—across horizons ranging from 1 to 20 days.Model evaluation employs the coefficient of determination(R2)and the root mean squared logarithmic error(RMSLE),alongside a walk-forward validation scheme to approximate real-world trading contexts.Empirical results indicate that incorporating momentum and volatility measures substantially improves predictive accuracy,with particularly pronounced effects observed at longer forecast windows.Moreover,indicators related to volume and trend provide incremental benefits in select market conditions.Notably,an 80%–85% reduction in the original feature set frequently maintains or enhances model performance relative to the complete indicator set.These findings highlight the critical role of targeted feature selection in addressing high-dimensional financial data challenges while preserving model robustness.This research advances the field of cryptocurrency forecasting by offering a rigorous comparison of feature selection methods and their effects on multiple digital assets and prediction horizons.The outcomes highlight the importance of dimension-reduction strategies in developing more efficient and resilient forecasting algorithms.Future efforts should incorporate high-frequency data and explore alternative selection techniques to further refine predictive accuracy in this highly volatile domain.展开更多
Feature selection(FS)plays a crucial role in medical imaging by reducing dimensionality,improving computational efficiency,and enhancing diagnostic accuracy.Traditional FS techniques,including filter,wrapper,and embed...Feature selection(FS)plays a crucial role in medical imaging by reducing dimensionality,improving computational efficiency,and enhancing diagnostic accuracy.Traditional FS techniques,including filter,wrapper,and embedded methods,have been widely used but often struggle with high-dimensional and heterogeneous medical imaging data.Deep learning-based FS methods,particularly Convolutional Neural Networks(CNNs)and autoencoders,have demonstrated superior performance but lack interpretability.Hybrid approaches that combine classical and deep learning techniques have emerged as a promising solution,offering improved accuracy and explainability.Furthermore,integratingmulti-modal imaging data(e.g.,MagneticResonance Imaging(MRI),ComputedTomography(CT),Positron Emission Tomography(PET),and Ultrasound(US))poses additional challenges in FS,necessitating advanced feature fusion strategies.Multi-modal feature fusion combines information fromdifferent imagingmodalities to improve diagnostic accuracy.Recently,quantum computing has gained attention as a revolutionary approach for FS,providing the potential to handle high-dimensional medical data more efficiently.This systematic literature review comprehensively examines classical,Deep Learning(DL),hybrid,and quantum-based FS techniques inmedical imaging.Key outcomes include a structured taxonomy of FS methods,a critical evaluation of their performance across modalities,and identification of core challenges such as computational burden,interpretability,and ethical considerations.Future research directions—such as explainable AI(XAI),federated learning,and quantum-enhanced FS—are also emphasized to bridge the current gaps.This review provides actionable insights for developing scalable,interpretable,and clinically applicable FS methods in the evolving landscape of medical imaging.展开更多
Recent advancements in computational and database technologies have led to the exponential growth of large-scale medical datasets,significantly increasing data complexity and dimensionality in medical diagnostics.Effi...Recent advancements in computational and database technologies have led to the exponential growth of large-scale medical datasets,significantly increasing data complexity and dimensionality in medical diagnostics.Efficient feature selection methods are critical for improving diagnostic accuracy,reducing computational costs,and enhancing the interpretability of predictive models.Particle Swarm Optimization(PSO),a widely used metaheuristic inspired by swarm intelligence,has shown considerable promise in feature selection tasks.However,conventional PSO often suffers from premature convergence and limited exploration capabilities,particularly in high-dimensional spaces.To overcome these limitations,this study proposes an enhanced PSO framework incorporating Orthogonal Initializa-tion and a Crossover Operator(OrPSOC).Orthogonal Initialization ensures a diverse and uniformly distributed initial particle population,substantially improving the algorithm’s exploration capability.The Crossover Operator,inspired by genetic algorithms,introduces additional diversity during the search process,effectively mitigating premature convergence and enhancing global search performance.The effectiveness of OrPSOC was rigorously evaluated on three benchmark medical datasets—Colon,Leukemia,and Prostate Tumor.Comparative analyses were conducted against traditional filter-based methods,including Fast Clustering-Based Feature Selection Technique(Fast-C),Minimum Redundancy Maximum Relevance(MinRedMaxRel),and Five-Way Joint Mutual Information(FJMI),as well as prominent metaheuristic algorithms such as standard PSO,Ant Colony Optimization(ACO),Comprehensive Learning Gravitational Search Algorithm(CLGSA),and Fuzzy-Based CLGSA(FCLGSA).Experimental results demonstrated that OrPSOC consistently outperformed these existing methods in terms of classification accuracy,computational efficiency,and result stability,achieving significant improvements even with fewer selected features.Additionally,a sensitivity analysis of the crossover parameter provided valuable insights into parameter tuning and its impact on model performance.These findings highlight the superiority and robustness of the proposed OrPSOC approach for feature selection in medical diagnostic applications and underscore its potential for broader adoption in various high-dimensional,data-driven fields.展开更多
The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduce...The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduces significant vulnerabilities,including fraud,money laundering,and market manipulation.Traditional anomaly detection techniques often fail to capture the relational and dynamic characteristics of financial data.Graph Neural Networks(GNNs),capable of modeling intricate interdependencies among entities,have emerged as a powerful framework for detecting subtle and sophisticated anomalies.However,the high-dimensionality and inherent noise of FinTech datasets demand robust feature selection strategies to improve model scalability,performance,and interpretability.This paper presents a comprehensive survey of GNN-based approaches for anomaly detection in FinTech,with an emphasis on the synergistic role of feature selection.We examine the theoretical foundations of GNNs,review state-of-the-art feature selection techniques,analyze their integration with GNNs,and categorize prevalent anomaly types in FinTech applications.In addition,we discuss practical implementation challenges,highlight representative case studies,and propose future research directions to advance the field of graph-based anomaly detection in financial systems.展开更多
Advanced Persistent Threats(APTs)represent one of the most complex and dangerous categories of cyber-attacks characterised by their stealthy behaviour,long-term persistence,and ability to bypass traditional detection ...Advanced Persistent Threats(APTs)represent one of the most complex and dangerous categories of cyber-attacks characterised by their stealthy behaviour,long-term persistence,and ability to bypass traditional detection systems.The complexity of real-world network data poses significant challenges in detection.Machine learning models have shown promise in detecting APTs;however,their performance often suffers when trained on large datasets with redundant or irrelevant features.This study presents a novel,hybrid feature selection method designed to improve APT detection by reducing dimensionality while preserving the informative characteristics of the data.It combines Mutual Information(MI),Symmetric Uncertainty(SU)and Minimum Redundancy Maximum Relevance(mRMR)to enhance feature selection.MI and SU assess feature relevance,while mRMR maximises relevance and minimises redundancy,ensuring that the most impactful features are prioritised.This method addresses redundancy among selected features,improving the overall efficiency and effectiveness of the detection model.Experiments on a real-world APT datasets were conducted to evaluate the proposed method.Multiple classifiers including,Random Forest,Support Vector Machine(SVM),Gradient Boosting,and Neural Networks were used to assess classification performance.The results demonstrate that the proposed feature selection method significantly enhances detection accuracy compared to baseline models trained on the full feature set.The Random Forest algorithm achieved the highest performance,with near-perfect accuracy,precision,recall,and F1 scores(99.97%).The proposed adaptive thresholding algorithm within the selection method allows each classifier to benefit from a reduced and optimised feature space,resulting in improved training and predictive performance.This research offers a scalable and classifier-agnostic solution for dimensionality reduction in cybersecurity applications.展开更多
Heart disease prediction is a critical issue in healthcare,where accurate early diagnosis can save lives and reduce healthcare costs.The problem is inherently complex due to the high dimensionality of medical data,irr...Heart disease prediction is a critical issue in healthcare,where accurate early diagnosis can save lives and reduce healthcare costs.The problem is inherently complex due to the high dimensionality of medical data,irrelevant or redundant features,and the variability in risk factors such as age,lifestyle,andmedical history.These challenges often lead to inefficient and less accuratemodels.Traditional predictionmethodologies face limitations in effectively handling large feature sets and optimizing classification performance,which can result in overfitting poor generalization,and high computational cost.This work proposes a novel classification model for heart disease prediction that addresses these challenges by integrating feature selection through a Genetic Algorithm(GA)with an ensemble deep learning approach optimized using the Tunicate Swarm Algorithm(TSA).GA selects the most relevant features,reducing dimensionality and improvingmodel efficiency.Theselected features are then used to train an ensemble of deep learning models,where the TSA optimizes the weight of each model in the ensemble to enhance prediction accuracy.This hybrid approach addresses key challenges in the field,such as high dimensionality,redundant features,and classification performance,by introducing an efficient feature selection mechanism and optimizing the weighting of deep learning models in the ensemble.These enhancements result in a model that achieves superior accuracy,generalization,and efficiency compared to traditional methods.The proposed model demonstrated notable advancements in both prediction accuracy and computational efficiency over traditionalmodels.Specifically,it achieved an accuracy of 97.5%,a sensitivity of 97.2%,and a specificity of 97.8%.Additionally,with a 60-40 data split and 5-fold cross-validation,the model showed a significant reduction in training time(90 s),memory consumption(950 MB),and CPU usage(80%),highlighting its effectiveness in processing large,complex medical datasets for heart disease prediction.展开更多
This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection,unsupervised clustering,and ensemble learning to improve classification performance in financial transaction m...This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection,unsupervised clustering,and ensemble learning to improve classification performance in financial transaction monitoring systems.The framework is structured into three core layers:(1)feature selection using Recursive Feature Elimination(RFE),Principal Component Analysis(PCA),and Mutual Information(MI)to reduce dimensionality and enhance input relevance;(2)anomaly detection through unsupervised clustering using K-Means,Density-Based Spatial Clustering(DBSCAN),and Hierarchical Clustering to flag suspicious patterns in unlabeled data;and(3)final classification using a voting-based hybrid ensemble of Support Vector Machine(SVM),Random Forest(RF),and Gradient Boosting Classifier(GBC).The experimental evaluation is conducted on a synthetically generated dataset comprising one million financial transactions,with 5% labelled as fraudulent,simulating realistic fraud rates and behavioural features,including transaction time,origin,amount,and geo-location.The proposed model demonstrated a significant improvement over baseline classifiers,achieving an accuracy of 99%,a precision of 99%,a recall of 97%,and an F1-score of 99%.Compared to individual models,it yielded a 9% gain in overall detection accuracy.It reduced the false positive rate to below 3.5%,thereby minimising the operational costs associated with manually reviewing false alerts.The model’s interpretability is enhanced by the integration of Shapley Additive Explanations(SHAP)values for feature importance,supporting transparency and regulatory auditability.These results affirm the practical relevance of the proposed system for deployment in real-time fraud detection scenarios such as credit card transactions,mobile banking,and cross-border payments.The study also highlights future directions,including the deployment of lightweight models and the integration of multimodal data for scalable fraud analytics.展开更多
Today,phishing is an online attack designed to obtain sensitive information such as credit card and bank account numbers,passwords,and usernames.We can find several anti-phishing solutions,such as heuristic detection,...Today,phishing is an online attack designed to obtain sensitive information such as credit card and bank account numbers,passwords,and usernames.We can find several anti-phishing solutions,such as heuristic detection,virtual similarity detection,black and white lists,and machine learning(ML).However,phishing attempts remain a problem,and establishing an effective anti-phishing strategy is a work in progress.Furthermore,while most antiphishing solutions achieve the highest levels of accuracy on a given dataset,their methods suffer from an increased number of false positives.These methods are ineffective against zero-hour attacks.Phishing sites with a high False Positive Rate(FPR)are considered genuine because they can cause people to lose a lot ofmoney by visiting them.Feature selection is critical when developing phishing detection strategies.Good feature selection helps improve accuracy;however,duplicate features can also increase noise in the dataset and reduce the accuracy of the algorithm.Therefore,a combination of filter-based feature selection methods is proposed to detect phishing attacks,including constant feature removal,duplicate feature removal,quasi-feature removal,correlated feature removal,mutual information extraction,and Analysis of Variance(ANOVA)testing.The technique has been tested with differentMachine Learning classifiers:Random Forest,Artificial Neural Network(ANN),Ada-Boost,Extreme Gradient Boosting(XGBoost),Logistic Regression,Decision Trees,Gradient Boosting Classifiers,Support Vector Machine(SVM),and two types of ensemble models,stacking and majority voting to gain A low false positive rate is achieved.Stacked ensemble classifiers(gradient boosting,randomforest,support vector machine)achieve 1.31%FPR and 98.17%accuracy on Dataset 1,2.81%FPR and Dataset 3 shows 2.81%FPR and 97.61%accuracy,while Dataset 2 shows 3.47%FPR and 96.47%accuracy.展开更多
Soil moisture is a key parameter in the exchange of energy and water between the land surface and the atmosphere.This parameter plays an important role in the dynamics of permafrost on the Qinghai-Xizang Plateau,China...Soil moisture is a key parameter in the exchange of energy and water between the land surface and the atmosphere.This parameter plays an important role in the dynamics of permafrost on the Qinghai-Xizang Plateau,China,as well as in the related ecological and hydrological processes.However,the region's complex terrain and extreme climatic conditions result in low-accuracy soil moisture estimations using traditional remote sensing techniques.Thus,this study considered parameters of the backscatter coefficient of Sentinel-1A ground range detected(GRD)data,the polarization decomposition parameters of Sentinel-1A single-look complex(SLC)data,the normalized difference vegetation index(NDVI)based on Sentinel-2B data,and the topographic factors based on digital elevation model(DEM)data.By combining these parameters with a machine learning model,we established a feature selection rule.A cumulative importance threshold was derived for feature variables,and those variables that failed to meet the threshold were eliminated based on variations in the coefficient of determination(R^(2))and the unbiased root mean square error(ubRMSE).The eight most influential variables were selected and combined with the CatBoost model for soil moisture inversion,and the SHapley Additive exPlanations(SHAP)method was used to analyze the importance of these variables.The results demonstrated that the optimized model significantly improved the accuracy of soil moisture inversion.Compared to the unfiltered model,the optimal feature combination led to a 0.09 increase in R^(2)and a 0.7%reduction in ubRMSE.Ultimately,the optimized model achieved a R²of 0.87 and an ubRMSE of 5.6%.Analysis revealed that soil particle size had significant impact on soil water retention capacity.The impact of vegetation on the estimated soil moisture on the Qinghai-Xizang Plateau was considerable,demonstrating a significant positive correlation.Moreover,the microtopographical features of hummocks interfered with soil moisture estimation,indicating that such terrain effects warrant increased attention in future studies within the permafrost regions.The developed method not only enhances the accuracy of soil moisture retrieval in the complex terrain of the Qinghai-Xizang Plateau,but also exhibits high computational efficiency(with a relative time reduction of 18.5%),striking an excellent balance between accuracy and efficiency.This approach provides a robust framework for efficient soil moisture monitoring in remote areas with limited ground data,offering critical insights for ecological conservation,water resource management,and climate change adaptation on the Qinghai-Xizang Plateau.展开更多
Feature selection(FS)is a pivotal pre-processing step in developing data-driven models,influencing reliability,performance and optimization.Although existing FS techniques can yield high-performance metrics for certai...Feature selection(FS)is a pivotal pre-processing step in developing data-driven models,influencing reliability,performance and optimization.Although existing FS techniques can yield high-performance metrics for certain models,they do not invariably guarantee the extraction of the most critical or impactful features.Prior literature underscores the significance of equitable FS practices and has proposed diverse methodologies for the identification of appropriate features.However,the challenge of discerning the most relevant and influential features persists,particularly in the context of the exponential growth and heterogeneity of big data—a challenge that is increasingly salient in modern artificial intelligence(AI)applications.In response,this study introduces an innovative,automated statistical method termed Farea Similarity for Feature Selection(FSFS).The FSFS approach computes a similarity metric for each feature by benchmarking it against the record-wise mean,thereby finding feature dependencies and mitigating the influence of outliers that could potentially distort evaluation outcomes.Features are subsequently ranked according to their similarity scores,with the threshold established at the average similarity score.Notably,lower FSFS values indicate higher similarity and stronger data correlations,whereas higher values suggest lower similarity.The FSFS method is designed not only to yield reliable evaluation metrics but also to reduce data complexity without compromising model performance.Comparative analyses were performed against several established techniques,including Chi-squared(CS),Correlation Coefficient(CC),Genetic Algorithm(GA),Exhaustive Approach,Greedy Stepwise Approach,Gain Ratio,and Filtered Subset Eval,using a variety of datasets such as the Experimental Dataset,Breast Cancer Wisconsin(Original),KDD CUP 1999,NSL-KDD,UNSW-NB15,and Edge-IIoT.In the absence of the FSFS method,the highest classifier accuracies observed were 60.00%,95.13%,97.02%,98.17%,95.86%,and 94.62%for the respective datasets.When the FSFS technique was integrated with data normalization,encoding,balancing,and feature importance selection processes,accuracies improved to 100.00%,97.81%,98.63%,98.94%,94.27%,and 98.46%,respectively.The FSFS method,with a computational complexity of O(fn log n),demonstrates robust scalability and is well-suited for datasets of large size,ensuring efficient processing even when the number of features is substantial.By automatically eliminating outliers and redundant data,FSFS reduces computational overhead,resulting in faster training and improved model performance.Overall,the FSFS framework not only optimizes performance but also enhances the interpretability and explainability of data-driven models,thereby facilitating more trustworthy decision-making in AI applications.展开更多
Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic...Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic poses distinct challenges due to the language’s complex morphology,diglossia,and the scarcity of annotated datasets.This paper presents a hybrid approach to Arabic AES by combining text-based,vector-based,and embeddingbased similarity measures to improve essay scoring accuracy while minimizing the training data required.Using a large Arabic essay dataset categorized into thematic groups,the study conducted four experiments to evaluate the impact of feature selection,data size,and model performance.Experiment 1 established a baseline using a non-machine learning approach,selecting top-N correlated features to predict essay scores.The subsequent experiments employed 5-fold cross-validation.Experiment 2 showed that combining embedding-based,text-based,and vector-based features in a Random Forest(RF)model achieved an R2 of 88.92%and an accuracy of 83.3%within a 0.5-point tolerance.Experiment 3 further refined the feature selection process,demonstrating that 19 correlated features yielded optimal results,improving R2 to 88.95%.In Experiment 4,an optimal data efficiency training approach was introduced,where training data portions increased from 5%to 50%.The study found that using just 10%of the data achieved near-peak performance,with an R2 of 85.49%,emphasizing an effective trade-off between performance and computational costs.These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems,especially in low-resource environments,addressing linguistic challenges while ensuring efficient data usage.展开更多
The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure t...The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure to sophisticated cyber threats.The diversity of devices,high data volumes,and real-time operational demands complicate security,requiring not just robust intrusion detection but also effective feature selection for relevance and scalability.Traditional Machine Learning(ML)based Intrusion Detection System(IDS)improves detection but often lacks interpretability,limiting stakeholder trust and timely responses.Moreover,centralized feature selection in conventional IDS compromises data privacy and fails to accommodate the decentralized nature of smart city infrastructures.To address these limitations,this research introduces an Interpretable Federated Learning(FL)based Cyber Intrusion Detection model tailored for smart city applications.The proposed system leverages privacy-preserving feature selection,where each client node independently identifies top-ranked features using ML models integrated with SHAP-based explainability.These local feature subsets are then aggregated at a central server to construct a global model without compromising sensitive data.Furthermore,the global model is enhanced with Explainable AI(XAI)techniques such as SHAP and LIME,offering both global interpretability and instance-level transparency for cyber threat decisions.Experimental results demonstrate that the proposed global model achieves a high detection accuracy of 98.51%,with a significantly low miss rate of 1.49%,outperforming existing models while ensuring explainability,privacy,and scalability across smart city infrastructures.展开更多
Multi-label feature selection(MFS)is a crucial dimensionality reduction technique aimed at identifying informative features associated with multiple labels.However,traditional centralized methods face significant chal...Multi-label feature selection(MFS)is a crucial dimensionality reduction technique aimed at identifying informative features associated with multiple labels.However,traditional centralized methods face significant challenges in privacy-sensitive and distributed settings,often neglecting label dependencies and suffering from low computational efficiency.To address these issues,we introduce a novel framework,Fed-MFSDHBCPSO—federated MFS via dual-layer hybrid breeding cooperative particle swarm optimization algorithm with manifold and sparsity regularization(DHBCPSO-MSR).Leveraging the federated learning paradigm,Fed-MFSDHBCPSO allows clients to perform local feature selection(FS)using DHBCPSO-MSR.Locally selected feature subsets are encrypted with differential privacy(DP)and transmitted to a central server,where they are securely aggregated and refined through secure multi-party computation(SMPC)until global convergence is achieved.Within each client,DHBCPSO-MSR employs a dual-layer FS strategy.The inner layer constructs sample and label similarity graphs,generates Laplacian matrices to capture the manifold structure between samples and labels,and applies L2,1-norm regularization to sparsify the feature subset,yielding an optimized feature weight matrix.The outer layer uses a hybrid breeding cooperative particle swarm optimization algorithm to further refine the feature weight matrix and identify the optimal feature subset.The updated weight matrix is then fed back to the inner layer for further optimization.Comprehensive experiments on multiple real-world multi-label datasets demonstrate that Fed-MFSDHBCPSO consistently outperforms both centralized and federated baseline methods across several key evaluation metrics.展开更多
Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software ...Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software defect prediction can be effectively performed using traditional features,but there are some redundant or irrelevant features in them(the presence or absence of this feature has little effect on the prediction results).These problems can be solved using feature selection.However,existing feature selection methods have shortcomings such as insignificant dimensionality reduction effect and low classification accuracy of the selected optimal feature subset.In order to reduce the impact of these shortcomings,this paper proposes a new feature selection method Cubic TraverseMa Beluga whale optimization algorithm(CTMBWO)based on the improved Beluga whale optimization algorithm(BWO).The goal of this study is to determine how well the CTMBWO can extract the features that are most important for correctly predicting software defects,improve the accuracy of fault prediction,reduce the number of the selected feature and mitigate the risk of overfitting,thereby achieving more efficient resource utilization and better distribution of test workload.The CTMBWO comprises three main stages:preprocessing the dataset,selecting relevant features,and evaluating the classification performance of the model.The novel feature selection method can effectively improve the performance of SDP.This study performs experiments on two software defect datasets(PROMISE,NASA)and shows the method’s classification performance using four detailed evaluation metrics,Accuracy,F1-score,MCC,AUC and Recall.The results indicate that the approach presented in this paper achieves outstanding classification performance on both datasets and has significant improvement over the baseline models.展开更多
文摘In the area of pattern recognition and machine learning,features play a key role in prediction.The famous applications of features are medical imaging,image classification,and name a few more.With the exponential growth of information investments in medical data repositories and health service provision,medical institutions are collecting large volumes of data.These data repositories contain details information essential to support medical diagnostic decisions and also improve patient care quality.On the other hand,this growth also made it difficult to comprehend and utilize data for various purposes.The results of imaging data can become biased because of extraneous features present in larger datasets.Feature selection gives a chance to decrease the number of components in such large datasets.Through selection techniques,ousting the unimportant features and selecting a subset of components that produces prevalent characterization precision.The correct decision to find a good attribute produces a precise grouping model,which enhances learning pace and forecast control.This paper presents a review of feature selection techniques and attributes selection measures for medical imaging.This review is meant to describe feature selection techniques in a medical domainwith their pros and cons and to signify its application in imaging data and data mining algorithms.The review reveals the shortcomings of the existing feature and attributes selection techniques to multi-sourced data.Moreover,this review provides the importance of feature selection for correct classification of medical infections.In the end,critical analysis and future directions are provided.
基金This research work was partially supported by Chiang Mai University.
文摘Fruit diseases seriously affect the production of the agricultural sector,which builds financial pressure on the country’s economy.The manual inspection of fruit diseases is a chaotic process that is both time and cost-consuming since it involves an accurate manual inspection by an expert.Hence,it is essential that an automated computerised approach is developed to recognise fruit diseases based on leaf images.According to the literature,many automated methods have been developed for the recognition of fruit diseases at the early stage.However,these techniques still face some challenges,such as the similar symptoms of different fruit diseases and the selection of irrelevant features.Image processing and deep learning techniques have been extremely successful in the last decade,but there is still room for improvement due to these challenges.Therefore,we propose a novel computerised approach in this work using deep learning and featuring an ant colony optimisation(ACO)based selection.The proposed method consists of four fundamental steps:data augmentation to solve the imbalanced dataset,fine-tuned pretrained deep learning models(NasNetMobile andMobileNet-V2),the fusion of extracted deep features using matrix length,and finally,a selection of the best features using a hybrid ACO and a Neighbourhood Component Analysis(NCA).The best-selected features were eventually passed to many classifiers for final recognition.The experimental process involved an augmented dataset and achieved an average accuracy of 99.7%.Comparison with existing techniques showed that the proposed method was effective.
基金supported by“Human Resources Program in Energy Technology”of the Korea Institute of Energy Technology Evaluation and Planning(KETEP)granted financial resources from the Ministry of Trade,Industry&Energy,Republic of Korea.(No.20204010600090).
文摘Gait recognition is an active research area that uses a walking theme to identify the subject correctly.Human Gait Recognition(HGR)is performed without any cooperation from the individual.However,in practice,it remains a challenging task under diverse walking sequences due to the covariant factors such as normal walking and walking with wearing a coat.Researchers,over the years,have worked on successfully identifying subjects using different techniques,but there is still room for improvement in accuracy due to these covariant factors.This paper proposes an automated model-free framework for human gait recognition in this article.There are a few critical steps in the proposed method.Firstly,optical flow-based motion region esti-mation and dynamic coordinates-based cropping are performed.The second step involves training a fine-tuned pre-trained MobileNetV2 model on both original and optical flow cropped frames;the training has been conducted using static hyperparameters.The third step proposed a fusion technique known as normal distribution serially fusion.In the fourth step,a better optimization algorithm is applied to select the best features,which are then classified using a Bi-Layered neural network.Three publicly available datasets,CASIA A,CASIA B,and CASIA C,were used in the experimental process and obtained average accuracies of 99.6%,91.6%,and 95.02%,respectively.The proposed framework has achieved improved accuracy compared to the other methods.
文摘The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches led to using artificial intelligence(AI)techniques(such as machine learning(ML)and deep learning(DL))to build more efficient and reliable intrusion detection systems(IDSs).However,the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs.Many researchers used data preprocessing techniques such as feature selection and normalization to overcome such issues.While most of these researchers reported the success of these preprocessing techniques on a shallow level,very few studies have been performed on their effects on a wider scale.Furthermore,the performance of an IDS model is subject to not only the utilized preprocessing techniques but also the dataset and the ML/DL algorithm used,which most of the existing studies give little emphasis on.Thus,this study provides an in-depth analysis of feature selection and normalization effects on IDS models built using three IDS datasets:NSL-KDD,UNSW-NB15,and CSE–CIC–IDS2018,and various AI algorithms.A wrapper-based approach,which tends to give superior performance,and min-max normalization methods were used for feature selection and normalization,respectively.Numerous IDS models were implemented using the full and feature-selected copies of the datasets with and without normalization.The models were evaluated using popular evaluation metrics in IDS modeling,intra-and inter-model comparisons were performed between models and with state-of-the-art works.Random forest(RF)models performed better on NSL-KDD and UNSW-NB15 datasets with accuracies of 99.86%and 96.01%,respectively,whereas artificial neural network(ANN)achieved the best accuracy of 95.43%on the CSE–CIC–IDS2018 dataset.The RF models also achieved an excellent performance compared to recent works.The results show that normalization and feature selection positively affect IDS modeling.Furthermore,while feature selection benefits simpler algorithms(such as RF),normalization is more useful for complex algorithms like ANNs and deep neural networks(DNNs),and algorithms such as Naive Bayes are unsuitable for IDS modeling.The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDS than the NSL-KDD dataset.Our findings suggest that prioritizing robust algorithms like RF,alongside complex models such as ANN and DNN,can significantly enhance IDS performance.These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.
基金the Deanship of Scientifc Research at King Khalid University for funding this work through large group Research Project under grant number RGP2/421/45supported via funding from Prince Sattam bin Abdulaziz University project number(PSAU/2024/R/1446)+1 种基金supported by theResearchers Supporting Project Number(UM-DSR-IG-2023-07)Almaarefa University,Riyadh,Saudi Arabia.supported by the Basic Science Research Program through the National Research Foundation of Korea(NRF)funded by the Ministry of Education(No.2021R1F1A1055408).
文摘Machine learning(ML)is increasingly applied for medical image processing with appropriate learning paradigms.These applications include analyzing images of various organs,such as the brain,lung,eye,etc.,to identify specific flaws/diseases for diagnosis.The primary concern of ML applications is the precise selection of flexible image features for pattern detection and region classification.Most of the extracted image features are irrelevant and lead to an increase in computation time.Therefore,this article uses an analytical learning paradigm to design a Congruent Feature Selection Method to select the most relevant image features.This process trains the learning paradigm using similarity and correlation-based features over different textural intensities and pixel distributions.The similarity between the pixels over the various distribution patterns with high indexes is recommended for disease diagnosis.Later,the correlation based on intensity and distribution is analyzed to improve the feature selection congruency.Therefore,the more congruent pixels are sorted in the descending order of the selection,which identifies better regions than the distribution.Now,the learning paradigm is trained using intensity and region-based similarity to maximize the chances of selection.Therefore,the probability of feature selection,regardless of the textures and medical image patterns,is improved.This process enhances the performance of ML applications for different medical image processing.The proposed method improves the accuracy,precision,and training rate by 13.19%,10.69%,and 11.06%,respectively,compared to other models for the selected dataset.The mean error and selection time is also reduced by 12.56%and 13.56%,respectively,compared to the same models and dataset.
文摘Purpose-Single-shot multi-category clothing recognition and retrieval play a crucial role in online searching and offline settlement scenarios.Existing clothing recognition methods based on RGBD clothing images often suffer from high-dimensional feature representations,leading to compromised performance and efficiency.Design/methodology/approach-To address this issue,this paper proposes a novel method called Manifold Embedded Discriminative Feature Selection(MEDFS)to select global and local features,thereby reducing the dimensionality of the feature representation and improving performance.Specifically,by combining three global features and three local features,a low-dimensional embedding is constructed to capture the correlations between features and categories.The MEDFS method designs an optimization framework utilizing manifold mapping and sparse regularization to achieve feature selection.The optimization objective is solved using an alternating iterative strategy,ensuring convergence.Findings-Empirical studies conducted on a publicly available RGBD clothing image dataset demonstrate that the proposed MEDFS method achieves highly competitive clothing classification performance while maintaining efficiency in clothing recognition and retrieval.Originality/value-This paper introduces a novel approach for multi-category clothing recognition and retrieval,incorporating the selection of global and local features.The proposed method holds potential for practical applications in real-world clothing scenarios.
文摘This study provides a systematic investigation into the influence of feature selection methods on cryptocurrency price forecasting models employing technical indicators.In this work,over 130 technical indicators—covering momentum,volatility,volume,and trend-related technical indicators—are subjected to three distinct feature selection approaches.Specifically,mutual information(MI),recursive feature elimination(RFE),and random forest importance(RFI).By extracting an optimal set of 20 predictors,the proposed framework aims to mitigate redundancy and overfitting while enhancing interpretability.These feature subsets are integrated into support vector regression(SVR),Huber regressors,and k-nearest neighbors(KNN)models to forecast the prices of three leading cryptocurrencies—Bitcoin(BTC/USDT),Ethereum(ETH/USDT),and Binance Coin(BNB/USDT)—across horizons ranging from 1 to 20 days.Model evaluation employs the coefficient of determination(R2)and the root mean squared logarithmic error(RMSLE),alongside a walk-forward validation scheme to approximate real-world trading contexts.Empirical results indicate that incorporating momentum and volatility measures substantially improves predictive accuracy,with particularly pronounced effects observed at longer forecast windows.Moreover,indicators related to volume and trend provide incremental benefits in select market conditions.Notably,an 80%–85% reduction in the original feature set frequently maintains or enhances model performance relative to the complete indicator set.These findings highlight the critical role of targeted feature selection in addressing high-dimensional financial data challenges while preserving model robustness.This research advances the field of cryptocurrency forecasting by offering a rigorous comparison of feature selection methods and their effects on multiple digital assets and prediction horizons.The outcomes highlight the importance of dimension-reduction strategies in developing more efficient and resilient forecasting algorithms.Future efforts should incorporate high-frequency data and explore alternative selection techniques to further refine predictive accuracy in this highly volatile domain.
文摘Feature selection(FS)plays a crucial role in medical imaging by reducing dimensionality,improving computational efficiency,and enhancing diagnostic accuracy.Traditional FS techniques,including filter,wrapper,and embedded methods,have been widely used but often struggle with high-dimensional and heterogeneous medical imaging data.Deep learning-based FS methods,particularly Convolutional Neural Networks(CNNs)and autoencoders,have demonstrated superior performance but lack interpretability.Hybrid approaches that combine classical and deep learning techniques have emerged as a promising solution,offering improved accuracy and explainability.Furthermore,integratingmulti-modal imaging data(e.g.,MagneticResonance Imaging(MRI),ComputedTomography(CT),Positron Emission Tomography(PET),and Ultrasound(US))poses additional challenges in FS,necessitating advanced feature fusion strategies.Multi-modal feature fusion combines information fromdifferent imagingmodalities to improve diagnostic accuracy.Recently,quantum computing has gained attention as a revolutionary approach for FS,providing the potential to handle high-dimensional medical data more efficiently.This systematic literature review comprehensively examines classical,Deep Learning(DL),hybrid,and quantum-based FS techniques inmedical imaging.Key outcomes include a structured taxonomy of FS methods,a critical evaluation of their performance across modalities,and identification of core challenges such as computational burden,interpretability,and ethical considerations.Future research directions—such as explainable AI(XAI),federated learning,and quantum-enhanced FS—are also emphasized to bridge the current gaps.This review provides actionable insights for developing scalable,interpretable,and clinically applicable FS methods in the evolving landscape of medical imaging.
文摘Recent advancements in computational and database technologies have led to the exponential growth of large-scale medical datasets,significantly increasing data complexity and dimensionality in medical diagnostics.Efficient feature selection methods are critical for improving diagnostic accuracy,reducing computational costs,and enhancing the interpretability of predictive models.Particle Swarm Optimization(PSO),a widely used metaheuristic inspired by swarm intelligence,has shown considerable promise in feature selection tasks.However,conventional PSO often suffers from premature convergence and limited exploration capabilities,particularly in high-dimensional spaces.To overcome these limitations,this study proposes an enhanced PSO framework incorporating Orthogonal Initializa-tion and a Crossover Operator(OrPSOC).Orthogonal Initialization ensures a diverse and uniformly distributed initial particle population,substantially improving the algorithm’s exploration capability.The Crossover Operator,inspired by genetic algorithms,introduces additional diversity during the search process,effectively mitigating premature convergence and enhancing global search performance.The effectiveness of OrPSOC was rigorously evaluated on three benchmark medical datasets—Colon,Leukemia,and Prostate Tumor.Comparative analyses were conducted against traditional filter-based methods,including Fast Clustering-Based Feature Selection Technique(Fast-C),Minimum Redundancy Maximum Relevance(MinRedMaxRel),and Five-Way Joint Mutual Information(FJMI),as well as prominent metaheuristic algorithms such as standard PSO,Ant Colony Optimization(ACO),Comprehensive Learning Gravitational Search Algorithm(CLGSA),and Fuzzy-Based CLGSA(FCLGSA).Experimental results demonstrated that OrPSOC consistently outperformed these existing methods in terms of classification accuracy,computational efficiency,and result stability,achieving significant improvements even with fewer selected features.Additionally,a sensitivity analysis of the crossover parameter provided valuable insights into parameter tuning and its impact on model performance.These findings highlight the superiority and robustness of the proposed OrPSOC approach for feature selection in medical diagnostic applications and underscore its potential for broader adoption in various high-dimensional,data-driven fields.
基金supported by Ho Chi Minh City Open University,Vietnam under grant number E2024.02.1CD and Suan Sunandha Rajabhat University,Thailand.
文摘The Financial Technology(FinTech)sector has witnessed rapid growth,resulting in increasingly complex and high-volume digital transactions.Although this expansion improves efficiency and accessibility,it also introduces significant vulnerabilities,including fraud,money laundering,and market manipulation.Traditional anomaly detection techniques often fail to capture the relational and dynamic characteristics of financial data.Graph Neural Networks(GNNs),capable of modeling intricate interdependencies among entities,have emerged as a powerful framework for detecting subtle and sophisticated anomalies.However,the high-dimensionality and inherent noise of FinTech datasets demand robust feature selection strategies to improve model scalability,performance,and interpretability.This paper presents a comprehensive survey of GNN-based approaches for anomaly detection in FinTech,with an emphasis on the synergistic role of feature selection.We examine the theoretical foundations of GNNs,review state-of-the-art feature selection techniques,analyze their integration with GNNs,and categorize prevalent anomaly types in FinTech applications.In addition,we discuss practical implementation challenges,highlight representative case studies,and propose future research directions to advance the field of graph-based anomaly detection in financial systems.
基金funded by Universiti Teknologi Malaysia under the UTM RA ICONIC Grant(Q.J130000.4351.09G61).
文摘Advanced Persistent Threats(APTs)represent one of the most complex and dangerous categories of cyber-attacks characterised by their stealthy behaviour,long-term persistence,and ability to bypass traditional detection systems.The complexity of real-world network data poses significant challenges in detection.Machine learning models have shown promise in detecting APTs;however,their performance often suffers when trained on large datasets with redundant or irrelevant features.This study presents a novel,hybrid feature selection method designed to improve APT detection by reducing dimensionality while preserving the informative characteristics of the data.It combines Mutual Information(MI),Symmetric Uncertainty(SU)and Minimum Redundancy Maximum Relevance(mRMR)to enhance feature selection.MI and SU assess feature relevance,while mRMR maximises relevance and minimises redundancy,ensuring that the most impactful features are prioritised.This method addresses redundancy among selected features,improving the overall efficiency and effectiveness of the detection model.Experiments on a real-world APT datasets were conducted to evaluate the proposed method.Multiple classifiers including,Random Forest,Support Vector Machine(SVM),Gradient Boosting,and Neural Networks were used to assess classification performance.The results demonstrate that the proposed feature selection method significantly enhances detection accuracy compared to baseline models trained on the full feature set.The Random Forest algorithm achieved the highest performance,with near-perfect accuracy,precision,recall,and F1 scores(99.97%).The proposed adaptive thresholding algorithm within the selection method allows each classifier to benefit from a reduced and optimised feature space,resulting in improved training and predictive performance.This research offers a scalable and classifier-agnostic solution for dimensionality reduction in cybersecurity applications.
文摘Heart disease prediction is a critical issue in healthcare,where accurate early diagnosis can save lives and reduce healthcare costs.The problem is inherently complex due to the high dimensionality of medical data,irrelevant or redundant features,and the variability in risk factors such as age,lifestyle,andmedical history.These challenges often lead to inefficient and less accuratemodels.Traditional predictionmethodologies face limitations in effectively handling large feature sets and optimizing classification performance,which can result in overfitting poor generalization,and high computational cost.This work proposes a novel classification model for heart disease prediction that addresses these challenges by integrating feature selection through a Genetic Algorithm(GA)with an ensemble deep learning approach optimized using the Tunicate Swarm Algorithm(TSA).GA selects the most relevant features,reducing dimensionality and improvingmodel efficiency.Theselected features are then used to train an ensemble of deep learning models,where the TSA optimizes the weight of each model in the ensemble to enhance prediction accuracy.This hybrid approach addresses key challenges in the field,such as high dimensionality,redundant features,and classification performance,by introducing an efficient feature selection mechanism and optimizing the weighting of deep learning models in the ensemble.These enhancements result in a model that achieves superior accuracy,generalization,and efficiency compared to traditional methods.The proposed model demonstrated notable advancements in both prediction accuracy and computational efficiency over traditionalmodels.Specifically,it achieved an accuracy of 97.5%,a sensitivity of 97.2%,and a specificity of 97.8%.Additionally,with a 60-40 data split and 5-fold cross-validation,the model showed a significant reduction in training time(90 s),memory consumption(950 MB),and CPU usage(80%),highlighting its effectiveness in processing large,complex medical datasets for heart disease prediction.
基金funded by the Deanship of Scientific Research,Vice Presidency for Graduate Studies and Scientific Research,King Faisal University,Saudi Arabia[Grant No.KFU241683].
文摘This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection,unsupervised clustering,and ensemble learning to improve classification performance in financial transaction monitoring systems.The framework is structured into three core layers:(1)feature selection using Recursive Feature Elimination(RFE),Principal Component Analysis(PCA),and Mutual Information(MI)to reduce dimensionality and enhance input relevance;(2)anomaly detection through unsupervised clustering using K-Means,Density-Based Spatial Clustering(DBSCAN),and Hierarchical Clustering to flag suspicious patterns in unlabeled data;and(3)final classification using a voting-based hybrid ensemble of Support Vector Machine(SVM),Random Forest(RF),and Gradient Boosting Classifier(GBC).The experimental evaluation is conducted on a synthetically generated dataset comprising one million financial transactions,with 5% labelled as fraudulent,simulating realistic fraud rates and behavioural features,including transaction time,origin,amount,and geo-location.The proposed model demonstrated a significant improvement over baseline classifiers,achieving an accuracy of 99%,a precision of 99%,a recall of 97%,and an F1-score of 99%.Compared to individual models,it yielded a 9% gain in overall detection accuracy.It reduced the false positive rate to below 3.5%,thereby minimising the operational costs associated with manually reviewing false alerts.The model’s interpretability is enhanced by the integration of Shapley Additive Explanations(SHAP)values for feature importance,supporting transparency and regulatory auditability.These results affirm the practical relevance of the proposed system for deployment in real-time fraud detection scenarios such as credit card transactions,mobile banking,and cross-border payments.The study also highlights future directions,including the deployment of lightweight models and the integration of multimodal data for scalable fraud analytics.
基金financially supported by the Deanship of Scientific Research and Graduate Studies at King Khalid University under research grant number(R.G.P.2/21/46)in part by the Deanship of Scientific Research,Vice Presidency for Graduate Studies and Scientific Research,King Faisal University,Saudi Arabia,under Grant KFU253116.
文摘Today,phishing is an online attack designed to obtain sensitive information such as credit card and bank account numbers,passwords,and usernames.We can find several anti-phishing solutions,such as heuristic detection,virtual similarity detection,black and white lists,and machine learning(ML).However,phishing attempts remain a problem,and establishing an effective anti-phishing strategy is a work in progress.Furthermore,while most antiphishing solutions achieve the highest levels of accuracy on a given dataset,their methods suffer from an increased number of false positives.These methods are ineffective against zero-hour attacks.Phishing sites with a high False Positive Rate(FPR)are considered genuine because they can cause people to lose a lot ofmoney by visiting them.Feature selection is critical when developing phishing detection strategies.Good feature selection helps improve accuracy;however,duplicate features can also increase noise in the dataset and reduce the accuracy of the algorithm.Therefore,a combination of filter-based feature selection methods is proposed to detect phishing attacks,including constant feature removal,duplicate feature removal,quasi-feature removal,correlated feature removal,mutual information extraction,and Analysis of Variance(ANOVA)testing.The technique has been tested with differentMachine Learning classifiers:Random Forest,Artificial Neural Network(ANN),Ada-Boost,Extreme Gradient Boosting(XGBoost),Logistic Regression,Decision Trees,Gradient Boosting Classifiers,Support Vector Machine(SVM),and two types of ensemble models,stacking and majority voting to gain A low false positive rate is achieved.Stacked ensemble classifiers(gradient boosting,randomforest,support vector machine)achieve 1.31%FPR and 98.17%accuracy on Dataset 1,2.81%FPR and Dataset 3 shows 2.81%FPR and 97.61%accuracy,while Dataset 2 shows 3.47%FPR and 96.47%accuracy.
基金supported by the Scientific Research Foundation for High-level Talents of Anhui University of Science and Technology(13230550)the Coal Industry Engineering Research Center of Mining Area Environmental and Disaster Cooperative Monitoring,Anhui University of Science and Technology(KSXTJC202305)+1 种基金the State Key Laboratory of Geodesy and Earth's Dynamics,Innovation Academy for Precision Measurement Science and Technology(SKLGED2023-5-1)the China Postdoctoral Science Foundation(2023M733604).
文摘Soil moisture is a key parameter in the exchange of energy and water between the land surface and the atmosphere.This parameter plays an important role in the dynamics of permafrost on the Qinghai-Xizang Plateau,China,as well as in the related ecological and hydrological processes.However,the region's complex terrain and extreme climatic conditions result in low-accuracy soil moisture estimations using traditional remote sensing techniques.Thus,this study considered parameters of the backscatter coefficient of Sentinel-1A ground range detected(GRD)data,the polarization decomposition parameters of Sentinel-1A single-look complex(SLC)data,the normalized difference vegetation index(NDVI)based on Sentinel-2B data,and the topographic factors based on digital elevation model(DEM)data.By combining these parameters with a machine learning model,we established a feature selection rule.A cumulative importance threshold was derived for feature variables,and those variables that failed to meet the threshold were eliminated based on variations in the coefficient of determination(R^(2))and the unbiased root mean square error(ubRMSE).The eight most influential variables were selected and combined with the CatBoost model for soil moisture inversion,and the SHapley Additive exPlanations(SHAP)method was used to analyze the importance of these variables.The results demonstrated that the optimized model significantly improved the accuracy of soil moisture inversion.Compared to the unfiltered model,the optimal feature combination led to a 0.09 increase in R^(2)and a 0.7%reduction in ubRMSE.Ultimately,the optimized model achieved a R²of 0.87 and an ubRMSE of 5.6%.Analysis revealed that soil particle size had significant impact on soil water retention capacity.The impact of vegetation on the estimated soil moisture on the Qinghai-Xizang Plateau was considerable,demonstrating a significant positive correlation.Moreover,the microtopographical features of hummocks interfered with soil moisture estimation,indicating that such terrain effects warrant increased attention in future studies within the permafrost regions.The developed method not only enhances the accuracy of soil moisture retrieval in the complex terrain of the Qinghai-Xizang Plateau,but also exhibits high computational efficiency(with a relative time reduction of 18.5%),striking an excellent balance between accuracy and efficiency.This approach provides a robust framework for efficient soil moisture monitoring in remote areas with limited ground data,offering critical insights for ecological conservation,water resource management,and climate change adaptation on the Qinghai-Xizang Plateau.
文摘Feature selection(FS)is a pivotal pre-processing step in developing data-driven models,influencing reliability,performance and optimization.Although existing FS techniques can yield high-performance metrics for certain models,they do not invariably guarantee the extraction of the most critical or impactful features.Prior literature underscores the significance of equitable FS practices and has proposed diverse methodologies for the identification of appropriate features.However,the challenge of discerning the most relevant and influential features persists,particularly in the context of the exponential growth and heterogeneity of big data—a challenge that is increasingly salient in modern artificial intelligence(AI)applications.In response,this study introduces an innovative,automated statistical method termed Farea Similarity for Feature Selection(FSFS).The FSFS approach computes a similarity metric for each feature by benchmarking it against the record-wise mean,thereby finding feature dependencies and mitigating the influence of outliers that could potentially distort evaluation outcomes.Features are subsequently ranked according to their similarity scores,with the threshold established at the average similarity score.Notably,lower FSFS values indicate higher similarity and stronger data correlations,whereas higher values suggest lower similarity.The FSFS method is designed not only to yield reliable evaluation metrics but also to reduce data complexity without compromising model performance.Comparative analyses were performed against several established techniques,including Chi-squared(CS),Correlation Coefficient(CC),Genetic Algorithm(GA),Exhaustive Approach,Greedy Stepwise Approach,Gain Ratio,and Filtered Subset Eval,using a variety of datasets such as the Experimental Dataset,Breast Cancer Wisconsin(Original),KDD CUP 1999,NSL-KDD,UNSW-NB15,and Edge-IIoT.In the absence of the FSFS method,the highest classifier accuracies observed were 60.00%,95.13%,97.02%,98.17%,95.86%,and 94.62%for the respective datasets.When the FSFS technique was integrated with data normalization,encoding,balancing,and feature importance selection processes,accuracies improved to 100.00%,97.81%,98.63%,98.94%,94.27%,and 98.46%,respectively.The FSFS method,with a computational complexity of O(fn log n),demonstrates robust scalability and is well-suited for datasets of large size,ensuring efficient processing even when the number of features is substantial.By automatically eliminating outliers and redundant data,FSFS reduces computational overhead,resulting in faster training and improved model performance.Overall,the FSFS framework not only optimizes performance but also enhances the interpretability and explainability of data-driven models,thereby facilitating more trustworthy decision-making in AI applications.
基金funded by Deanship of Graduate studies and Scientific Research at Jouf University under grant No.(DGSSR-2024-02-01264).
文摘Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic poses distinct challenges due to the language’s complex morphology,diglossia,and the scarcity of annotated datasets.This paper presents a hybrid approach to Arabic AES by combining text-based,vector-based,and embeddingbased similarity measures to improve essay scoring accuracy while minimizing the training data required.Using a large Arabic essay dataset categorized into thematic groups,the study conducted four experiments to evaluate the impact of feature selection,data size,and model performance.Experiment 1 established a baseline using a non-machine learning approach,selecting top-N correlated features to predict essay scores.The subsequent experiments employed 5-fold cross-validation.Experiment 2 showed that combining embedding-based,text-based,and vector-based features in a Random Forest(RF)model achieved an R2 of 88.92%and an accuracy of 83.3%within a 0.5-point tolerance.Experiment 3 further refined the feature selection process,demonstrating that 19 correlated features yielded optimal results,improving R2 to 88.95%.In Experiment 4,an optimal data efficiency training approach was introduced,where training data portions increased from 5%to 50%.The study found that using just 10%of the data achieved near-peak performance,with an R2 of 85.49%,emphasizing an effective trade-off between performance and computational costs.These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems,especially in low-resource environments,addressing linguistic challenges while ensuring efficient data usage.
文摘The rapid evolution of smart cities through IoT,cloud computing,and connected infrastructures has significantly enhanced sectors such as transportation,healthcare,energy,and public safety,but also increased exposure to sophisticated cyber threats.The diversity of devices,high data volumes,and real-time operational demands complicate security,requiring not just robust intrusion detection but also effective feature selection for relevance and scalability.Traditional Machine Learning(ML)based Intrusion Detection System(IDS)improves detection but often lacks interpretability,limiting stakeholder trust and timely responses.Moreover,centralized feature selection in conventional IDS compromises data privacy and fails to accommodate the decentralized nature of smart city infrastructures.To address these limitations,this research introduces an Interpretable Federated Learning(FL)based Cyber Intrusion Detection model tailored for smart city applications.The proposed system leverages privacy-preserving feature selection,where each client node independently identifies top-ranked features using ML models integrated with SHAP-based explainability.These local feature subsets are then aggregated at a central server to construct a global model without compromising sensitive data.Furthermore,the global model is enhanced with Explainable AI(XAI)techniques such as SHAP and LIME,offering both global interpretability and instance-level transparency for cyber threat decisions.Experimental results demonstrate that the proposed global model achieves a high detection accuracy of 98.51%,with a significantly low miss rate of 1.49%,outperforming existing models while ensuring explainability,privacy,and scalability across smart city infrastructures.
文摘Multi-label feature selection(MFS)is a crucial dimensionality reduction technique aimed at identifying informative features associated with multiple labels.However,traditional centralized methods face significant challenges in privacy-sensitive and distributed settings,often neglecting label dependencies and suffering from low computational efficiency.To address these issues,we introduce a novel framework,Fed-MFSDHBCPSO—federated MFS via dual-layer hybrid breeding cooperative particle swarm optimization algorithm with manifold and sparsity regularization(DHBCPSO-MSR).Leveraging the federated learning paradigm,Fed-MFSDHBCPSO allows clients to perform local feature selection(FS)using DHBCPSO-MSR.Locally selected feature subsets are encrypted with differential privacy(DP)and transmitted to a central server,where they are securely aggregated and refined through secure multi-party computation(SMPC)until global convergence is achieved.Within each client,DHBCPSO-MSR employs a dual-layer FS strategy.The inner layer constructs sample and label similarity graphs,generates Laplacian matrices to capture the manifold structure between samples and labels,and applies L2,1-norm regularization to sparsify the feature subset,yielding an optimized feature weight matrix.The outer layer uses a hybrid breeding cooperative particle swarm optimization algorithm to further refine the feature weight matrix and identify the optimal feature subset.The updated weight matrix is then fed back to the inner layer for further optimization.Comprehensive experiments on multiple real-world multi-label datasets demonstrate that Fed-MFSDHBCPSO consistently outperforms both centralized and federated baseline methods across several key evaluation metrics.
文摘Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software defect prediction can be effectively performed using traditional features,but there are some redundant or irrelevant features in them(the presence or absence of this feature has little effect on the prediction results).These problems can be solved using feature selection.However,existing feature selection methods have shortcomings such as insignificant dimensionality reduction effect and low classification accuracy of the selected optimal feature subset.In order to reduce the impact of these shortcomings,this paper proposes a new feature selection method Cubic TraverseMa Beluga whale optimization algorithm(CTMBWO)based on the improved Beluga whale optimization algorithm(BWO).The goal of this study is to determine how well the CTMBWO can extract the features that are most important for correctly predicting software defects,improve the accuracy of fault prediction,reduce the number of the selected feature and mitigate the risk of overfitting,thereby achieving more efficient resource utilization and better distribution of test workload.The CTMBWO comprises three main stages:preprocessing the dataset,selecting relevant features,and evaluating the classification performance of the model.The novel feature selection method can effectively improve the performance of SDP.This study performs experiments on two software defect datasets(PROMISE,NASA)and shows the method’s classification performance using four detailed evaluation metrics,Accuracy,F1-score,MCC,AUC and Recall.The results indicate that the approach presented in this paper achieves outstanding classification performance on both datasets and has significant improvement over the baseline models.