As a promising edge learning framework in future 6G networks,federated learning(FL)faces a number of technical challenges due to the heterogeneous network environment and diversified user behaviors.Data imbalance is o...As a promising edge learning framework in future 6G networks,federated learning(FL)faces a number of technical challenges due to the heterogeneous network environment and diversified user behaviors.Data imbalance is one of these challenges that can significantly degrade the learning efficiency.To deal with data imbalance issue,this work proposes a new learning framework,called clustered federated learning with weighted model aggregation(weighted CFL).Compared with traditional FL,our weighted CFL adaptively clusters the participating edge devices based on the cosine similarity of their local gradients at each training iteration,and then performs weighted per-cluster model aggregation.Therein,the similarity threshold for clustering is adaptive over iterations in response to the time-varying divergence of local gradients.Moreover,the weights for per-cluster model aggregation are adjusted according to the data balance feature so as to speed up the convergence rate.Experimental results show that the proposed weighted CFL achieves a faster model convergence rate and greater learning accuracy than benchmark methods under the imbalanced data scenario.展开更多
A network intrusion detection system is critical for cyber security against llegitimate attacks.In terms of feature perspectives,network traffic may include a variety of elements such as attack reference,attack type,a...A network intrusion detection system is critical for cyber security against llegitimate attacks.In terms of feature perspectives,network traffic may include a variety of elements such as attack reference,attack type,a subcategory of attack,host information,malicious scripts,etc.In terms of network perspectives,network traffic may contain an imbalanced number of harmful attacks when compared to normal traffic.It is challenging to identify a specific attack due to complex features and data imbalance issues.To address these issues,this paper proposes an Intrusion Detection System using transformer-based transfer learning for Imbalanced Network Traffic(IDS-INT).IDS-INT uses transformer-based transfer learning to learn feature interactions in both network feature representation and imbalanced data.First,detailed information about each type of attack is gathered from network interaction descriptions,which include network nodes,attack type,reference,host information,etc.Second,the transformer-based transfer learning approach is developed to learn detailed feature representation using their semantic anchors.Third,the Synthetic Minority Oversampling Technique(SMOTE)is implemented to balance abnormal traffic and detect minority attacks.Fourth,the Convolution Neural Network(CNN)model is designed to extract deep features from the balanced network traffic.Finally,the hybrid approach of the CNN-Long Short-Term Memory(CNN-LSTM)model is developed to detect different types of attacks from the deep features.Detailed experiments are conducted to test the proposed approach using three standard datasets,i.e.,UNsWNB15,CIC-IDS2017,and NSL-KDD.An explainable AI approach is implemented to interpret the proposed method and develop a trustable model.展开更多
Nowadays aviation accidents have become one of the major causes of severe injuries and fatalities around the world. This attracts the research community to look into aviation safety by applying data analysis technique...Nowadays aviation accidents have become one of the major causes of severe injuries and fatalities around the world. This attracts the research community to look into aviation safety by applying data analysis techniques based on an advanced machine learning algorithm. An ensemble classification model based on Aviation Safety Reporting System(ASRS) has been proposed to analyze aviation safety targeting the people injured in the system.The ensemble classification model shall contain two modules: the data-driven module consisting of data cleaning, feature selection,and imbalanced data division and reorganization, and the modeldriven module stacked by Random Forest(RF), XGBoost(XGB),and Light Gradient Boosting Machine(LGBM) separately. The results indicate that the ensemble model could solve the data imbalance while vastly improving accuracy. LGBM illustrates higher accuracy and faster run in the analysis of a single model of the ASRS-based imbalanced data, while the ensemble model has the best performance in classification at the same time. The ensemble model proposed for imbalanced data classification can provide a certain reference for similar data processing while improving the safety of civil aviation.展开更多
Recent advancements in machine learning and computer vision enable direct prediction of mechanical properties from microstructure images.The feasibility of this process hinges on the material structure-property relati...Recent advancements in machine learning and computer vision enable direct prediction of mechanical properties from microstructure images.The feasibility of this process hinges on the material structure-property relationship,richness of the dataset,and the choice of machine learning approach.This study investigates the application of a deep learning model to directly predict the yield strength(YS),ultimate tensile strength(UTS),and true stress-strain curve of the cast-forged AZ80 alloys from SEM microstructure images.We manufactured 27 cast-forged AZ80 magnesium alloy components using varied process parameters,creating a diverse dataset of AZ80 microstructures and mechanical properties through their characterization.In addition to predicting magnesium alloy properties,we address challenges related to data imbalance,brightness and contrast variability,and microstructure long-range heterogeneity.We demonstrate that synthetic data oversampling using a denoising diffusion probabilistic model effectively improves the model’s prediction accuracy via balancing the minority classes.A rigorous analysis of the model’s performance shows that the model accurately predicts the YS,UTS,and Ramberg-Osgood equation’s parameters(K and n).In image-out validation,the model achieves average percentage errors of 2.10%(YS),2.15%(UTS),1.50%(K),and 5.47%(n).In class-out validation,the errors are 6.27%,9.58%,4.69%,and 10.24%,respectively.展开更多
Electrocardiogram (ECG) analysis is critical for detecting arrhythmias, but traditional methods struggle with large-scale Electrocardiogram data and rare arrhythmia events in imbalanced datasets. These methods fail to...Electrocardiogram (ECG) analysis is critical for detecting arrhythmias, but traditional methods struggle with large-scale Electrocardiogram data and rare arrhythmia events in imbalanced datasets. These methods fail to perform multi-perspective learning of temporal signals and Electrocardiogram images, nor can they fully extract the latent information within the data, falling short of the accuracy required by clinicians. Therefore, this paper proposes an innovative hybrid multimodal spatiotemporal neural network to address these challenges. The model employs a multimodal data augmentation framework integrating visual and signal-based features to enhance the classification performance of rare arrhythmias in imbalanced datasets. Additionally, the spatiotemporal fusion module incorporates a spatiotemporal graph convolutional network to jointly model temporal and spatial features, uncovering complex dependencies within the Electrocardiogram data and improving the model’s ability to represent complex patterns. In experiments conducted on the MIT-BIH arrhythmia dataset, the model achieved 99.95% accuracy, 99.80% recall, and a 99.78% F1 score. The model was further validated for generalization using the clinical INCART arrhythmia dataset, and the results demonstrated its effectiveness in terms of both generalization and robustness.展开更多
A novel Hilbert-curve is introduced for parallel spatial data partitioning, with consideration of the huge-amount property of spatial information and the variable-length characteristic of vector data items. Based on t...A novel Hilbert-curve is introduced for parallel spatial data partitioning, with consideration of the huge-amount property of spatial information and the variable-length characteristic of vector data items. Based on the improved Hilbert curve, the algorithm can be designed to achieve almost-uniform spatial data partitioning among multiple disks in parallel spatial databases. Thus, the phenomenon of data imbalance can be significantly avoided and search and query efficiency can be enhanced.展开更多
DeepSeek Chinese artificial intelligence(AI)open-source model,has gained a lot of attention due to its economical training and efficient inference.DeepSeek,a model trained on large-scale reinforcement learning without...DeepSeek Chinese artificial intelligence(AI)open-source model,has gained a lot of attention due to its economical training and efficient inference.DeepSeek,a model trained on large-scale reinforcement learning without supervised fine-tuning as a preliminary step,demonstrates remarkable reasoning capabilities of performing a wide range of tasks.DeepSeek is a prominent AI-driven chatbot that assists individuals in learning and enhances responses by generating insightful solutions to inquiries.Users possess divergent viewpoints regarding advanced models like DeepSeek,posting both their merits and shortcomings across several social media platforms.This research presents a new framework for predicting public sentiment to evaluate perceptions of DeepSeek.To transform the unstructured data into a suitable manner,we initially collect DeepSeek-related tweets from Twitter and subsequently implement various preprocessing methods.Subsequently,we annotated the tweets utilizing the Valence Aware Dictionary and sentiment Reasoning(VADER)methodology and the lexicon-driven TextBlob.Next,we classified the attitudes obtained from the purified data utilizing the proposed hybrid model.The proposed hybrid model consists of long-term,shortterm memory(LSTM)and bidirectional gated recurrent units(BiGRU).To strengthen it,we include multi-head attention,regularizer activation,and dropout units to enhance performance.Topic modeling employing KMeans clustering and Latent Dirichlet Allocation(LDA),was utilized to analyze public behavior concerning DeepSeek.The perceptions demonstrate that 82.5%of the people are positive,15.2%negative,and 2.3%neutral using TextBlob,and 82.8%positive,16.1%negative,and 1.2%neutral using the VADER analysis.The slight difference in results ensures that both analyses concur with their overall perceptions and may have distinct views of language peculiarities.The results indicate that the proposed model surpassed previous state-of-the-art approaches.展开更多
Big data streams started becoming ubiquitous in recent years,thanks to rapid generation of massive volumes of data by different applications.It is challenging to apply existing data mining tools and techniques directl...Big data streams started becoming ubiquitous in recent years,thanks to rapid generation of massive volumes of data by different applications.It is challenging to apply existing data mining tools and techniques directly in these big data streams.At the same time,streaming data from several applications results in two major problems such as class imbalance and concept drift.The current research paper presents a new Multi-Objective Metaheuristic Optimization-based Big Data Analytics with Concept Drift Detection(MOMBD-CDD)method on High-Dimensional Streaming Data.The presented MOMBD-CDD model has different operational stages such as pre-processing,CDD,and classification.MOMBD-CDD model overcomes class imbalance problem by Synthetic Minority Over-sampling Technique(SMOTE).In order to determine the oversampling rates and neighboring point values of SMOTE,Glowworm Swarm Optimization(GSO)algorithm is employed.Besides,Statistical Test of Equal Proportions(STEPD),a CDD technique is also utilized.Finally,Bidirectional Long Short-Term Memory(Bi-LSTM)model is applied for classification.In order to improve classification performance and to compute the optimum parameters for Bi-LSTM model,GSO-based hyperparameter tuning process is carried out.The performance of the presented model was evaluated using high dimensional benchmark streaming datasets namely intrusion detection(NSL KDDCup)dataset and ECUE spam dataset.An extensive experimental validation process confirmed the effective outcome of MOMBD-CDD model.The proposed model attained high accuracy of 97.45%and 94.23%on the applied KDDCup99 Dataset and ECUE Spam datasets respectively.展开更多
Modern intrusion detection systems(MIDS)face persistent challenges in coping with the rapid evolution of cyber threats,high-volume network traffic,and imbalanced datasets.Traditional models often lack the robustness a...Modern intrusion detection systems(MIDS)face persistent challenges in coping with the rapid evolution of cyber threats,high-volume network traffic,and imbalanced datasets.Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively.This study introduces an advanced,explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets,which reflects real-world network behavior through a blend of normal and diverse attack classes.The methodology begins with sophisticated data preprocessing,incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions,ensuring standardized and model-ready inputs.Critical dimensionality reduction is achieved via the Harris Hawks Optimization(HHO)algorithm—a nature-inspired metaheuristic modeled on hawks’hunting strategies.HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance.Following feature selection,the SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types.The stacked architecture is then employed,combining the strengths of XGBoost,SVM,and RF as base learners.This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers.The model was evaluated using standard classification metrics:precision,recall,F1-score,and overall accuracy.The best overall performance was recorded with an accuracy of 99.44%for UNSW-NB15,demonstrating the model’s effectiveness.After balancing,the model demonstrated a clear improvement in detecting the attacks.We tested the model on four datasets to show the effectiveness of the proposed approach and performed the ablation study to check the effect of each parameter.Also,the proposed model is computationaly efficient.To support transparency and trust in decision-making,explainable AI(XAI)techniques are incorporated that provides both global and local insight into feature contributions,and offers intuitive visualizations for individual predictions.This makes it suitable for practical deployment in cybersecurity environments that demand both precision and accountability.展开更多
Traditional particle identification methods face timeconsuming,experience-dependent,and poor repeatability challenges in heavy-ion collisions at low and intermediate energies.Researchers urgently need solutions to the...Traditional particle identification methods face timeconsuming,experience-dependent,and poor repeatability challenges in heavy-ion collisions at low and intermediate energies.Researchers urgently need solutions to the dilemma of traditional particle identification methods.This study explores the possibility of applying intelligent learning algorithms to the particle identification of heavy-ion collisions at low and intermediate energies.Multiple intelligent algorithms,including XgBoost and TabNet,were selected to test datasets from the neutron ion multi-detector for reaction-oriented dynamics(NIMROD-ISiS)and Geant4 simulation.Tree-based machine learning algorithms and deep learning algorithms e.g.TabNet show excellent performance and generalization ability.Adding additional data features besides energy deposition can improve the algorithm’s performance when the data distribution is nonuniform.Intelligent learning algorithms can be applied to solve the particle identification problem in heavy-ion collisions at low and intermediate energies.展开更多
Aviation accidents are currently one of the leading causes of significant injuries and deaths worldwide. This entices researchers to investigate aircraft safety using data analysis approaches based on an advanced mach...Aviation accidents are currently one of the leading causes of significant injuries and deaths worldwide. This entices researchers to investigate aircraft safety using data analysis approaches based on an advanced machine learning algorithm.To assess aviation safety and identify the causes of incidents, a classification model with light gradient boosting machine (LGBM)based on the aviation safety reporting system (ASRS) has been developed. It is improved by k-fold cross-validation with hybrid sampling model (HSCV), which may boost classification performance and maintain data balance. The results show that employing the LGBM-HSCV model can significantly improve accuracy while alleviating data imbalance. Vertical comparison with other cross-validation (CV) methods and lateral comparison with different fold times comprise the comparative approach. Aside from the comparison, two further CV approaches based on the improved method in this study are discussed:one with a different sampling and folding order, and the other with more CV. According to the assessment indices with different methods, the LGBMHSCV model proposed here is effective at detecting incident causes. The improved model for imbalanced data categorization proposed may serve as a point of reference for similar data processing, and the model’s accurate identification of civil aviation incident causes can assist to improve civil aviation safety.展开更多
The extent of the peril associated with cancer can be perceivedfrom the lack of treatment, ineffective early diagnosis techniques, and mostimportantly its fatality rate. Globally, cancer is the second leading cause of...The extent of the peril associated with cancer can be perceivedfrom the lack of treatment, ineffective early diagnosis techniques, and mostimportantly its fatality rate. Globally, cancer is the second leading cause ofdeath and among over a hundred types of cancer;lung cancer is the secondmost common type of cancer as well as the leading cause of cancer-relateddeaths. Anyhow, an accurate lung cancer diagnosis in a timely manner canelevate the likelihood of survival by a noticeable margin and medical imagingis a prevalent manner of cancer diagnosis since it is easily accessible to peoplearound the globe. Nonetheless, this is not eminently efficacious consideringhuman inspection of medical images can yield a high false positive rate. Ineffectiveand inefficient diagnosis is a crucial reason for such a high mortalityrate for this malady. However, the conspicuous advancements in deep learningand artificial intelligence have stimulated the development of exceedinglyprecise diagnosis systems. The development and performance of these systemsrely prominently on the data that is used to train these systems. A standardproblem witnessed in publicly available medical image datasets is the severeimbalance of data between different classes. This grave imbalance of data canmake a deep learning model biased towards the dominant class and unableto generalize. This study aims to present an end-to-end convolutional neuralnetwork that can accurately differentiate lung nodules from non-nodules andreduce the false positive rate to a bare minimum. To tackle the problem ofdata imbalance, we oversampled the data by transforming available images inthe minority class. The average false positive rate in the proposed method isa mere 1.5 percent. However, the average false negative rate is 31.76 percent.The proposed neural network has 68.66 percent sensitivity and 98.42 percentspecificity.展开更多
With the advancement of network communication technology,network traffic shows explosive growth.Consequently,network attacks occur frequently.Network intrusion detection systems are still the primary means of detectin...With the advancement of network communication technology,network traffic shows explosive growth.Consequently,network attacks occur frequently.Network intrusion detection systems are still the primary means of detecting attacks.However,two challenges continue to stymie the development of a viable network intrusion detection system:imbalanced training data and new undiscovered attacks.Therefore,this study proposes a unique deep learning-based intrusion detection method.We use two independent in-memory autoencoders trained on regular network traffic and attacks to capture the dynamic relationship between traffic features in the presence of unbalanced training data.Then the original data is fed into the triplet network by forming a triplet with the data reconstructed from the two encoders to train.Finally,the distance relationship between the triples determines whether the traffic is an attack.In addition,to improve the accuracy of detecting unknown attacks,this research proposes an improved triplet loss function that is used to pull the distances of the same class closer while pushing the distances belonging to different classes farther in the learned feature space.The proposed approach’s effectiveness,stability,and significance are evaluated against advanced models on the Android Adware and General Malware Dataset(AAGM17),Knowledge Discovery and Data Mining Cup 1999(KDDCUP99),Canadian Institute for Cybersecurity Group’s Intrusion Detection Evaluation Dataset(CICIDS2017),UNSW-NB15,Network Security Lab-Knowledge Discovery and Data Mining(NSL-KDD)datasets.The achieved results confirmed the superiority of the proposed method for the task of network intrusion detection.展开更多
Cloud Computing(CC)is the preference of all information technology(IT)organizations as it offers pay-per-use based and flexible services to its users.But the privacy and security become the main hindrances in its achi...Cloud Computing(CC)is the preference of all information technology(IT)organizations as it offers pay-per-use based and flexible services to its users.But the privacy and security become the main hindrances in its achievement due to distributed and open architecture that is prone to intruders.Intrusion Detection System(IDS)refers to one of the commonly utilized system for detecting attacks on cloud.IDS proves to be an effective and promising technique,that identifies malicious activities and known threats by observing traffic data in computers,and warnings are given when such threatswere identified.The current mainstream IDS are assisted with machine learning(ML)but have issues of low detection rates and demanded wide feature engineering.This article devises an Enhanced Coyote Optimization with Deep Learning based Intrusion Detection System for Cloud Security(ECODL-IDSCS)model.The ECODL-IDSCS model initially addresses the class imbalance data problem by the use of Adaptive Synthetic(ADASYN)technique.For detecting and classification of intrusions,long short term memory(LSTM)model is exploited.In addition,ECO algorithm is derived to optimally fine tune the hyperparameters related to the LSTM model to enhance its detection efficiency in the cloud environment.Once the presented ECODL-IDSCS model is tested on benchmark dataset,the experimental results show the promising performance of the ECODL-IDSCS model over the existing IDS models.展开更多
Due to the anonymous and contract transfer nature of blockchain cryptocurrencies,they are susceptible to fraudulent incidents such as phishing.This poses a threat to the property security of users and hinders the heal...Due to the anonymous and contract transfer nature of blockchain cryptocurrencies,they are susceptible to fraudulent incidents such as phishing.This poses a threat to the property security of users and hinders the healthy development of the entire blockchain community.While numerous studies have been conducted on identifying cryptocurrency phishing users,there is a lack of research that integrates class imbalance and transaction time characteristics.This paper introduces a novel graph neural network-based account identification model called CT-GCN+,which utilizes blockchain cryptocurrency phishing data.It incorporates an imbalanced data processing module for graphs to consider cryptocurrency transaction time.The model initially extracts time characteristics from the transaction graph using LSTM and Attention mechanisms.These time characteristics are then fused with underlying features,which are subsequently inputted into a combined SMOTE and GCN model for phishing user classification.Experimental results demonstrate that the CT-GCN+model achieves a phishing user identification accuracy of 97.22%and a phishing user identification area under the curve of 96.67%.This paper presents a valuable approach to phishing detection research within the blockchain and cryptocurrency ecosystems.展开更多
ICOs,the initial coin offerings,are a common way to raise funds for blockchain projects.Fraudulent ICO projects not only cause financial losses to investors but also cause a loss of confidence in the blockchain capita...ICOs,the initial coin offerings,are a common way to raise funds for blockchain projects.Fraudulent ICO projects not only cause financial losses to investors but also cause a loss of confidence in the blockchain capital market.Whitepapers are usually the most important information source,so it is feasible to identify fraudulent ICO programs by analyzing whitepapers.However,the fraud samples are difficult to collect,and the classes are imbalanced.In this study,we attempt to solve this problem by extracting linguistic features from the ICO whitepaper and using a variety of cutting-edge machine learning and deep learning algorithms to train the prediction model and attempt to resample,modify the weight and modify the loss function for imbalanced samples.Our optimal method achieves an AUC of 0.94 and an accuracy of 82%,which is better than other traditional standard methods,and the results provide important implications for ICO fraud detection.展开更多
As the use of blockchain for digital payments continues to rise,it becomes susceptible to various malicious attacks.Successfully detecting anomalies within blockchain transactions is essential for bolstering trust in ...As the use of blockchain for digital payments continues to rise,it becomes susceptible to various malicious attacks.Successfully detecting anomalies within blockchain transactions is essential for bolstering trust in digital payments.However,the task of anomaly detection in blockchain transaction data is challenging due to the infrequent occurrence of illicit transactions.Although several studies have been conducted in the field,a limitation persists:the lack of explanations for the model’s predictions.This study seeks to overcome this limitation by integrating explainable artificial intelligence(XAI)techniques and anomaly rules into tree-based ensemble classifiers for detecting anomalous Bitcoin transactions.The shapley additive explanation(SHAP)method is employed to measure the contribution of each feature,and it is compatible with ensemble models.Moreover,we present rules for interpreting whether a Bitcoin transaction is anomalous or not.Additionally,we introduce an under-sampling algorithm named XGBCLUS,designed to balance anomalous and non-anomalous transaction data.This algorithm is compared against other commonly used under-sampling and over-sampling techniques.Finally,the outcomes of various tree-based single classifiers are compared with those of stacking and voting ensemble classifiers.Our experimental results demonstrate that:(i)XGBCLUS enhances true positive rate(TPR)and receiver operating characteristic-area under curve(ROC-AUC)scores compared to state-of-the-art under-sampling and over-sampling techniques,and(ii)our proposed ensemble classifiers outperform traditional single tree-based machine learning classifiers in terms of accuracy,TPR,and false positive rate(FPR)scores.展开更多
A credit risk prediction model named KM-ADASYN-TL-FLLightGBM(KADT-FLightGBM)is proposed in this study.Firstly,to overcome the limitation of traditional sampling methods in dealing with imbalanced datasets,an improved ...A credit risk prediction model named KM-ADASYN-TL-FLLightGBM(KADT-FLightGBM)is proposed in this study.Firstly,to overcome the limitation of traditional sampling methods in dealing with imbalanced datasets,an improved ADASYN sampling with K-means clustering algorithm is constructed.Moreover,the Tomek Links method is used to filter the generated samples.Secondly,an utilized an optimized LightGBM algorithm with the Focal Loss is employed to training the model using the datasets obtained by the improved ADASYN sampling.Finally,the comparative analysis between the ensemble model and other different sampling methodologies is conducted on the Lending Club dataset.The results demonstrate that the proposed model effectively minimizes the misclassification of minority classes in credit risk prediction and can be used as a reference for similar studies.展开更多
Recently,rapid serial visual presentation(RSVP),as a new event-related potential(ERP)paradigm,has become one of the most popular forms in electroencephalogram signal processing technologies.Several improvement approac...Recently,rapid serial visual presentation(RSVP),as a new event-related potential(ERP)paradigm,has become one of the most popular forms in electroencephalogram signal processing technologies.Several improvement approaches have been proposed to improve the performance of RSVP analysis.In brain-computer interface systems based on RSVP,the family of approaches that do not depend on training specific parameters is essential.The participating teams proposed several effective training-free frameworks of algorithms in the ERP competition of the BCI Controlled Robot Contest in World Robot Contest 2021.This paper discusses the effectiveness of various approaches in improving the performance of the system without requiring training and suggests how to apply these approaches in a practical system.First,appropriate preprocessing techniques will greatly improve the results.Then,the non-deep learning algorithm may be more stable than the deep learning approach.Furthermore,ensemble learning can make the model more stable and robust.展开更多
文摘As a promising edge learning framework in future 6G networks,federated learning(FL)faces a number of technical challenges due to the heterogeneous network environment and diversified user behaviors.Data imbalance is one of these challenges that can significantly degrade the learning efficiency.To deal with data imbalance issue,this work proposes a new learning framework,called clustered federated learning with weighted model aggregation(weighted CFL).Compared with traditional FL,our weighted CFL adaptively clusters the participating edge devices based on the cosine similarity of their local gradients at each training iteration,and then performs weighted per-cluster model aggregation.Therein,the similarity threshold for clustering is adaptive over iterations in response to the time-varying divergence of local gradients.Moreover,the weights for per-cluster model aggregation are adjusted according to the data balance feature so as to speed up the convergence rate.Experimental results show that the proposed weighted CFL achieves a faster model convergence rate and greater learning accuracy than benchmark methods under the imbalanced data scenario.
文摘A network intrusion detection system is critical for cyber security against llegitimate attacks.In terms of feature perspectives,network traffic may include a variety of elements such as attack reference,attack type,a subcategory of attack,host information,malicious scripts,etc.In terms of network perspectives,network traffic may contain an imbalanced number of harmful attacks when compared to normal traffic.It is challenging to identify a specific attack due to complex features and data imbalance issues.To address these issues,this paper proposes an Intrusion Detection System using transformer-based transfer learning for Imbalanced Network Traffic(IDS-INT).IDS-INT uses transformer-based transfer learning to learn feature interactions in both network feature representation and imbalanced data.First,detailed information about each type of attack is gathered from network interaction descriptions,which include network nodes,attack type,reference,host information,etc.Second,the transformer-based transfer learning approach is developed to learn detailed feature representation using their semantic anchors.Third,the Synthetic Minority Oversampling Technique(SMOTE)is implemented to balance abnormal traffic and detect minority attacks.Fourth,the Convolution Neural Network(CNN)model is designed to extract deep features from the balanced network traffic.Finally,the hybrid approach of the CNN-Long Short-Term Memory(CNN-LSTM)model is developed to detect different types of attacks from the deep features.Detailed experiments are conducted to test the proposed approach using three standard datasets,i.e.,UNsWNB15,CIC-IDS2017,and NSL-KDD.An explainable AI approach is implemented to interpret the proposed method and develop a trustable model.
基金Supported by the Joint Fund of National Natural Science Foundation of China and Civil Aviation Administration of China (U1833110)。
文摘Nowadays aviation accidents have become one of the major causes of severe injuries and fatalities around the world. This attracts the research community to look into aviation safety by applying data analysis techniques based on an advanced machine learning algorithm. An ensemble classification model based on Aviation Safety Reporting System(ASRS) has been proposed to analyze aviation safety targeting the people injured in the system.The ensemble classification model shall contain two modules: the data-driven module consisting of data cleaning, feature selection,and imbalanced data division and reorganization, and the modeldriven module stacked by Random Forest(RF), XGBoost(XGB),and Light Gradient Boosting Machine(LGBM) separately. The results indicate that the ensemble model could solve the data imbalance while vastly improving accuracy. LGBM illustrates higher accuracy and faster run in the analysis of a single model of the ASRS-based imbalanced data, while the ensemble model has the best performance in classification at the same time. The ensemble model proposed for imbalanced data classification can provide a certain reference for similar data processing while improving the safety of civil aviation.
基金the financial contribution from the Natural Sciences and Engineering Research Council of Canada (NSERC), through their Strategic Partnership Grant STPGP 521551in part by support provided by the Digital Research Alliance of Canada (alliance can.ca)
文摘Recent advancements in machine learning and computer vision enable direct prediction of mechanical properties from microstructure images.The feasibility of this process hinges on the material structure-property relationship,richness of the dataset,and the choice of machine learning approach.This study investigates the application of a deep learning model to directly predict the yield strength(YS),ultimate tensile strength(UTS),and true stress-strain curve of the cast-forged AZ80 alloys from SEM microstructure images.We manufactured 27 cast-forged AZ80 magnesium alloy components using varied process parameters,creating a diverse dataset of AZ80 microstructures and mechanical properties through their characterization.In addition to predicting magnesium alloy properties,we address challenges related to data imbalance,brightness and contrast variability,and microstructure long-range heterogeneity.We demonstrate that synthetic data oversampling using a denoising diffusion probabilistic model effectively improves the model’s prediction accuracy via balancing the minority classes.A rigorous analysis of the model’s performance shows that the model accurately predicts the YS,UTS,and Ramberg-Osgood equation’s parameters(K and n).In image-out validation,the model achieves average percentage errors of 2.10%(YS),2.15%(UTS),1.50%(K),and 5.47%(n).In class-out validation,the errors are 6.27%,9.58%,4.69%,and 10.24%,respectively.
基金supported by The Henan Province Science and Technology Research Project(242102211046)the Key Scientific Research Project of Higher Education Institutions in Henan Province(25A520039)+1 种基金theNatural Science Foundation project of Zhongyuan Institute of Technology(K2025YB011)the Zhongyuan University of Technology Graduate Education and Teaching Reform Research Project(JG202424).
文摘Electrocardiogram (ECG) analysis is critical for detecting arrhythmias, but traditional methods struggle with large-scale Electrocardiogram data and rare arrhythmia events in imbalanced datasets. These methods fail to perform multi-perspective learning of temporal signals and Electrocardiogram images, nor can they fully extract the latent information within the data, falling short of the accuracy required by clinicians. Therefore, this paper proposes an innovative hybrid multimodal spatiotemporal neural network to address these challenges. The model employs a multimodal data augmentation framework integrating visual and signal-based features to enhance the classification performance of rare arrhythmias in imbalanced datasets. Additionally, the spatiotemporal fusion module incorporates a spatiotemporal graph convolutional network to jointly model temporal and spatial features, uncovering complex dependencies within the Electrocardiogram data and improving the model’s ability to represent complex patterns. In experiments conducted on the MIT-BIH arrhythmia dataset, the model achieved 99.95% accuracy, 99.80% recall, and a 99.78% F1 score. The model was further validated for generalization using the clinical INCART arrhythmia dataset, and the results demonstrated its effectiveness in terms of both generalization and robustness.
基金Funded by the National 863 Program of China (No. 2005AA113150), and the National Natural Science Foundation of China (No.40701158).
文摘A novel Hilbert-curve is introduced for parallel spatial data partitioning, with consideration of the huge-amount property of spatial information and the variable-length characteristic of vector data items. Based on the improved Hilbert curve, the algorithm can be designed to achieve almost-uniform spatial data partitioning among multiple disks in parallel spatial databases. Thus, the phenomenon of data imbalance can be significantly avoided and search and query efficiency can be enhanced.
文摘DeepSeek Chinese artificial intelligence(AI)open-source model,has gained a lot of attention due to its economical training and efficient inference.DeepSeek,a model trained on large-scale reinforcement learning without supervised fine-tuning as a preliminary step,demonstrates remarkable reasoning capabilities of performing a wide range of tasks.DeepSeek is a prominent AI-driven chatbot that assists individuals in learning and enhances responses by generating insightful solutions to inquiries.Users possess divergent viewpoints regarding advanced models like DeepSeek,posting both their merits and shortcomings across several social media platforms.This research presents a new framework for predicting public sentiment to evaluate perceptions of DeepSeek.To transform the unstructured data into a suitable manner,we initially collect DeepSeek-related tweets from Twitter and subsequently implement various preprocessing methods.Subsequently,we annotated the tweets utilizing the Valence Aware Dictionary and sentiment Reasoning(VADER)methodology and the lexicon-driven TextBlob.Next,we classified the attitudes obtained from the purified data utilizing the proposed hybrid model.The proposed hybrid model consists of long-term,shortterm memory(LSTM)and bidirectional gated recurrent units(BiGRU).To strengthen it,we include multi-head attention,regularizer activation,and dropout units to enhance performance.Topic modeling employing KMeans clustering and Latent Dirichlet Allocation(LDA),was utilized to analyze public behavior concerning DeepSeek.The perceptions demonstrate that 82.5%of the people are positive,15.2%negative,and 2.3%neutral using TextBlob,and 82.8%positive,16.1%negative,and 1.2%neutral using the VADER analysis.The slight difference in results ensures that both analyses concur with their overall perceptions and may have distinct views of language peculiarities.The results indicate that the proposed model surpassed previous state-of-the-art approaches.
文摘Big data streams started becoming ubiquitous in recent years,thanks to rapid generation of massive volumes of data by different applications.It is challenging to apply existing data mining tools and techniques directly in these big data streams.At the same time,streaming data from several applications results in two major problems such as class imbalance and concept drift.The current research paper presents a new Multi-Objective Metaheuristic Optimization-based Big Data Analytics with Concept Drift Detection(MOMBD-CDD)method on High-Dimensional Streaming Data.The presented MOMBD-CDD model has different operational stages such as pre-processing,CDD,and classification.MOMBD-CDD model overcomes class imbalance problem by Synthetic Minority Over-sampling Technique(SMOTE).In order to determine the oversampling rates and neighboring point values of SMOTE,Glowworm Swarm Optimization(GSO)algorithm is employed.Besides,Statistical Test of Equal Proportions(STEPD),a CDD technique is also utilized.Finally,Bidirectional Long Short-Term Memory(Bi-LSTM)model is applied for classification.In order to improve classification performance and to compute the optimum parameters for Bi-LSTM model,GSO-based hyperparameter tuning process is carried out.The performance of the presented model was evaluated using high dimensional benchmark streaming datasets namely intrusion detection(NSL KDDCup)dataset and ECUE spam dataset.An extensive experimental validation process confirmed the effective outcome of MOMBD-CDD model.The proposed model attained high accuracy of 97.45%and 94.23%on the applied KDDCup99 Dataset and ECUE Spam datasets respectively.
基金funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2025R104)Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Modern intrusion detection systems(MIDS)face persistent challenges in coping with the rapid evolution of cyber threats,high-volume network traffic,and imbalanced datasets.Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively.This study introduces an advanced,explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets,which reflects real-world network behavior through a blend of normal and diverse attack classes.The methodology begins with sophisticated data preprocessing,incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions,ensuring standardized and model-ready inputs.Critical dimensionality reduction is achieved via the Harris Hawks Optimization(HHO)algorithm—a nature-inspired metaheuristic modeled on hawks’hunting strategies.HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance.Following feature selection,the SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types.The stacked architecture is then employed,combining the strengths of XGBoost,SVM,and RF as base learners.This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers.The model was evaluated using standard classification metrics:precision,recall,F1-score,and overall accuracy.The best overall performance was recorded with an accuracy of 99.44%for UNSW-NB15,demonstrating the model’s effectiveness.After balancing,the model demonstrated a clear improvement in detecting the attacks.We tested the model on four datasets to show the effectiveness of the proposed approach and performed the ablation study to check the effect of each parameter.Also,the proposed model is computationaly efficient.To support transparency and trust in decision-making,explainable AI(XAI)techniques are incorporated that provides both global and local insight into feature contributions,and offers intuitive visualizations for individual predictions.This makes it suitable for practical deployment in cybersecurity environments that demand both precision and accountability.
基金This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences(No.XDB34030000)the National Key Research and Development Program of China(No.2022YFA1602404)+1 种基金the National Natural Science Foundation(No.U1832129)the Youth Innovation Promotion Association CAS(No.2017309).
文摘Traditional particle identification methods face timeconsuming,experience-dependent,and poor repeatability challenges in heavy-ion collisions at low and intermediate energies.Researchers urgently need solutions to the dilemma of traditional particle identification methods.This study explores the possibility of applying intelligent learning algorithms to the particle identification of heavy-ion collisions at low and intermediate energies.Multiple intelligent algorithms,including XgBoost and TabNet,were selected to test datasets from the neutron ion multi-detector for reaction-oriented dynamics(NIMROD-ISiS)and Geant4 simulation.Tree-based machine learning algorithms and deep learning algorithms e.g.TabNet show excellent performance and generalization ability.Adding additional data features besides energy deposition can improve the algorithm’s performance when the data distribution is nonuniform.Intelligent learning algorithms can be applied to solve the particle identification problem in heavy-ion collisions at low and intermediate energies.
基金supported by the National Natural Science Foundation of China Civil Aviation Joint Fund (U1833110)Research on the Dual Prevention Mechanism and Intelligent Management Technology f or Civil Aviation Safety Risks (YK23-03-05)。
文摘Aviation accidents are currently one of the leading causes of significant injuries and deaths worldwide. This entices researchers to investigate aircraft safety using data analysis approaches based on an advanced machine learning algorithm.To assess aviation safety and identify the causes of incidents, a classification model with light gradient boosting machine (LGBM)based on the aviation safety reporting system (ASRS) has been developed. It is improved by k-fold cross-validation with hybrid sampling model (HSCV), which may boost classification performance and maintain data balance. The results show that employing the LGBM-HSCV model can significantly improve accuracy while alleviating data imbalance. Vertical comparison with other cross-validation (CV) methods and lateral comparison with different fold times comprise the comparative approach. Aside from the comparison, two further CV approaches based on the improved method in this study are discussed:one with a different sampling and folding order, and the other with more CV. According to the assessment indices with different methods, the LGBMHSCV model proposed here is effective at detecting incident causes. The improved model for imbalanced data categorization proposed may serve as a point of reference for similar data processing, and the model’s accurate identification of civil aviation incident causes can assist to improve civil aviation safety.
基金supported this research through the National Research Foundation of Korea (NRF)funded by the Ministry of Science,ICT (2019M3F2A1073387)this work was supported by the Institute for Information&communications Technology Promotion (IITP) (NO.2022-0-00980Cooperative Intelligence Framework of Scene Perception for Autonomous IoT Device).
文摘The extent of the peril associated with cancer can be perceivedfrom the lack of treatment, ineffective early diagnosis techniques, and mostimportantly its fatality rate. Globally, cancer is the second leading cause ofdeath and among over a hundred types of cancer;lung cancer is the secondmost common type of cancer as well as the leading cause of cancer-relateddeaths. Anyhow, an accurate lung cancer diagnosis in a timely manner canelevate the likelihood of survival by a noticeable margin and medical imagingis a prevalent manner of cancer diagnosis since it is easily accessible to peoplearound the globe. Nonetheless, this is not eminently efficacious consideringhuman inspection of medical images can yield a high false positive rate. Ineffectiveand inefficient diagnosis is a crucial reason for such a high mortalityrate for this malady. However, the conspicuous advancements in deep learningand artificial intelligence have stimulated the development of exceedinglyprecise diagnosis systems. The development and performance of these systemsrely prominently on the data that is used to train these systems. A standardproblem witnessed in publicly available medical image datasets is the severeimbalance of data between different classes. This grave imbalance of data canmake a deep learning model biased towards the dominant class and unableto generalize. This study aims to present an end-to-end convolutional neuralnetwork that can accurately differentiate lung nodules from non-nodules andreduce the false positive rate to a bare minimum. To tackle the problem ofdata imbalance, we oversampled the data by transforming available images inthe minority class. The average false positive rate in the proposed method isa mere 1.5 percent. However, the average false negative rate is 31.76 percent.The proposed neural network has 68.66 percent sensitivity and 98.42 percentspecificity.
基金support of National Natural Science Foundation of China(U1936213)Yunnan Provincial Natural Science Foundation,“Robustness analysis method and coupling mechanism of complex coupled network system”(202101AT070167)Yunnan Provincial Major Science and Technology Program,“Construction and application demonstration of intelligent diagnosis and treatment system for childhood diseases based on intelligent medical platform”(202102AA100021).
文摘With the advancement of network communication technology,network traffic shows explosive growth.Consequently,network attacks occur frequently.Network intrusion detection systems are still the primary means of detecting attacks.However,two challenges continue to stymie the development of a viable network intrusion detection system:imbalanced training data and new undiscovered attacks.Therefore,this study proposes a unique deep learning-based intrusion detection method.We use two independent in-memory autoencoders trained on regular network traffic and attacks to capture the dynamic relationship between traffic features in the presence of unbalanced training data.Then the original data is fed into the triplet network by forming a triplet with the data reconstructed from the two encoders to train.Finally,the distance relationship between the triples determines whether the traffic is an attack.In addition,to improve the accuracy of detecting unknown attacks,this research proposes an improved triplet loss function that is used to pull the distances of the same class closer while pushing the distances belonging to different classes farther in the learned feature space.The proposed approach’s effectiveness,stability,and significance are evaluated against advanced models on the Android Adware and General Malware Dataset(AAGM17),Knowledge Discovery and Data Mining Cup 1999(KDDCUP99),Canadian Institute for Cybersecurity Group’s Intrusion Detection Evaluation Dataset(CICIDS2017),UNSW-NB15,Network Security Lab-Knowledge Discovery and Data Mining(NSL-KDD)datasets.The achieved results confirmed the superiority of the proposed method for the task of network intrusion detection.
基金The Deanship of Scientific Research(DSR)at King Abdulaziz University(KAU),Jeddah,Saudi Arabia has funded this project,under grant no.KEP-1-120-42.
文摘Cloud Computing(CC)is the preference of all information technology(IT)organizations as it offers pay-per-use based and flexible services to its users.But the privacy and security become the main hindrances in its achievement due to distributed and open architecture that is prone to intruders.Intrusion Detection System(IDS)refers to one of the commonly utilized system for detecting attacks on cloud.IDS proves to be an effective and promising technique,that identifies malicious activities and known threats by observing traffic data in computers,and warnings are given when such threatswere identified.The current mainstream IDS are assisted with machine learning(ML)but have issues of low detection rates and demanded wide feature engineering.This article devises an Enhanced Coyote Optimization with Deep Learning based Intrusion Detection System for Cloud Security(ECODL-IDSCS)model.The ECODL-IDSCS model initially addresses the class imbalance data problem by the use of Adaptive Synthetic(ADASYN)technique.For detecting and classification of intrusions,long short term memory(LSTM)model is exploited.In addition,ECO algorithm is derived to optimally fine tune the hyperparameters related to the LSTM model to enhance its detection efficiency in the cloud environment.Once the presented ECODL-IDSCS model is tested on benchmark dataset,the experimental results show the promising performance of the ECODL-IDSCS model over the existing IDS models.
文摘Due to the anonymous and contract transfer nature of blockchain cryptocurrencies,they are susceptible to fraudulent incidents such as phishing.This poses a threat to the property security of users and hinders the healthy development of the entire blockchain community.While numerous studies have been conducted on identifying cryptocurrency phishing users,there is a lack of research that integrates class imbalance and transaction time characteristics.This paper introduces a novel graph neural network-based account identification model called CT-GCN+,which utilizes blockchain cryptocurrency phishing data.It incorporates an imbalanced data processing module for graphs to consider cryptocurrency transaction time.The model initially extracts time characteristics from the transaction graph using LSTM and Attention mechanisms.These time characteristics are then fused with underlying features,which are subsequently inputted into a combined SMOTE and GCN model for phishing user classification.Experimental results demonstrate that the CT-GCN+model achieves a phishing user identification accuracy of 97.22%and a phishing user identification area under the curve of 96.67%.This paper presents a valuable approach to phishing detection research within the blockchain and cryptocurrency ecosystems.
基金supported by the National Natural Science Foundation of China under Grant No.61907042Beijing Natural Science Foundation under Grant No.4194090.
文摘ICOs,the initial coin offerings,are a common way to raise funds for blockchain projects.Fraudulent ICO projects not only cause financial losses to investors but also cause a loss of confidence in the blockchain capital market.Whitepapers are usually the most important information source,so it is feasible to identify fraudulent ICO programs by analyzing whitepapers.However,the fraud samples are difficult to collect,and the classes are imbalanced.In this study,we attempt to solve this problem by extracting linguistic features from the ICO whitepaper and using a variety of cutting-edge machine learning and deep learning algorithms to train the prediction model and attempt to resample,modify the weight and modify the loss function for imbalanced samples.Our optimal method achieves an AUC of 0.94 and an accuracy of 82%,which is better than other traditional standard methods,and the results provide important implications for ICO fraud detection.
文摘As the use of blockchain for digital payments continues to rise,it becomes susceptible to various malicious attacks.Successfully detecting anomalies within blockchain transactions is essential for bolstering trust in digital payments.However,the task of anomaly detection in blockchain transaction data is challenging due to the infrequent occurrence of illicit transactions.Although several studies have been conducted in the field,a limitation persists:the lack of explanations for the model’s predictions.This study seeks to overcome this limitation by integrating explainable artificial intelligence(XAI)techniques and anomaly rules into tree-based ensemble classifiers for detecting anomalous Bitcoin transactions.The shapley additive explanation(SHAP)method is employed to measure the contribution of each feature,and it is compatible with ensemble models.Moreover,we present rules for interpreting whether a Bitcoin transaction is anomalous or not.Additionally,we introduce an under-sampling algorithm named XGBCLUS,designed to balance anomalous and non-anomalous transaction data.This algorithm is compared against other commonly used under-sampling and over-sampling techniques.Finally,the outcomes of various tree-based single classifiers are compared with those of stacking and voting ensemble classifiers.Our experimental results demonstrate that:(i)XGBCLUS enhances true positive rate(TPR)and receiver operating characteristic-area under curve(ROC-AUC)scores compared to state-of-the-art under-sampling and over-sampling techniques,and(ii)our proposed ensemble classifiers outperform traditional single tree-based machine learning classifiers in terms of accuracy,TPR,and false positive rate(FPR)scores.
基金supported by the National Natural Science Foundation of China(Nos.71503108 and 62077029)CCF-Huawei Innovation Research Program Grant(No.CCF-HuaweiFM202209)Research and Practice Innovation Project of Jiangsu Normal University(No.2022XKT1540).
文摘A credit risk prediction model named KM-ADASYN-TL-FLLightGBM(KADT-FLightGBM)is proposed in this study.Firstly,to overcome the limitation of traditional sampling methods in dealing with imbalanced datasets,an improved ADASYN sampling with K-means clustering algorithm is constructed.Moreover,the Tomek Links method is used to filter the generated samples.Secondly,an utilized an optimized LightGBM algorithm with the Focal Loss is employed to training the model using the datasets obtained by the improved ADASYN sampling.Finally,the comparative analysis between the ensemble model and other different sampling methodologies is conducted on the Lending Club dataset.The results demonstrate that the proposed model effectively minimizes the misclassification of minority classes in credit risk prediction and can be used as a reference for similar studies.
基金This research was supported by the National Key Research and Development Program of China(Grant No.2021ZD0201303)the Technology Innovation Project of Hubei Province of China(Grant No.2019AEA171)the Hubei Province Funds for Distinguished Young Scholars(Grant No.2020CFA050).
文摘Recently,rapid serial visual presentation(RSVP),as a new event-related potential(ERP)paradigm,has become one of the most popular forms in electroencephalogram signal processing technologies.Several improvement approaches have been proposed to improve the performance of RSVP analysis.In brain-computer interface systems based on RSVP,the family of approaches that do not depend on training specific parameters is essential.The participating teams proposed several effective training-free frameworks of algorithms in the ERP competition of the BCI Controlled Robot Contest in World Robot Contest 2021.This paper discusses the effectiveness of various approaches in improving the performance of the system without requiring training and suggests how to apply these approaches in a practical system.First,appropriate preprocessing techniques will greatly improve the results.Then,the non-deep learning algorithm may be more stable than the deep learning approach.Furthermore,ensemble learning can make the model more stable and robust.