This paper explores the synergistic effect of a model combining Elastic Net and Random Forest in online fraud detection. The study selects a public network dataset containing 1,781 records, divides it into 70% for training and 30% for validation, and analyses the correlation between features using a correlation matrix. The experimental results show that Elastic Net feature selection generally outperforms PCA across all models, especially when combined with the Random Forest and XGBoost models; the Elastic Net + Random Forest model achieves the highest accuracy of 0.968 and an AUC of 0.983, while Kappa and MCC reach 0.839 and 0.844 respectively, showing very high consistency and correlation. This indicates that combining Elastic Net feature selection with a Random Forest model offers significant performance advantages in online fraud detection.
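As a rough illustration of the pipeline described above, the following sketch chains Elastic Net-based feature selection into a Random Forest classifier with scikit-learn. The synthetic dataset, the 70/30 split, and all hyperparameters are illustrative assumptions, not the study's actual setup.

```python
# Hypothetical sketch of an Elastic Net -> Random Forest pipeline.
# All data and hyperparameters here are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=1781, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Elastic Net mixes L1 and L2 penalties; features with above-average
# coefficient magnitude are kept, the rest are dropped.
selector = SelectFromModel(ElasticNet(alpha=0.01, l1_ratio=0.5),
                           threshold="mean").fit(X_tr, y_tr)
X_sel_tr = selector.transform(X_tr)
X_sel_te = selector.transform(X_te)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_sel_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_sel_te))
auc = roc_auc_score(y_te, rf.predict_proba(X_sel_te)[:, 1])
print(acc, auc)
```

On this synthetic task the selected-feature forest typically scores well above chance, which is the qualitative behavior the abstract reports.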
In an era of abundant data, machine learning has emerged as a pivotal tool for deciphering and managing the excess of information. This paper presents a comprehensive analysis of machine learning algorithms, focusing on the structure and efficacy of random forests in mitigating overfitting, a prevalent issue in decision tree models. It also introduces a novel approach to enhancing decision tree performance through an optimized pruning method called Adaptive Cross-Validated Alpha CCP (ACV-CCP). This method refines traditional cost-complexity pruning by streamlining the selection of the alpha parameter, leveraging cross-validation within the pruning process to achieve a reliable, computationally efficient alpha selection that generalizes well to unseen data. By enhancing computational efficiency and balancing model complexity, ACV-CCP allows decision trees to maintain predictive accuracy while minimizing overfitting, effectively narrowing the performance gap between decision trees and random forests. Our findings illustrate how ACV-CCP contributes to the robustness and applicability of decision trees, providing a valuable perspective on achieving computationally efficient, well-generalized machine learning models.
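The core idea of cross-validating the pruning alpha can be sketched with scikit-learn's cost-complexity pruning path. This is a plain cross-validated search over the path, not the paper's exact ACV-CCP procedure, and the data and settings are assumed.

```python
# Sketch: pick a cost-complexity pruning alpha by cross-validation.
# (A generic CV search over the pruning path; ACV-CCP itself may differ.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=1)

# The pruning path lists the effective alphas at which subtrees collapse.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the alpha that prunes the whole tree

# Score each candidate alpha by 5-fold CV and keep the best.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=1),
                          X, y, cv=5).mean() for a in alphas]
best_alpha = float(alphas[int(np.argmax(scores))])
print(best_alpha, max(scores))
```

The selected alpha trades tree size against held-out accuracy, which is the generalization behavior ACV-CCP aims to streamline.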
Leaf area index (LAI) is a key parameter for describing vegetation structures and is closely associated with vegetative photosynthesis and energy balance. The accurate retrieval of LAI is important when modeling biophysical processes of vegetation and the productivity of earth systems. The Random Forests (RF) method aggregates an ensemble of decision trees to improve prediction accuracy and demonstrates a more robust capacity than other regression methods. This study evaluated the RF method for predicting grassland LAI using ground measurements and remote sensing data. Parameter optimization and variable reduction were conducted before model prediction. Two variable reduction methods were examined: the Variable Importance Value method and the principal component analysis (PCA) method. Finally, the sensitivity of RF to highly correlated variables was tested. The results showed that the RF parameters have a small effect on the performance of RF, and a satisfactory prediction was acquired with a root mean square error (RMSE) of 0.1956. The two variable reduction methods produced different results: variable reduction based on the Variable Importance Value method achieved nearly the same prediction accuracy as the model without variable reduction, whereas variable reduction using the PCA method produced a noticeably degraded result, possibly caused by the loss of subtle variations and the fusion of noise information. After removing highly correlated variables, the relative variable importance remained steady, and variables selected based on the best-performing vegetation indices performed better than those with all vegetation indices or those selected based on the most important one.
The results in this study demonstrate the practical and powerful ability of the RF method in predicting grassland LAI, which can also be applied to the estimation of other vegetation traits as an alternative to conventional empirical regression models and the selection of relevant variables used in ecological models.
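The Variable Importance Value style of reduction described above can be sketched as follows: fit an RF regressor, rank predictors by importance, and refit on the top-ranked subset. The data, the number of retained variables, and the hyperparameters are synthetic assumptions, not the study's.

```python
# Sketch of importance-based variable reduction for RF regression.
# Synthetic stand-in for the LAI predictors; settings are assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=5.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Full model ranks predictors; keep the 5 most important ones.
rf = RandomForestRegressor(n_estimators=200, random_state=2).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:5]

# Refit on the reduced variable set and score it.
rf_red = RandomForestRegressor(n_estimators=200, random_state=2).fit(X_tr[:, top], y_tr)
rmse = mean_squared_error(y_te, rf_red.predict(X_te[:, top])) ** 0.5
print(rmse)
```

The reduced model retains most of the full model's accuracy when the dropped variables carry little importance, mirroring the study's finding.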
Massive Open Online Courses (MOOCs) have become a popular form of online learning used by millions of people across the world. Meanwhile, a vast amount of information has been collected from MOOC learners and institutions. Based on these educational data, many studies have investigated the prediction of a MOOC learner's final grade. However, two problems remain in this research field. The first is how to select the most appropriate features to improve prediction accuracy, and the second is how to use or modify data mining algorithms for a better analysis of MOOC data. To solve these two problems, an improved random forests method is proposed in this paper. First, a hybrid indicator is defined to measure the importance of the features, and a rule is further established for feature selection; then, a Clustering-Synthetic Minority Over-sampling Technique (SMOTE) is embedded into the traditional random forests algorithm to solve the class imbalance problem. In the experimental section, we verify the performance of the proposed method using the Canvas Network Person-Course (CNPC) dataset. Furthermore, four well-known prediction methods are applied for comparison, demonstrating the superiority of our method.
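A minimal NumPy stand-in for the SMOTE oversampling step (without the clustering refinement the paper adds) interpolates new minority samples between nearest neighbours before fitting the forest. Class sizes, features, and neighbour counts below are illustrative assumptions.

```python
# Minimal SMOTE-style oversampling, then an RF on the rebalanced data.
# Simplified stand-in for the Clustering-SMOTE step; data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
X_maj = rng.normal(0, 1, (200, 4))
X_min = rng.normal(2, 1, (20, 4))          # heavily outnumbered minority class
X = np.vstack([X_maj, X_min])
y = np.array([0] * 200 + [1] * 20)

# Synthesize minority points on segments to their nearest neighbours.
nn = NearestNeighbors(n_neighbors=3).fit(X_min)
idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # skip self
synth = []
for _ in range(180):                        # balance classes: 20 -> 200
    i = rng.integers(len(X_min))
    j = rng.choice(idx[i])
    lam = rng.random()
    synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))

X_bal = np.vstack([X, np.array(synth)])
y_bal = np.concatenate([y, np.ones(180, dtype=int)])
rf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_bal, y_bal)
print((y_bal == 0).sum(), (y_bal == 1).sum())
```

Rebalancing keeps the forest from defaulting to the majority class, which is the imbalance problem the paper's Clustering-SMOTE step targets.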
On-site programming big data refers to the massive data generated in the process of software development, characterized by real-time generation, complexity, and difficulty of processing. Data cleaning is therefore essential for on-site programming big data. Duplicate data detection is an important step in data cleaning, which can save storage resources and enhance data consistency. To address the shortcomings of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic, adaptive window size is proposed. The algorithm's efficiency is improved by refining the key-selection method, reducing the dimensionality of the dataset, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
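The adaptive-window idea can be sketched in a toy Sorted Neighborhood Method: sort records by a key, slide a window, grow the window after a match and shrink it after a miss. The records, key, similarity measure, and window bounds are all illustrative assumptions.

```python
# Toy SNM with an adaptive window (illustrative; the paper's algorithm
# additionally uses random forests and dimensionality reduction).
from difflib import SequenceMatcher

records = ["john smith", "jon smith", "mary jones", "marry jones", "bob lee"]

def similar(a, b, threshold=0.85):
    """Crude string similarity as a stand-in for a real match rule."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def adaptive_snm(records, w_min=2, w_max=4):
    recs = sorted(records)               # sort by the chosen key
    pairs, w = [], w_min
    for i, r in enumerate(recs):
        hit = False
        for j in range(i + 1, min(i + w, len(recs))):
            if similar(r, recs[j]):
                pairs.append((r, recs[j]))
                hit = True
        # grow the window after a match, shrink it after a miss
        w = min(w + 1, w_max) if hit else max(w - 1, w_min)
    return pairs

print(adaptive_snm(records))
```

On this toy input the two near-duplicate name pairs are flagged while unrelated records never enter the same comparison window.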
In materials science, data-driven methods accelerate material discovery and optimization while reducing costs and improving success rates. Symbolic regression is key to extracting material descriptors from large datasets, with the Sure Independence Screening and Sparsifying Operator (SISSO) method a prominent example. However, SISSO must store the entire expression space, imposing heavy memory demands that limit its performance on complex problems. To address this issue, we propose RF-SISSO, an algorithm combining Random Forests (RF) with SISSO. In this algorithm, Random Forests are used for prescreening, capturing non-linear relationships and improving feature selection, which may enhance the quality of the input data and boost accuracy and efficiency on regression and classification tasks. In tests on SISSO's verification problem of 299 materials, RF-SISSO demonstrates robust performance and high accuracy, maintaining testing accuracy above 0.9 across all four training sample sizes and significantly enhancing regression efficiency, especially for training subsets with smaller sample sizes. For the training subset with 45 samples, the efficiency of RF-SISSO was 265 times higher than that of the original SISSO. As collecting large datasets is both costly and time-consuming in practical experiments, RF-SISSO may benefit scientific research by efficiently offering high prediction accuracy with limited data.
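The RF prescreening step can be sketched as ranking candidate primary features by Random Forest importance and passing only the top-k into the (not shown) SISSO expression search. The synthetic target, the feature count, and k are assumptions for illustration.

```python
# Sketch of RF prescreening before symbolic regression: keep only the
# features the forest deems important. Data and k are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(299, 12))             # 299 "materials", 12 raw features
# Only features 0 and 3 actually drive the (non-linear) target.
y = 2.0 * X[:, 0] - X[:, 3] ** 2 + 0.1 * rng.normal(size=299)

rf = RandomForestRegressor(n_estimators=300, random_state=4).fit(X, y)
k = 4
keep = np.argsort(rf.feature_importances_)[::-1][:k]
print(sorted(keep.tolist()))
```

Because the forest captures the quadratic dependence on feature 3, both true drivers survive the screen, shrinking the expression space SISSO must enumerate.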
The medical industry generates vast amounts of data suitable for machine learning during patient-clinician interactions in hospitals. However, as a result of data protection regulations like the General Data Protection Regulation (GDPR), patient data cannot be shared freely across institutions. In these cases, federated learning (FL) is a viable option, where a global model learns from multiple data sites without moving the data. In this paper, we focus on random forests (RFs) for their effectiveness in classification tasks and widespread use throughout the medical industry, and compare two popular federated random forest aggregation algorithms on horizontally partitioned data. We first provide the necessary background on federated learning, the advantages of random forests in a medical context, and the two aggregation algorithms. A series of extensive experiments using four public binary medical datasets (an excerpt of MIMIC III, the Pima Indian diabetes dataset from Kaggle, and the diabetic retinopathy and heart failure datasets from the UCI machine learning repository) were then performed to systematically compare the two on equal-sized, unequal-sized, and class-imbalanced clients. A follow-up investigation on the effects of more clients was also conducted. We finally analyzed the advantages of federated learning empirically and concluded that the weighted merge algorithm produces models with, on average, a 1.903% higher F1 score and a 1.406% higher AUC-ROC value.
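One simple reading of a weighted merge is sketched below: each client trains a local forest and the server concatenates the trees, with tree counts proportional to client size. This is a toy interpretation on synthetic data; the aggregation algorithms actually compared in the paper may differ in detail.

```python
# Toy horizontal-FL sketch: merge client forests by appending trees,
# weighting tree counts by client sample share. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=900, n_features=8, random_state=5)
clients = [(X[:600], y[:600]), (X[600:], y[600:])]   # unequal-sized sites
total = sum(len(cx) for cx, _ in clients)

global_rf = None
for cx, cy in clients:
    n_trees = max(1, round(100 * len(cx) / total))   # weight by sample share
    local = RandomForestClassifier(n_estimators=n_trees,
                                   random_state=5).fit(cx, cy)
    if global_rf is None:
        global_rf = local
    else:
        # merge: append the second client's fitted trees to the ensemble
        global_rf.estimators_ += local.estimators_
        global_rf.n_estimators = len(global_rf.estimators_)

print(global_rf.n_estimators)
```

The merged model votes over all 100 trees even though no raw data ever left its client, which is the privacy property motivating federated RFs.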
Background: To create and validate nomograms for the personalized prediction of survival in octogenarians with newly diagnosed non-small-cell lung cancer (NSCLC) with sole brain metastases (BMs). Methods: Random forests (RF) were applied to identify independent prognostic factors for building the nomogram models. The predictive accuracy of the model was evaluated based on the receiver operating characteristic (ROC) curve, C-index, and calibration plots. Results: The area under the curve (AUC) values for overall survival at 6, 12, and 18 months in the validation cohort were 0.837, 0.867, and 0.849, respectively; the AUC values for cancer-specific survival prediction were 0.819, 0.835, and 0.818, respectively. The calibration curves visualized the accuracy of the model. Conclusion: The new nomograms have good predictive power for survival among octogenarians with sole BMs related to NSCLC.
The proliferation of robot accounts on social media platforms has had a significant negative impact, necessitating robust measures to counter network anomalies and safeguard content integrity. Social robot detection has emerged as a pivotal yet intricate task aimed at mitigating the dissemination of misleading information. While graph-based approaches have attained remarkable performance in this realm, they grapple with a fundamental limitation: the homogeneity assumption in graph convolution allows social robots to stealthily evade detection by mingling with genuine human profiles. To unravel this challenge and thwart these camouflage tactics, this work proposes an innovative social robot detection framework based on enhanced HOmogeneity and Random Forest (HORFBot). At the core of HORFBot lies a homogeneous graph enhancement strategy combined with edge-removal techniques to dissect the graph into multiple revealing subgraphs. Subsequently, leveraging contrastive learning, the proposed methodology trains multiple graph convolutional networks, each honed to discern nuances within these tailored subgraphs. The final stage fuses these feature-rich base classifiers, aggregating their insights to produce a comprehensive detection outcome. Extensive experiments on three social robot detection datasets show that this method effectively improves the accuracy of social robot detection and outperforms comparative methods.
Evaluation of water richness in sandstone is an important research topic in the prevention and control of mine water disasters, and the water richness of sandstone is closely related to its porosity. Reflection seismic exploration data provide high-density spatial sampling information, an important data basis for predicting the porosity of sandstone in coal seam roofs. First, the basic principles of the variational mode decomposition (VMD) method and the random forest method are introduced. Then, a geological model of the coal seam roof sandstone is constructed, seismic forward modeling is conducted, and random noise is added. The decomposition effects of the empirical mode decomposition (EMD) method and the VMD method on noisy signals are compared and analyzed. The test results show that the first-order intrinsic mode function (IMF1) and IMF2 decomposed by the VMD method contain the main effective components of the seismic signals. A prediction workflow for the porosity of sandstone in coal seam roofs based on the combination of VMD and the random forest method is proposed, and its feasibility and effectiveness are verified by trial calculation on the porosity prediction of model data. Taking actual coalfield reflection seismic data as an example, the sandstone porosity of the No. 8 coal seam roof is predicted. The application results show the potential value of the new porosity prediction method proposed in this study, which has important theoretical significance for evaluating water richness in coal seam roof sandstone and for the prevention and control of mine water disasters.
The Darjeeling Himalayan region, characterized by its complex topography and vulnerability to multiple environmental hazards, faces significant challenges including landslides, earthquakes, flash floods, and soil loss that critically threaten ecosystem stability. Among these challenges, soil erosion emerges as a silent disaster: a gradual yet relentless process whose impacts accumulate over time, progressively degrading landscape integrity and disrupting ecological sustainability. Unlike catastrophic events with immediate visibility, soil erosion's most devastating consequences often manifest decades later through diminished agricultural productivity, habitat fragmentation, and irreversible biodiversity loss. This study developed a scalable predictive framework employing Random Forest (RF) and Gradient Boosting Tree (GBT) machine learning models to assess and map soil erosion susceptibility across the region. A comprehensive geo-database was developed incorporating 11 erosion-triggering factors: slope, elevation, rainfall, drainage density, topographic wetness index, normalized difference vegetation index, curvature, soil texture, land use, geology, and aspect. A total of 2,483 historical soil erosion locations were identified and randomly divided into two sets: 70% for model building and 30% for validation. The models revealed distinct spatial patterns of erosion risk, with GBT classifying 60.50% of the area as very low susceptibility, while RF identified 28.92% in this category. Notable differences emerged in high-risk zone identification, with GBT highlighting 7.42% and RF indicating 2.21% as very high erosion susceptibility areas. Both models demonstrated robust predictive capabilities, with GBT achieving 80.77% accuracy and 0.975 AUC, slightly outperforming RF's 79.67% accuracy and 0.972 AUC. Analysis of predictor variables identified elevation, slope, rainfall, and NDVI as the primary factors influencing erosion susceptibility, highlighting the complex interrelationship between geo-environmental factors and erosion processes. This research offers a strategic framework for targeted conservation and sustainable land management in the fragile Himalayan region, providing valuable insights to help policymakers implement effective soil erosion mitigation strategies and support long-term environmental sustainability.
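The RF-versus-GBT comparison above follows a standard pattern that can be sketched in a few lines: train both models on the same 70/30 split and compare test AUC. The synthetic features below stand in for the 11 conditioning factors; all settings are illustrative.

```python
# Minimal sketch of an RF-vs-GBT susceptibility comparison (synthetic
# stand-ins for the conditioning factors and the 70/30 split).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2483, n_features=11, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=6)

aucs = {}
for name, model in [("RF", RandomForestClassifier(random_state=6)),
                    ("GBT", GradientBoostingClassifier(random_state=6))]:
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(aucs)
```

Which model wins depends on the data; the paper's small GBT edge (0.975 vs 0.972 AUC) is typical of how close the two ensembles usually land.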
Detecting cyber attacks in networks connected to the Internet of Things (IoT) is of utmost importance because of the growing vulnerabilities in the smart environment. Conventional models, such as Naive Bayes and the support vector machine (SVM), as well as ensemble methods, such as Gradient Boosting and eXtreme Gradient Boosting (XGBoost), are often plagued by high computational costs, making real-time detection challenging. In this regard, we propose an attack detection approach that integrates Visual Geometry Group 16 (VGG16), the Artificial Rabbits Optimizer (ARO), and a Random Forest model to increase detection accuracy and operational efficiency in IoT networks. In the proposed model, features are extracted from malware images using VGG16, and the Random Forest model carries out prediction on those extracted features. Additionally, ARO is used to tune the hyperparameters of the Random Forest model. With an accuracy of 96.36%, the proposed model outperforms the standard models in terms of accuracy, F1-score, precision, and recall. The comparative research highlights the success of our strategy, which improves performance while maintaining a lower computational cost, making it both effective and well suited to real-time applications.
Accurate Electric Load Forecasting (ELF) is crucial for optimizing production capacity, improving operational efficiency, and managing energy resources effectively. Moreover, precise ELF contributes to a smaller environmental footprint by reducing the risks of disruption, downtime, and waste. However, with increasingly complex energy consumption patterns driven by renewable energy integration and changing consumer behaviors, no single approach has emerged as universally effective. In response, this research presents a hybrid modeling framework that combines the strengths of Random Forest (RF) and Autoregressive Integrated Moving Average (ARIMA) models, enhanced with an advanced feature selection method, Minimum Redundancy Maximum Relevancy and Maximum Synergy (MRMRMS), to produce a sparse model. Additionally, the residual patterns are analyzed to enhance forecast accuracy. High-resolution weather data from Weather Underground and historical energy consumption data from PJM for Duke Energy Ohio and Kentucky (DEO&K) are used in this application. The methodology, termed SP-RF-ARIMA, is evaluated against existing approaches and demonstrates a more than 40% reduction in mean absolute error and root mean square error compared to the second-best method.
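A hedged sketch of the hybrid idea: a Random Forest captures the load's dependence on calendar and weather features, and a simple AR(1) correction on the forest's residuals stands in for the ARIMA stage. The data is simulated hourly load, and the full SP-RF-ARIMA pipeline (including MRMRMS selection) is not reproduced.

```python
# Hybrid sketch: RF on exogenous features + AR(1) on residuals,
# a simplified stand-in for an RF+ARIMA load forecaster.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(11)
t = np.arange(24 * 60)                     # 60 days of hourly observations
hours = t % 24
temp = 10 + 8 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, t.size)
load = 50 + 10 * np.sin(2 * np.pi * hours / 24) + 0.5 * temp \
       + rng.normal(0, 1, t.size)

# RF learns the hour-of-day and temperature dependence; hold out one day.
X = np.column_stack([hours, temp])
rf = RandomForestRegressor(n_estimators=100, random_state=11).fit(X[:-24],
                                                                  load[:-24])
resid = load[:-24] - rf.predict(X[:-24])

# AR(1) stand-in for the ARIMA stage: estimate phi from lagged residuals
# and correct only the first forecast step (the correction decays after).
phi = float(np.dot(resid[1:], resid[:-1]) / np.dot(resid[:-1], resid[:-1]))
correction = phi * np.r_[resid[-1], np.zeros(23)]
pred = rf.predict(X[-24:]) + correction
mae = float(np.mean(np.abs(pred - load[-24:])))
print(round(mae, 2))
```

The residual model only helps when the forest leaves serially correlated error behind, which is exactly the pattern analysis the framework above performs before adding the ARIMA stage.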
A switch from avian-type α-2,3 to human-type α-2,6 receptors is an essential element for the initiation of a pandemic from an avian influenza virus. Some H9N2 viruses exhibit a preference for binding to human-type α-2,6 receptors, which highlights their potential threat to public health. However, our understanding of the molecular basis for the switch in receptor preference is still limited. In this study, we employed the random forest algorithm to identify potentially key amino acid sites within hemagglutinin (HA) that are associated with the receptor binding ability of the H9N2 avian influenza virus (AIV). These sites were subsequently verified by receptor binding assays. A total of 12 substitutions in the HA protein (N158D, N158S, A160N, A160D, A160T, T163I, T163V, V190T, V190A, D193N, D193G, and N231D) were predicted to prefer binding to α-2,6 receptors. Except for the V190T substitution, all of these substitutions were demonstrated by receptor binding assays to display preferential binding to α-2,6 receptors. In particular, the A160T substitution caused a significant upregulation of immune-response genes and an increased mortality rate in mice. Our findings provide novel insights into the genetic basis of the receptor preference of H9N2 AIV.
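The site-identification step can be sketched as follows: one-hot encode each sequence position, train a Random Forest against a binary receptor-preference label, and sum importances back per site. The sequences, alphabet, and labels below are fabricated for illustration; the study's actual encoding may differ.

```python
# Toy sketch of ranking sequence sites by RF importance.
# Sequences, alphabet, and phenotype labels are fabricated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
n_seq, n_sites, n_aa = 400, 30, 4          # 4-letter toy alphabet
seqs = rng.integers(0, n_aa, size=(n_seq, n_sites))
labels = (seqs[:, 10] == 2).astype(int)    # site 10 drives the phenotype

# One-hot encode each site, keeping a map back to site indices.
X = np.eye(n_aa)[seqs].reshape(n_seq, n_sites * n_aa)
rf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X, labels)

# Collapse per-letter importances into one score per site.
site_importance = rf.feature_importances_.reshape(n_sites, n_aa).sum(axis=1)
print(int(np.argmax(site_importance)))
```

The forest concentrates importance on the causal site, which is the kind of candidate list the study then carried into receptor binding assays.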
Zenith wet delay (ZWD) is a key parameter for the precise positioning of global navigation satellite systems (GNSS) and occupies a central role in meteorological research. Currently, most models consider only the periodic variability of the ZWD, neglecting the effect of nonlinear factors on ZWD estimation; this limits their capability to reflect rapid fluctuations of the ZWD. To capture and predict complicated variations in ZWD more accurately, this paper develops the CRZWD model by combining the GPT3 model with the random forests (RF) algorithm, using 5 years of atmospheric profiles from 70 radiosonde (RS) stations across China. Taking data from 25 external test stations as reference, the root mean square (RMS) error of the CRZWD model is 29.95 mm. Compared with the GPT3 model and another model using a backpropagation neural network (BPNN), accuracy improves by 24.7% and 15.9%, respectively. Notably, over 56% of the test stations exhibit an improvement of more than 20% relative to GPT3-ZWD. Further temporal and spatial analyses also demonstrate the significant accuracy and stability advantages of the CRZWD model, indicating promising prospects for GNSS-based applications.
One of the core tasks in analyzing Electrochemical Impedance Spectroscopy (EIS) data is selecting an appropriate equivalent circuit model to quantify the parameters of the electrochemical reaction process. However, this process often relies on human experience and judgment, which introduces subjectivity and error. In this paper, an intelligent approach is proposed for matching EIS data to equivalent circuits based on the Random Forest algorithm; it automatically selects the most suitable equivalent circuit model based on the characteristics and patterns of the EIS data. Addressing the typical scenario of metal corrosion, an atmospheric corrosion EIS dataset of low-carbon steel covering five different corrosion scenarios is constructed and used to validate and evaluate the proposed method. The contributions of this paper can be summarized in three aspects: (1) a method for selecting equivalent circuit models for EIS data based on the Random Forest algorithm; (2) a dataset encompassing five categories of metal corrosion scenarios, established from authentic EIS data collected from metal atmospheric corrosion; and (3) validation of the proposed method's superiority using the established authentic EIS dataset. The experimental results demonstrate that, in equivalent circuit matching, this method surpasses other machine learning algorithms in both precision and robustness, and shows strong applicability in the analysis of EIS data.
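Circuit-model selection can be framed as multiclass classification, as sketched below: impedance spectra are flattened into feature vectors (log-magnitude and phase at fixed frequencies) and a Random Forest predicts the circuit class. The Randles-type simulator and the two toy classes are assumptions, not the paper's corrosion dataset.

```python
# Sketch: equivalent-circuit selection as classification over spectra.
# Synthetic Randles-type spectra stand in for the corrosion dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
freqs = np.logspace(-1, 4, 20)

def randles(R_s, R_ct, C_dl):
    """Impedance of a simple Randles-type circuit: R_s + (R_ct || C_dl)."""
    w = 2 * np.pi * freqs
    return R_s + R_ct / (1 + 1j * w * R_ct * C_dl)

# Two toy "circuit classes": small vs large charge-transfer resistance.
X, y = [], []
for label, R_ct in [(0, 100.0), (1, 2000.0)]:
    for _ in range(100):
        Z = randles(10 * rng.uniform(0.5, 1.5),
                    R_ct * rng.uniform(0.5, 1.5),
                    1e-5 * rng.uniform(0.5, 1.5))
        X.append(np.concatenate([np.log10(np.abs(Z)), np.angle(Z)]))
        y.append(label)
X, y = np.array(X), np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=10)
rf = RandomForestClassifier(n_estimators=100, random_state=10).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))
```

The low-frequency magnitude alone separates these toy classes; real corrosion spectra are far noisier, which is where the forest's robustness matters.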
This study investigated the impacts of random negative training datasets (NTDs) on the uncertainty of machine learning models for geologic hazard susceptibility assessment of the Loess Plateau, northern Shaanxi Province, China. Based on 40 randomly generated NTDs, the study developed models for geologic hazard susceptibility assessment using the random forest algorithm and evaluated their performance using the area under the receiver operating characteristic curve (AUC). Specifically, the means and standard deviations of the AUC values from all models were used to assess the overall spatial correlation between the conditioning factors and the susceptibility assessment, as well as the uncertainty introduced by the NTDs. A risk-and-return methodology was then employed to quantify and mitigate this uncertainty, with log odds ratios used to characterize the susceptibility assessment levels. The risk and return values were calculated from the standard deviations and means of the log odds ratios of various locations. After the mean log odds ratios were converted into probability values, the final susceptibility map was plotted, accounting for the uncertainty induced by random NTDs. The results indicate that the AUC values of the models ranged from 0.810 to 0.963, with an average of 0.852 and a standard deviation of 0.035, indicating encouraging predictive performance with some uncertainty. The risk-and-return analysis reveals that low-risk, high-return areas correspond to lower standard deviations and higher means across the multiple model-derived assessments. Overall, this study introduces a new framework for quantifying the uncertainty of multiple training and evaluation models, aimed at improving their robustness and reliability. Additionally, by identifying low-risk, high-return areas, resource allocation for geologic hazard prevention and control can be optimized, ensuring that limited resources are directed toward the most effective prevention and control measures.
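The aggregation step described above reduces to a small amount of arithmetic, sketched here: convert each model's susceptibility probability to log odds, take the per-location mean ("return") and standard deviation ("risk") across models, and map the mean back to a probability. The 40 models and their probabilities are simulated assumptions.

```python
# Numeric sketch of the risk-and-return aggregation over 40 models.
# Per-model probabilities are simulated; the math is the point.
import numpy as np

rng = np.random.default_rng(8)
# 5 locations x 40 models of assessed susceptibility probabilities
p = np.clip(rng.normal(loc=[[0.9], [0.7], [0.5], [0.3], [0.1]],
                       scale=0.05, size=(5, 40)), 0.01, 0.99)

log_odds = np.log(p / (1 - p))
ret = log_odds.mean(axis=1)           # "return": mean log odds per location
risk = log_odds.std(axis=1)           # "risk": spread across the 40 models
final_p = 1.0 / (1.0 + np.exp(-ret))  # mean log odds back to probability

print(np.round(final_p, 2), np.round(risk, 2))
```

Averaging in log-odds space rather than probability space keeps the aggregate well behaved near 0 and 1, and the per-location risk value flags where the random NTDs made the models disagree.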
The agricultural Internet of Things (IoT) system is a critical component of modern smart agriculture, and its security risk assessment methods have garnered increasing attention from the industry. Current agricultural IoT security risk assessment methods rely primarily on expert judgment, introducing subjective factors that reduce the credibility of the assessment results. To address this issue, this study constructed a dataset for agricultural IoT security risk assessment based on real-world security reports. A PCARF algorithm, built on random forest principles, was proposed, incorporating ensemble learning strategies to enhance prediction accuracy. Compared to the second-best model, the proposed model demonstrated a 2.7% increase in accuracy, a 3.4% improvement in recall, a 3.1% rise in Area Under the Curve (AUC), and a 7.9% boost in Matthews Correlation Coefficient (MCC). Extensive comparative experiments showed that the proposed model outperforms others in prediction accuracy and robustness.
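Reading the PCARF name as a PCA-plus-Random-Forest combination, one plausible baseline is sketched below; the paper's actual algorithm and ensemble strategy are not reproduced, and the dataset and component counts are assumptions.

```python
# Hedged sketch of a PCA -> Random Forest pipeline, one plausible
# reading of "PCARF"; settings and data are illustrative.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=9)
model = make_pipeline(PCA(n_components=10),
                      RandomForestClassifier(n_estimators=150, random_state=9))
score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(round(score, 3))
```

Wrapping PCA inside the pipeline ensures the projection is refit per CV fold, avoiding leakage from the held-out split into the components.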
The prediction of slope stability is a complex nonlinear problem. This paper proposes a new method based on the random forest (RF) algorithm to study rocky slope stability. Taking Bukit Merah, Perak and Twin Peak (Kuala Lumpur) as the study area, slope geometrical characteristics are obtained from a multidisciplinary approach consisting of geological, geotechnical, and remote sensing analyses. Eighteen factors, including rock strength, rock quality designation (RQD), joint spacing, continuity, openness, roughness, filling, weathering, water seepage, temperature, vegetation index, water index, and orientation, are selected to construct the model input variables, while the factor of safety (FOS) serves as the output. The area under the receiver operating characteristic (ROC) curve (AUC), together with precision and accuracy, is used to analyse the model's predictive ability. With a large training set and the predicted parameters, an AUC of 0.95 is achieved, and a precision score of 0.88 indicates that the model has a low false positive rate and correctly identifies a substantial number of true positives. The findings emphasize the importance of using a variety of terrain characteristics and different approaches to characterize rock slopes.
Funding: Guangdong Innovation and Entrepreneurship Training Programme for Undergraduates, "Automatic Classification and Identification of Fraudulent Websites Based on Machine Learning" (Project No.: DC2023125).
Funding: Funded by the Key Technologies Research and Development Program of China (2013BAC03B02, 2012BAC19B04), the International Science and Technology Cooperation Project of China (2012DFA31290), and the Earmarked Fund for Modern Agro-industry Technology Research System, China (CARS-35).
Abstract: Leaf area index (LAI) is a key parameter for describing vegetation structure and is closely associated with vegetative photosynthesis and energy balance. Accurate retrieval of LAI is important when modeling the biophysical processes of vegetation and the productivity of earth systems. The Random Forests (RF) method aggregates an ensemble of decision trees to improve prediction accuracy and is more robust than other regression methods. This study evaluated the RF method for predicting grassland LAI using ground measurements and remote sensing data. Parameter optimization and variable reduction were conducted before model prediction. Two variable reduction methods were examined: the Variable Importance Value method and principal component analysis (PCA). Finally, the sensitivity of RF to highly correlated variables was tested. The results showed that the RF parameters have a small effect on performance, and a satisfactory prediction was acquired with a root mean square error (RMSE) of 0.1956. The two variable reduction methods produced different results: reduction based on the Variable Importance Value method achieved nearly the same prediction accuracy as no reduction, whereas reduction using the PCA method gave a clearly degraded result, possibly caused by the loss of subtle variations and the fusion of noise information. After removing highly correlated variables, the relative variable importance remained steady, and variables selected from the best-performing vegetation indices performed better than all vegetation indices or variables selected from the single most important index. The results demonstrate the practical and powerful ability of the RF method to predict grassland LAI, and the approach can also be applied to estimating other vegetation traits as an alternative to conventional empirical regression models and to selecting relevant variables for ecological models.
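The importance-based variable reduction described above can be sketched as follows; the synthetic regression data and the mean-importance threshold are assumptions, not the study's grassland LAI set:

```python
# Sketch: RF regression, then refit using only variables above mean importance.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=20, n_informative=5,
                       noise=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Keep the features whose importance exceeds the mean importance.
keep = rf.feature_importances_ > rf.feature_importances_.mean()
rf_red = RandomForestRegressor(n_estimators=200, random_state=0)
rf_red.fit(X_tr[:, keep], y_tr)

rmse = mean_squared_error(y_te, rf_red.predict(X_te[:, keep])) ** 0.5
print(int(keep.sum()), round(rmse, 3))
```

This mirrors the study's finding that importance-based reduction preserves accuracy, unlike PCA, which discards variable identity along with subtle variation.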
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61801222, in part by the Fundamental Research Funds for the Central Universities under Grant No. 30919011230, and in part by the Jiangsu Provincial Department of Education Degree and Graduate Education Research Fund under Grant No. JGZD18_012.
Abstract: Massive Open Online Courses (MOOCs) have become a popular way of online learning used by millions of people across the world. Meanwhile, a vast amount of information has been collected from MOOC learners and institutions. Based on these educational data, many studies have investigated prediction of the MOOC learner's final grade. However, two problems remain in this research field: how to select the most appropriate features to improve prediction accuracy, and how to use or modify data mining algorithms for a better analysis of MOOC data. To solve these two problems, an improved random forests method is proposed in this paper. First, a hybrid indicator is defined to measure the importance of the features, and a rule is established for feature selection; then, a Clustering-Synthetic Minority Over-sampling Technique (SMOTE) is embedded into the traditional random forests algorithm to solve the class imbalance problem. In the experiments, we verify the performance of the proposed method using the Canvas Network Person-Course (CNPC) dataset. Furthermore, four well-known prediction methods are applied for comparison, and the superiority of our method is demonstrated.
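A deliberately simplified stand-in for the Clustering-SMOTE step: naive random oversampling of the minority class before fitting the forest (the paper's clustering-guided synthetic sampling is more sophisticated; all data here are synthetic):

```python
# Sketch: balance classes by duplicating minority samples, then fit an RF.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)

# Duplicate minority samples until the classes are balanced.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
print(np.bincount(y_bal))
```

SMOTE proper interpolates new minority points between neighbors rather than duplicating them; the balancing intent is the same.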
Funding: Supported by the National Key R&D Program of China (No. 2018YFB1003905), the National Natural Science Foundation of China under Grant No. 61971032, and the Fundamental Research Funds for the Central Universities (No. FRF-TP-18-008A3).
Abstract: On-site programming big data refers to the massive data generated during software development, characterized by real-time generation, complexity, and difficulty of processing. Data cleaning is therefore essential for such data. Duplicate data detection is an important step in data cleaning that can save storage resources and enhance data consistency. To address the insufficiency of the traditional Sorted Neighborhood Method (SNM) and the difficulty of high-dimensional data detection, an optimized algorithm based on random forests with a dynamic, adaptive window size is proposed. The efficiency of the algorithm is improved by refining key selection, reducing the dimensionality of the data set, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
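A toy sketch of the Sorted Neighborhood idea with an adaptive window: records are sorted by a key and the comparison window grows while matches keep appearing. The records, similarity measure, and window bounds are all assumptions; the paper's random-forest-driven key selection is not reproduced:

```python
# Sketch: SNM with a window that widens while duplicates are still found.
from difflib import SequenceMatcher

records = ["john smith", "jon smith", "alice wong", "alyce wong", "bob lee"]

def similar(a, b, threshold=0.8):
    # Toy string similarity standing in for a learned match score.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def snm_adaptive(records, min_window=2, max_window=4):
    ordered = sorted(records)          # key selection: the record itself
    pairs = []
    for i, rec in enumerate(ordered):
        w = min_window
        while w <= max_window:
            cand = ordered[i + 1:i + w]
            hits = [c for c in cand if similar(rec, c)]
            pairs += [(rec, c) for c in hits if (rec, c) not in pairs]
            if not hits or w == max_window:
                break                   # shrink back when matches dry up
            w += 1                      # grow the window while hits continue
    return pairs

dupes = snm_adaptive(records)
print(dupes)
```

The fixed-window weakness of classical SNM is exactly what the adaptive sizing addresses: a window too small misses duplicates, one too large wastes comparisons.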
Funding: Supported by the National Natural Science Foundation of China (Nos. 21933006 and 21773124), the Fundamental Research Funds for the Central Universities of Nankai University (Nos. 63243091 and 63233001), and the Supercomputing Center of Nankai University (NKSC).
Abstract: In materials science, data-driven methods accelerate material discovery and optimization while reducing costs and improving success rates. Symbolic regression is key to extracting material descriptors from large datasets, in particular the Sure Independence Screening and Sparsifying Operator (SISSO) method. However, SISSO must store the entire expression space, which imposes heavy memory demands and limits its performance on complex problems. To address this issue, we propose the RF-SISSO algorithm, which combines Random Forests (RF) with SISSO. In this algorithm, Random Forests are used for prescreening, capturing non-linear relationships and improving feature selection, which can enhance the quality of the input data and boost accuracy and efficiency on regression and classification tasks. In a test on SISSO's verification problem for 299 materials, RF-SISSO demonstrates robust performance and high accuracy, maintaining testing accuracy above 0.9 across all four training sample sizes and significantly enhancing regression efficiency, especially for training subsets with smaller sample sizes. For the training subset with 45 samples, the efficiency of RF-SISSO was 265 times higher than that of the original SISSO. As collecting large datasets is both costly and time-consuming in practical experiments, RF-SISSO may benefit scientific research by efficiently offering high prediction accuracy with limited data.
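The prescreening stage alone can be sketched as below; SISSO itself is not run, the data are synthetic, and the top-k cutoff is an assumption:

```python
# Sketch: RF importances shrink the candidate feature set before the
# memory-hungry symbolic-regression (SISSO) stage.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=299, n_features=50, n_informative=6,
                       random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# Keep only the top-k features by importance for the downstream search;
# a smaller primary feature set shrinks the expression space combinatorially.
k = 10
top = np.argsort(rf.feature_importances_)[::-1][:k]
X_screened = X[:, np.sort(top)]
print(X_screened.shape)
```

Because SISSO's expression space grows combinatorially with the number of primary features, even a modest reduction like 50 → 10 can explain the large efficiency gains reported.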
Abstract: The medical industry generates vast amounts of data suitable for machine learning during patient-clinician interactions in hospitals. However, as a result of data protection regulations such as the General Data Protection Regulation (GDPR), patient data cannot be shared freely across institutions. In such cases, federated learning (FL) is a viable option, in which a global model learns from multiple data sites without moving the data. In this paper, we focus on random forests (RFs), given their effectiveness in classification tasks and widespread use throughout the medical industry, and compare two popular federated random forest aggregation algorithms on horizontally partitioned data. We first provide necessary background on federated learning, the advantages of random forests in a medical context, and the two aggregation algorithms. A series of extensive experiments using four public binary medical datasets (an excerpt of MIMIC-III, the Pima Indians diabetes dataset from Kaggle, and the diabetic retinopathy and heart failure datasets from the UCI Machine Learning Repository) were then performed to systematically compare the two on equal-sized, unequal-sized, and class-imbalanced clients. A follow-up investigation into the effect of more clients was also conducted. We finally analyzed the advantages of federated learning empirically and concluded that the weighted merge algorithm produces models with, on average, a 1.903% higher F1 score and a 1.406% higher AUC-ROC value.
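A rough sketch of forest merging on horizontally partitioned data, assuming the simple unweighted variant (each client trains locally and the server pools the trees); this relies on mutating scikit-learn's `estimators_` attribute, which is a common but unofficial pattern, and is not either paper algorithm verbatim:

```python
# Sketch: three clients train local forests; the server concatenates the trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=900, random_state=0)
clients = np.array_split(np.arange(900), 3)       # three equal-sized clients

local = []
for idx in clients:
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    local.append(rf.fit(X[idx], y[idx]))          # data never leaves the client

# Global model: pool all local trees. A weighted merge would instead scale
# each client's contribution, e.g. by its local sample count.
global_rf = local[0]
global_rf.estimators_ = [t for m in local for t in m.estimators_]
global_rf.n_estimators = len(global_rf.estimators_)

print(global_rf.n_estimators)
```

Only fitted trees travel to the server, which is what makes the scheme compatible with GDPR-style restrictions on raw data movement.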
Funding: Supported by the Key Specialty of Traditional Chinese Medicine Promotion Project.
Abstract: Background: To create and validate nomograms for the personalized prediction of survival in octogenarians with newly diagnosed non-small-cell lung cancer (NSCLC) with sole brain metastases (BMs). Methods: Random forests (RF) were applied to identify independent prognostic factors for building the nomogram models. The predictive accuracy of the models was evaluated using the receiver operating characteristic (ROC) curve, C-index, and calibration plots. Results: The area under the curve (AUC) values for overall survival at 6, 12, and 18 months in the validation cohort were 0.837, 0.867, and 0.849, respectively; the AUC values for cancer-specific survival prediction were 0.819, 0.835, and 0.818, respectively. The calibration curves confirmed the accuracy of the models. Conclusion: The new nomograms have good predictive power for survival among octogenarians with sole BMs related to NSCLC.
Funding: Funds for the Central Universities (grant number CUC24SG018).
Abstract: The proliferation of robot accounts on social media platforms has had a significant negative impact, necessitating robust measures to counter network anomalies and safeguard content integrity. Social robot detection has emerged as a pivotal yet intricate task aimed at mitigating the dissemination of misleading information. While graph-based approaches have attained remarkable performance in this area, they suffer from a fundamental limitation: the homogeneity assumption in graph convolution allows social robots to evade detection by mingling with genuine human profiles. To address this challenge and counter such camouflage tactics, this work proposes a social robot detection framework based on enhanced HOmogeneity and Random Forest (HORFBot). At the core of HORFBot lies a homogeneous graph enhancement strategy that uses edge-removal techniques to split the graph into multiple informative subgraphs. Using contrastive learning, the method then trains multiple graph convolutional networks, each specialized for one of these subgraphs. The final stage fuses these base classifiers, aggregating their outputs to produce the overall detection result. Extensive experiments on three social robot detection datasets show that this method effectively improves the accuracy of social robot detection and outperforms comparative methods.
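The final fusion stage can be sketched in isolation: several base classifiers, each trained on a different view of the data, are combined by averaging their class probabilities. Plain random forests on feature slices stand in for the paper's GCNs on subgraphs, and the data are synthetic:

```python
# Sketch: per-view base classifiers fused by soft voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=12, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

views = [slice(0, 4), slice(4, 8), slice(8, 12)]   # stand-ins for subgraphs
bases = [RandomForestClassifier(random_state=i).fit(X_tr[:, v], y_tr)
         for i, v in enumerate(views)]

# Fusion: average the per-view class probabilities, then take the argmax.
proba = np.mean([b.predict_proba(X_te[:, v])
                 for b, v in zip(bases, views)], axis=0)
acc = accuracy_score(y_te, proba.argmax(axis=1))
print(round(acc, 3))
```

The intuition carries over: a bot that camouflages itself in one view is less likely to fool every specialized base classifier at once.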
Funding: National Natural Science Foundation of China (Grant No. 42274180); National Key Research and Development Program of China (2021YFC2902003).
Abstract: Evaluation of water richness in sandstone is an important research topic in the prevention and control of mine water disasters, and the water richness of sandstone is closely related to its porosity. Reflection seismic exploration data provide high-density spatial sampling information, offering an important data basis for predicting the sandstone porosity of coal seam roofs. First, the basic principles of the variational mode decomposition (VMD) method and the random forest method are introduced. Then, a geological model of coal seam roof sandstone is constructed, seismic forward modeling is conducted, and random noise is added. The decomposition effects of the empirical mode decomposition (EMD) method and the VMD method on noisy signals are compared and analyzed. The test results show that the first two intrinsic mode functions (IMF1 and IMF2) decomposed by the VMD method contain the main effective components of the seismic signals. A workflow for predicting the sandstone porosity of coal seam roofs based on the combination of VMD and the random forest method is proposed, and its feasibility and effectiveness are verified by trial calculation on the porosity prediction of model data. Taking actual coalfield reflection seismic data as an example, the sandstone porosity of the No. 8 coal seam roof is predicted. The application results show the potential value of the new porosity prediction method, which has important theoretical significance for evaluating water richness in coal seam roof sandstone and for the prevention and control of mine water disasters.
Abstract: The Darjeeling Himalayan region, characterized by complex topography and vulnerability to multiple environmental hazards, faces significant challenges, including landslides, earthquakes, flash floods, and soil loss, that critically threaten ecosystem stability. Among these challenges, soil erosion emerges as a silent disaster: a gradual yet relentless process whose impacts accumulate over time, progressively degrading landscape integrity and disrupting ecological sustainability. Unlike catastrophic events with immediate visibility, soil erosion's most devastating consequences often manifest decades later through diminished agricultural productivity, habitat fragmentation, and irreversible biodiversity loss. This study developed a scalable predictive framework employing Random Forest (RF) and Gradient Boosting Tree (GBT) machine learning models to assess and map soil erosion susceptibility across the region. A comprehensive geo-database was built incorporating 11 erosion-triggering factors: slope, elevation, rainfall, drainage density, topographic wetness index, normalized difference vegetation index (NDVI), curvature, soil texture, land use, geology, and aspect. A total of 2,483 historical soil erosion locations were identified and randomly divided into two sets: 70% for model building and 30% for validation. The models revealed distinct spatial patterns of erosion risk: GBT classified 60.50% of the area as very low susceptibility, while RF placed 28.92% in this category. Notable differences emerged in high-risk zone identification, with GBT assigning 7.42% and RF 2.21% of the area to the very high susceptibility class. Both models demonstrated robust predictive capability, with GBT achieving 80.77% accuracy and 0.975 AUC, slightly outperforming RF's 79.67% accuracy and 0.972 AUC. Analysis of the predictor variables identified elevation, slope, rainfall, and NDVI as the primary factors influencing erosion susceptibility, highlighting the complex interrelationship between geo-environmental factors and erosion processes. This research offers a strategic framework for targeted conservation and sustainable land management in the fragile Himalayan region, providing valuable insights to help policymakers implement effective soil erosion mitigation strategies and support long-term environmental sustainability.
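The RF-versus-GBT comparison with a 70/30 split can be sketched as follows on synthetic stand-in data (the real study uses 11 mapped geo-environmental factors and 2,483 field-verified locations):

```python
# Sketch: fit RF and GBT on the same 70/30 split, report accuracy and AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2483, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("RF", RandomForestClassifier(random_state=0)),
                    ("GBT", GradientBoostingClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(name, round(acc, 3), round(auc, 3))
```

For susceptibility mapping, `predict_proba` over a raster grid of the 11 factors is what produces the continuous susceptibility surface that gets binned into very-low-to-very-high classes.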
Funding: Funded by Institutional Fund Projects under grant no. IFPDP-261-22.
Abstract: Detecting cyber attacks in networks connected to the Internet of Things (IoT) is of utmost importance because of the growing vulnerabilities in smart environments. Conventional models such as Naive Bayes and the support vector machine (SVM), as well as ensemble methods such as Gradient Boosting and eXtreme Gradient Boosting (XGBoost), are often plagued by high computational costs, which makes real-time detection challenging. We therefore propose an attack detection approach that integrates Visual Geometry Group 16 (VGG16), the Artificial Rabbits Optimizer (ARO), and a Random Forest model to increase detection accuracy and operational efficiency in IoT networks. In the proposed model, features are extracted from malware images using VGG16, and prediction is carried out by the Random Forest model using these extracted features. Additionally, ARO is used to tune the hyper-parameters of the Random Forest model. With an accuracy of 96.36%, the proposed model outperforms the standard models in terms of accuracy, F1-score, precision, and recall. The comparative study highlights the success of our strategy, which improves performance while maintaining a lower computational cost, making the method both effective and suitable for real-time applications.
Funding: Supported by the Startup Grant (PG18929) awarded to F. Shokoohi.
Abstract: Accurate electric load forecasting (ELF) is crucial for optimizing production capacity, improving operational efficiency, and managing energy resources effectively. Moreover, precise ELF contributes to a smaller environmental footprint by reducing the risks of disruption, downtime, and waste. However, with increasingly complex energy consumption patterns driven by renewable energy integration and changing consumer behavior, no single approach has emerged as universally effective. In response, this research presents a hybrid modeling framework that combines the strengths of Random Forest (RF) and Autoregressive Integrated Moving Average (ARIMA) models, enhanced with an advanced feature selection method, Minimum Redundancy Maximum Relevancy and Maximum Synergy (MRMRMS), to produce a sparse model. Additionally, the residual patterns are analyzed to enhance forecast accuracy. High-resolution weather data from Weather Underground and historical energy consumption data from PJM for Duke Energy Ohio and Kentucky (DEO&K) are used in this application. The methodology, termed SP-RF-ARIMA, is evaluated against existing approaches and demonstrates more than a 40% reduction in mean absolute error and root mean square error compared to the second-best method.
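A toy sketch of the hybrid idea: a linear time-series model captures the autoregressive part of the load and a Random Forest learns the residual pattern from exogenous (weather) features. A least-squares AR(1) fit stands in for ARIMA, the data are simulated, and this is in-sample only, so it illustrates the decomposition rather than the paper's out-of-sample results:

```python
# Sketch: linear AR(1) stage + RF on residuals = hybrid forecast.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
temp = rng.normal(size=n)                        # stand-in weather feature
load = np.zeros(n)
for t in range(1, n):                            # AR(1) + nonlinear weather term
    load[t] = 0.7 * load[t - 1] + np.sin(temp[t]) + 0.1 * rng.normal()

# Stage 1: fit the AR(1) coefficient by least squares (ARIMA stand-in).
phi = np.linalg.lstsq(load[:-1, None], load[1:], rcond=None)[0][0]
linear_pred = phi * load[:-1]
resid = load[1:] - linear_pred

# Stage 2: the Random Forest models the residuals from the weather feature.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(temp[1:, None], resid)

hybrid = linear_pred + rf.predict(temp[1:, None])
rmse = float(np.sqrt(np.mean((load[1:] - hybrid) ** 2)))
rmse_linear = float(np.sqrt(np.mean(resid ** 2)))
print(round(rmse, 3), round(rmse_linear, 3))
```

The residual analysis step in SP-RF-ARIMA plays the same role: whatever structure the linear stage leaves behind is handed to the nonlinear learner.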
Funding: Supported by the National Natural Science Foundation of China (32273037 and 32102636), the Guangdong Major Project of Basic and Applied Basic Research (2020B0301030007), the Laboratory of Lingnan Modern Agriculture Project (NT2021007), the Guangdong Science and Technology Innovation Leading Talent Program (2019TX05N098), the 111 Center (D20008), the Double First-Class Discipline Promotion Project (2023B10564003), and the Department of Education of Guangdong Province (2019KZDXM004 and 2019KCXTD001).
Abstract: A switch from avian-type α-2,3 to human-type α-2,6 receptors is an essential element for the initiation of a pandemic from an avian influenza virus. Some H9N2 viruses exhibit a preference for binding to human-type α-2,6 receptors, identifying them as a potential threat to public health. However, our understanding of the molecular basis for the switch in receptor preference is still limited. In this study, we employed the random forest algorithm to identify potentially key amino acid sites within hemagglutinin (HA) associated with the receptor binding ability of H9N2 avian influenza virus (AIV). These sites were then verified by receptor binding assays. A total of 12 substitutions in the HA protein (N158D, N158S, A160N, A160D, A160T, T163I, T163V, V190T, V190A, D193N, D193G, and N231D) were predicted to prefer binding to α-2,6 receptors. Except for the V190T substitution, all of these substitutions were demonstrated by receptor binding assays to preferentially bind α-2,6 receptors. In particular, the A160T substitution caused a significant upregulation of immune-response genes and an increased mortality rate in mice. Our findings provide novel insights into the genetic basis of the receptor preference of H9N2 AIV.
Funding: Supported by the National Natural Science Foundation of China [42030109, 42074012], the Scientific Study Project for Institutes of Higher Learning, Ministry of Education, Liaoning Province [LJKMZ20220673], the State Key Laboratory of Geodesy and Earth's Dynamics, Innovation Academy for Precision Measurement Science and Technology [SKLGED2023-3-2], the Liaoning Revitalization Talent Program [XLYC2203162], and the Natural Science Foundation of Hebei Province, China [D2023402024].
Abstract: Zenith wet delay (ZWD) is a key parameter for the precise positioning of global navigation satellite systems (GNSS) and plays a central role in meteorological research. Most current models consider only the periodic variability of ZWD, neglecting the effect of nonlinear factors on ZWD estimation; this oversight limits their ability to reflect rapid ZWD fluctuations. To capture and predict complicated ZWD variations more accurately, this paper develops the CRZWD model by combining the GPT3 model with the random forests (RF) algorithm, using five years of atmospheric profiles from 70 radiosonde (RS) stations across China. Taking data from 25 external test stations as reference, the root mean square (RMS) error of the CRZWD model is 29.95 mm. Compared with the GPT3 model and another model using a backpropagation neural network (BPNN), accuracy improved by 24.7% and 15.9%, respectively. Notably, over 56% of the test stations exhibit an improvement of more than 20% relative to GPT3-ZWD. Further temporal and spatial analyses also demonstrate the significant accuracy and stability advantages of the CRZWD model, indicating its potential for GNSS-based applications.
基金support of the project from the National Key R&D Program of China,Research and Application of Sensing System for Cross-regional Complex Oil&Gas Pipeline Network Safe and Efficiency Operational Status Monitoring(Grant No.2022YFB3207603).
Abstract: One of the core tasks in analyzing electrochemical impedance spectroscopy (EIS) data is selecting an appropriate equivalent circuit model to quantify the parameters of the electrochemical reaction process. This process often relies on human experience and judgment, which introduces subjectivity and error. In this paper, an intelligent approach based on the random forest algorithm is proposed for matching EIS data to their equivalent circuits; it automatically selects the most suitable equivalent circuit model based on the characteristics and patterns of the EIS data. Addressing the typical scenario of metal corrosion, an atmospheric corrosion EIS dataset of low-carbon steel covering five different corrosion scenarios is constructed and used to validate and evaluate the proposed method. The contributions of this paper are threefold: (1) a method for selecting equivalent circuit models for EIS data based on the random forest algorithm; (2) a dataset of five categories of metal corrosion scenarios built from authentic EIS data collected during metal atmospheric corrosion; and (3) validation of the proposed method on this dataset. The experimental results demonstrate that, in equivalent circuit matching, the method surpasses other machine learning algorithms in both precision and robustness and shows strong applicability to EIS data analysis.
Funding: Supported by the National Key R&D Program of China project, Loess Plateau Region-Watershed-Slope Geological Hazard Multi-Scale Collaborative Intelligent Early Warning System (2022YFC3003404), a project of the Shaanxi Youth Science and Technology Star Program (2021KJXX-87), and public welfare geological survey projects of the Shaanxi Institute of Geologic Survey (20180301, 201918, 202103, and 202413).
Abstract: This study investigated the impact of random negative training datasets (NTDs) on the uncertainty of machine learning models for geologic hazard susceptibility assessment of the Loess Plateau, northern Shaanxi Province, China. Based on 40 randomly generated NTDs, susceptibility assessment models were developed using the random forest algorithm and evaluated using the area under the receiver operating characteristic curve (AUC). The means and standard deviations of the AUC values across all models were used to assess the overall spatial correlation between the conditioning factors and the susceptibility assessment, as well as the uncertainty introduced by the NTDs. A risk-and-return methodology was then employed to quantify and mitigate this uncertainty, with log odds ratios used to characterize the susceptibility levels: the risk and return values were calculated from the standard deviations and means, respectively, of the log odds ratios at each location. After converting the mean log odds ratios into probability values, the final susceptibility map was plotted, accounting for the uncertainty induced by random NTDs. The AUC values of the models ranged from 0.810 to 0.963, with an average of 0.852 and a standard deviation of 0.035, indicating encouraging predictive performance alongside a degree of uncertainty. The risk-and-return analysis reveals that low-risk, high-return areas correspond to lower standard deviations and higher means across the model-derived assessments. Overall, this study introduces a new framework for quantifying the uncertainty of multiple training and evaluation models, aimed at improving their robustness and reliability. By identifying low-risk, high-return areas, resource allocation for geologic hazard prevention and control can be optimized, ensuring that limited resources are directed toward the most effective measures.
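The uncertainty-quantification loop can be sketched as below, assuming synthetic stand-in data and 10 resamples instead of the study's 40: redraw the random negative set, refit the forest, and summarise AUC by mean and standard deviation:

```python
# Sketch: repeated random negative sampling to quantify model uncertainty.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=(200, 5))         # known hazard locations
neg_pool = rng.normal(0.0, 1.0, size=(2000, 5))   # candidate negative area

aucs = []
for seed in range(10):                             # 10 random NTDs (paper: 40)
    neg = neg_pool[rng.choice(2000, size=200, replace=False)]
    X = np.vstack([pos, neg])
    y = np.r_[np.ones(200), np.zeros(200)]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))

print(round(float(np.mean(aucs)), 3), round(float(np.std(aucs)), 3))
```

The mean plays the "return" role and the standard deviation the "risk" role; in the study this summary is computed per location on the log odds ratios rather than on the global AUC.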
Abstract: The agricultural Internet of Things (IoT) is a critical component of modern smart agriculture, and methods for assessing its security risks have garnered increasing attention from the industry. Current agricultural IoT security risk assessment methods rely primarily on expert judgment, introducing subjective factors that reduce the credibility of the assessment results. To address this issue, this study constructed a dataset for agricultural IoT security risk assessment based on real-world security reports and proposed the PCARF algorithm, built on random forest principles and incorporating ensemble learning strategies to enhance prediction accuracy. Compared to the second-best model, the proposed model demonstrated a 2.7% increase in accuracy, a 3.4% improvement in recall, a 3.1% rise in area under the curve (AUC), and a 7.9% boost in Matthews correlation coefficient (MCC). Extensive comparative experiments showed that the proposed model outperforms others in prediction accuracy and robustness.
基金support in providing the data and the Universiti Teknologi Malaysia supported this work under UTM Flagship CoE/RG-Coe/RG 5.2:Evaluating Surface PGA with Global Ground Motion Site Response Analyses for the highest seismic activity location in Peninsular Malaysia(Q.J130000.5022.10G47)Universiti Teknologi Malaysia-Earthquake Hazard Assessment in Peninsular Malaysia Using Probabilistic Seismic Hazard Analysis(PSHA)Method(Q.J130000.21A2.06E9).
Abstract: The prediction of slope stability is a complex nonlinear problem. This paper proposes a new method based on the random forest (RF) algorithm to study rock slope stability. Taking Bukit Merah, Perak and Twin Peak (Kuala Lumpur) as the study areas, the geometrical characteristics of the slopes are obtained through a multidisciplinary approach consisting of geological, geotechnical, and remote sensing analyses. Eighteen factors, including rock strength, rock quality designation (RQD), joint spacing, continuity, openness, roughness, filling, weathering, water seepage, temperature, vegetation index, water index, and orientation, are selected as model input variables, while the factor of safety (FOS) serves as the output. The area under the receiver operating characteristic (ROC) curve (AUC), together with precision and accuracy, is used to analyse the predictive ability of the model. With a large training set and the predicted parameters, an AUC of 0.95 is achieved, and a precision score of 0.88 indicates that the model has a low false positive rate and correctly identifies a substantial number of true positives. The findings emphasise the importance of using a variety of terrain characteristics and different approaches to characterise the rock slope.