This paper explores the synergistic effect of a model combining Elastic Net and Random Forest in online fraud detection. The study selects a public network dataset containing 1781 data records, divides the dataset into 70% for training and 30% for validation, and analyses the correlation between features using a correlation matrix. The experimental results show that the Elastic Net feature selection method generally outperforms PCA across the models, especially when combined with the Random Forest and XGBoost models. The ElasticNet+Random Forest model achieves the highest accuracy of 0.968 and an AUC value of 0.983, while the Kappa and MCC reach 0.839 and 0.844 respectively, showing extremely high consistency and correlation. This indicates that combining Elastic Net feature selection with the Random Forest model has significant performance advantages in online fraud detection.
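The Kappa and MCC values quoted above are both derived from a confusion matrix. As a minimal sketch of how such values are typically computed for a binary classifier (the function name and the counts below are hypothetical and are not taken from the paper):

```python
import math


def kappa_and_mcc(tp, fp, fn, tn):
    """Return (Cohen's kappa, Matthews correlation coefficient) for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the predicted and actual class totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return kappa, mcc


# Hypothetical counts for illustration only.
kappa, mcc = kappa_and_mcc(tp=40, fp=5, fn=10, tn=45)
print(round(kappa, 3), round(mcc, 3))
```

Both statistics correct for chance agreement, which is why they are often reported alongside accuracy on imbalanced problems such as fraud detection.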
Leaf area index (LAI) is a key parameter for describing vegetation structures and is closely associated with vegetative photosynthesis and energy balance. The accurate retrieval of LAI is important when modeling biophysical processes of vegetation and the productivity of earth systems. The Random Forests (RF) method aggregates an ensemble of decision trees to improve the prediction accuracy and demonstrates a more robust capacity than other regression methods. This study evaluated the RF method for predicting grassland LAI using ground measurements and remote sensing data. Parameter optimization and variable reduction were conducted before model prediction. Two variable reduction methods were examined: the Variable Importance Value method and the principal component analysis (PCA) method. Finally, the sensitivity of RF to highly correlated variables was tested. The results showed that the RF parameters have a small effect on the performance of RF, and a satisfactory prediction was acquired with a root mean square error (RMSE) of 0.1956. The two variable reduction methods for RF prediction produced different results: variable reduction based on the Variable Importance Value method achieved nearly the same prediction accuracy as prediction without variable reduction, whereas variable reduction using the PCA method produced an obviously degraded result that may have been caused by the loss of subtle variations and the fusion of noise information. After removing highly correlated variables, the relative variable importance remained steady, and the variables selected based on the best-performing vegetation indices performed better than the variables with all vegetation indices or those selected based on the most important one. The results in this study demonstrate the practical and powerful ability of the RF method in predicting grassland LAI, which can also be applied to the estimation of other vegetation traits as an alternative to conventional empirical regression models and to the selection of relevant variables used in ecological models.
Massive Open Online Courses (MOOCs) have become a popular way of online learning used across the world by millions of people. Meanwhile, a vast amount of information has been collected from MOOC learners and institutions. Based on this educational data, many studies have investigated the prediction of MOOC learners' final grades. However, there are still two problems in this research field. The first problem is how to select the most appropriate features to improve the prediction accuracy, and the second problem is how to use or modify data mining algorithms for a better analysis of the MOOC data. To solve these two problems, an improved random forests method is proposed in this paper. First, a hybrid indicator is defined to measure the importance of the features, and a rule is further established for the feature selection; then, a Clustering-Synthetic Minority Over-sampling Technique (SMOTE) is embedded into the traditional random forests algorithm to solve the class imbalance problem. In the experimental part, we verify the performance of the proposed method using the Canvas Network Person-Course (CNPC) dataset. Furthermore, four well-known prediction methods have been applied for comparison, and the superiority of our method is demonstrated.
On-site programming big data refers to the massive data generated in the process of software development, characterized by real-time generation, complexity and high processing difficulty. Therefore, data cleaning is essential for on-site programming big data. Duplicate data detection is an important step in data cleaning, which can save storage resources and enhance data consistency. Due to the shortcomings of the traditional Sorted Neighborhood Method (SNM) and the difficulty of high-dimensional data detection, an optimized algorithm based on random forests with a dynamic and adaptive window size is proposed. The efficiency of the algorithm can be improved by improving the key-selection method, reducing the dimension of the data set and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
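For context, the traditional SNM that the abstract takes as its starting point sorts the records by a key and compares each record only with its neighbours inside a sliding window. The following is a minimal sketch of that fixed-window baseline, not of the optimized random-forest variant; the function name, the similarity measure (`difflib.SequenceMatcher`), the threshold and the sample records are all assumptions made for illustration:

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return index pairs of likely duplicates using a fixed-size sliding window."""
    # Sort record indices by the chosen key so similar records end up adjacent.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records in sort order.
        for j in order[pos + 1 : pos + window]:
            ratio = SequenceMatcher(None, records[i], records[j]).ratio()
            if ratio >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical records for illustration only.
records = [
    "john smith, beijing",
    "alice wong, shanghai",
    "jon smith, beijing",
    "bob lee, wuhan",
]
dupes = sorted_neighborhood(records, key=lambda r: r.replace(" ", "").lower())
print(dupes)
```

A fixed window trades recall for speed: duplicates that sort far apart are missed, which is the kind of limitation an adaptive window size is meant to address.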
Machine learning has emerged as a pivotal tool for deciphering and managing the excess of information in an era of abundant data. This paper presents a comprehensive analysis of machine learning algorithms, focusing on the structure and efficacy of random forests in mitigating overfitting, a prevalent issue in decision tree models. It also introduces a novel approach to enhancing decision tree performance through an optimized pruning method called Adaptive Cross-Validated Alpha CCP (ACV-CCP). This method refines traditional cost complexity pruning by streamlining the selection of the alpha parameter, leveraging cross-validation within the pruning process to achieve a reliable, computationally efficient alpha selection that generalizes well to unseen data. By enhancing computational efficiency and balancing model complexity, ACV-CCP allows decision trees to maintain predictive accuracy while minimizing overfitting, effectively narrowing the performance gap between decision trees and random forests. Our findings illustrate how ACV-CCP contributes to the robustness and applicability of decision trees, providing a valuable perspective on achieving computationally efficient and generalized machine learning models.
In materials science, data-driven methods accelerate material discovery and optimization while reducing costs and improving success rates. Symbolic regression, in particular the Sure Independence Screening and Sparsifying Operator (SISSO) method, is key to extracting material descriptors from large datasets. However, SISSO needs to store the entire expression space, which imposes heavy memory demands and limits its performance on complex problems. To address this issue, we propose an RF-SISSO algorithm that combines Random Forests (RF) with SISSO. In this algorithm, the Random Forests algorithm is used for prescreening, capturing non-linear relationships and improving feature selection, which may enhance the quality of the input data and boost the accuracy and efficiency of regression and classification tasks. In a test on SISSO's verification problem for 299 materials, RF-SISSO demonstrates robust performance and high accuracy. RF-SISSO maintains a testing accuracy above 0.9 across all four training sample sizes and significantly enhances regression efficiency, especially for training subsets with smaller sample sizes. For the training subset with 45 samples, the efficiency of RF-SISSO was 265 times higher than that of the original SISSO. As collecting large datasets is both costly and time-consuming in practical experiments, RF-SISSO may benefit scientific research by efficiently offering high prediction accuracy with limited data.
The medical industry generates vast amounts of data suitable for machine learning during patient-clinician interaction in hospitals. However, as a result of data protection regulations like the General Data Protection Regulation (GDPR), patient data cannot be shared freely across institutions. In these cases, federated learning (FL) is a viable option where a global model learns from multiple data sites without moving the data. In this paper, we focused on random forests (RFs) for their effectiveness in classification tasks and widespread use throughout the medical industry, and compared two popular federated random forest aggregation algorithms on horizontally partitioned data. We first provided the necessary background information on federated learning, the advantages of random forests in a medical context, and the two aggregation algorithms. A series of extensive experiments using four public binary medical datasets (an excerpt of MIMIC III, the Pima Indian diabetes dataset from Kaggle, and the diabetic retinopathy and heart failure datasets from the UCI machine learning repository) were then performed to systematically compare the two on equal-sized, unequal-sized, and class-imbalanced clients. A follow-up investigation on the effects of more clients was also conducted. Finally, we empirically analyzed the advantages of federated learning and concluded that the weighted merge algorithm produces models with, on average, a 1.903% higher F1 score and a 1.406% higher AUCROC value.
Slope units are divided according to the real topography and have clear geological characteristics, making them ideal units for evaluating the susceptibility to geological disasters. Based on the results of automatic and manually corrected hydrological slope unit division, the Longhua District, Shenzhen City, Guangdong Province, was selected as the study area. A total of 15 influencing factors, namely fluctuation, slope, slope aspect, curvature, topographic wetness index (TWI), stream power index (SPI), topographic roughness index (TRI), annual average rainfall, distance to water system, engineering rock group, distance to fault, land use, normalized difference vegetation index (NDVI), nighttime light, and distance to road, were selected as evaluation indicators. The information volume model (IV) and random points were used to select non-geological disaster units, and then the random forest model (RF) was used to evaluate the susceptibility to geological disasters. The automatic slope unit and the hydrological slope unit were compared and analyzed in the random forest and information volume random forest models. The results show that the area under the curve (AUC) values of the automatic slope unit evaluation results are 0.931 for the IV-RF model and 0.716 for the RF model, which are 0.6% (IV-RF model) and 1.9% (RF model) higher than those for the hydrological slope unit. A comparison of the evaluation methods based on the two types of slope units shows that the hydrological slope unit evaluation method based on manual correction is highly subjective, is complicated to operate, and has a low evaluation accuracy, whereas the evaluation method based on automatic slope unit division is efficient and accurate, is suitable for large-scale geological disaster evaluation, and can better deal with the problem of geological disaster susceptibility evaluation.
Although the concentration of fine particulate matter (PM2.5) is decreasing continuously, the proportion of secondary organic aerosols (SOA) in PM2.5 and the O3 levels are increasing. This is causing severe complex atmospheric pollution in North China. It is essential to identify and quantify the driving factors of SOA and O3, including the various pollution sources and meteorological factors. PM2.5 and volatile organic compound (VOC) samples were collected simultaneously in three cities in Shandong Province during different pollution scenarios from 2021 to 2023. Then, the carbonaceous aerosol and 99 VOC species were analyzed. Random forest (RF) combined with positive matrix factorization and an observation-based model (OBM) were used to quantify the key drivers of SOA and O3. Aromatic hydrocarbons were the main contributors to the secondary organic aerosol potential (74.3%-89.9%), whereas alkenes contributed the most to the ozone formation potential (27.0%-62.3%). The RF modeling identified temperature and NOx as the dominant drivers of ozone formation, accounting for 47.8% and 17.4%, respectively. Temperature showed a positive correlation with O3 because an increase in temperature can promote ozone formation. NOx had a significant negative correlation with O3, which was consistent with the conclusions from the sensitivity analysis of the OBM. The dominant contributors to SOA were vehicle emissions, solvent use, and industrial emissions, accounting for 43.9%, 18.2%, and 10.5%, respectively. An evident positive correlation existed between these emission sources and SOA.
In response to the challenges of inadequate predictive accuracy and limited generalization capability in data-driven modeling of the mechanical properties of cold-rolled strip steel, a predictive modeling method named RFR-WOA is developed based on random forest regression (RFR) and the whale optimization algorithm (WOA). Firstly, using Pearson and Spearman correlation analysis and Gini coefficient importance ranking on an actual production dataset containing 37,878 samples, 22 key variables are selected as model inputs from 112 variables that affect mechanical properties. Subsequently, an RFR-based predictive model for the mechanical properties of cold-rolled strip steel is constructed. Then, with the combination of the coefficient of determination (R²) and root mean square error as the optimization objective, the hyperparameters of the RFR model are iteratively optimized using WOA, and better predictive effectiveness is obtained. Finally, the mechanical properties prediction model based on RFR-WOA is compared with models established using deep neural networks, convolutional neural networks, and other methods. The test results on 9469 samples of actual production data show that the model developed here has better predictive accuracy and generalization capability.
To address the problems of wind power abandonment and the stoppage of electricity transmission caused by a short circuit in a power line of a doubly-fed induction generator (DFIG) based wind farm, this paper proposes an intelligent location method for a single-phase grounding fault based on a multiple random forests (multi-RF) algorithm. First, the simulation model is built, and the fundamental amplitudes of the zero-sequence currents are extracted by a fast Fourier transform (FFT) to construct the feature set. Then, the random forest classification algorithm is applied to establish the fault section locator. The data are resampled using the bootstrap method to generate multiple sample subsets, which are used to establish multiple classification and regression tree (CART) classifiers. The CART classifiers use the mean decrease in the node impurity as the feature importance, which is used to mine the relationship between features and fault sections. Subsequently, a fault section is identified by voting on the test results of each classifier. Finally, a multi-RF regression fault locator is built to output the predicted fault distance. Experimental results with PSCAD/EMTDC software show that the proposed method can overcome the shortcomings of a single RF and has the advantage of locating faults on a short hybrid overhead/cable line with multiple branches. Compared with support vector machines (SVMs) and previously reported methods, the proposed method better meets the location accuracy and efficiency requirements of a DFIG-based wind farm.
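The two generic random forest steps mentioned in this abstract, bootstrap resampling and voting across classifiers, can be sketched as follows. This is a minimal illustration under stated assumptions and not the paper's multi-RF fault locator; the function names, the feature rows and the section labels are hypothetical:

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (one bootstrap replicate)."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
# Hypothetical feature rows (e.g. zero-sequence current amplitudes), for illustration only.
features = [[0.9, 0.1], [0.2, 0.7], [0.4, 0.4], [0.8, 0.3]]
sample = bootstrap_sample(features, rng)

# Hypothetical per-tree predictions of the faulted section.
winner = majority_vote(["section 2", "section 1", "section 2"])
print(sample, winner)
```

Each tree would be trained on its own bootstrap replicate, and the forest's answer is the label chosen by the most trees. On a tie, `Counter.most_common` returns the label that was counted first.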
Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which only uses features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly due to its rigid feature selection and the increased correlations between trees of the forest. Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node to build up trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features. Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases. Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package "viRandomForests" based on the original R package "randomForest", and it can be freely downloaded from http://zhaocenter.org/software.
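The sampling step described in the Methods can be illustrated with a short sketch: at each node, candidate features are drawn with probability proportional to their importance scores rather than uniformly. This is a minimal illustration and not the viRandomForests implementation; the function name, the `mtry` parameter name, the importance values, and the choice to draw without replacement are assumptions, and the scores are assumed to be strictly positive:

```python
import random
from collections import Counter


def sample_features(importances, mtry, rng):
    """Draw mtry distinct feature indices with probability proportional to importance."""
    candidates = list(range(len(importances)))
    weights = list(importances)  # assumed strictly positive
    chosen = []
    for _ in range(min(mtry, len(candidates))):
        # Weighted draw of one position among the remaining candidates.
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return chosen


rng = random.Random(0)
importances = [0.5, 0.3, 0.15, 0.05]  # hypothetical importance scores
picked = sample_features(importances, mtry=2, rng=rng)

# Count how often each feature is drawn first over many repetitions.
first_picks = Counter(sample_features(importances, 1, rng)[0] for _ in range(2000))
print(picked, first_picks)
```

Low-importance features keep a non-zero chance of being drawn, which matches the abstract's point that less informative features are not completely ignored.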
The random forests (RF) algorithm, which combines the predictions from an ensemble of random trees, has achieved significant improvements in terms of classification accuracy. In many real-world applications, however, ranking is often required in order to make optimal decisions. Thus, we focus our attention on the ranking performance of RF in this paper. Our experimental results based on all 36 UC Irvine Machine Learning Repository (UCI) data sets published on the main website of the Weka platform show that RF does not perform well in ranking, and is even about the same as a single C4.4 tree. This fact raises the question of whether several improvements to RF can scale up its ranking performance. To answer this question, we single out an improved random forests (IRF) algorithm. Instead of the information gain measure and the maximum-likelihood estimate, the average gain measure and the similarity-weighted estimate are used in IRF. Our experiments show that IRF significantly outperforms all the other compared algorithms in terms of ranking while maintaining the high classification accuracy that characterizes RF.
Alzheimer's disease (AD) is a serious neurodegenerative disorder and its cause remains largely elusive. In past years, genome-wide association (GWA) studies have provided an effective means for AD research. However, the univariate method that is commonly used in GWA studies cannot effectively detect the biological mechanisms associated with this disease. In this study, we propose a new strategy for the GWA analysis of AD that combines random forests with enrichment analysis. First, backward feature selection using random forests was performed on a GWA dataset of AD patients carrying the apolipoprotein gene (APOEε4), and 1058 susceptible single nucleotide polymorphisms (SNPs) were detected, including several known AD-associated SNPs. Next, the susceptible SNPs were investigated by enrichment analysis, and significantly associated gene functional annotations, such as 'alternative splicing', 'glycoprotein', and 'neuron development', were successfully discovered, indicating that these biological mechanisms play important roles in the development of AD in APOEε4 carriers. These findings may provide insights into the pathogenesis of AD and helpful guidance for further studies. Furthermore, this strategy can easily be modified and applied to GWA studies of other complex diseases.
Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases based on metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approaches for predicting disease status based on metagenomic samples. Using six human gut metagenomic datasets consisting of large numbers of colorectal cancer patients and healthy controls from different countries, we investigated different software packages for extracting relative abundances of known microbial genomes and for integrating mapping and assembly approaches to obtain the relative abundance profiles of both known and novel genomes. The random forests (RF) classification algorithm was then used to predict colorectal cancer status based on the microbial relative abundance profiles. Based on within-dataset cross-validation and cross-dataset prediction, we show that the RF prediction performance using the microbial relative abundance profiles estimated by Centrifuge is generally higher than that using the profiles estimated by MetaPhlAn2 and Bracken. We also develop a novel method to integrate the relative abundance profiles of both known and novel microbial organisms to further increase the prediction performance for colorectal cancer from metagenomes.
Background: To create and validate nomograms for the personalized prediction of survival in octogenarians with newly diagnosed non-small-cell lung cancer (NSCLC) with sole brain metastases (BMs). Methods: Random forests (RF) were applied to identify independent prognostic factors for building nomogram models. The predictive accuracy of the model was evaluated based on the receiver operating characteristic (ROC) curve, C-index, and calibration plots. Results: The area under the curve (AUC) values for overall survival at 6, 12, and 18 months in the validation cohort were 0.837, 0.867, and 0.849, respectively; the AUC values for cancer-specific survival prediction were 0.819, 0.835, and 0.818, respectively. The calibration curves visualized the accuracy of the model. Conclusion: The new nomograms have good predictive power for survival among octogenarians with sole BMs related to NSCLC.
Recent studies have pointed out that the widespread iron deposits in the southwestern Fujian metallogenic belt (SFMB), China, are skarn-type deposits associated with the Yanshanian granites. There is still excellent potential for mineral exploration because large areas in this belt are covered by forest. This study develops a new predictive model for mapping skarn-type Fe deposit prospectivity in this belt, using five criteria as evidence: (1) the contact zones of Yanshanian granites (GRANITE); (2) the contact zones within the late Paleozoic marine sedimentary rocks and the carbonate formations (FORMATION); (3) the NE-NNE-trending faults (FAULT); (4) the zones of skarn alterations (SKARN); and (5) the aeromagnetic anomaly (AEROMAGNETIC). The fuzzy weights of evidence (FWofE) method, developed from the classical weights of evidence (WofE) and based on fuzzy sets and fuzzy probabilities, can provide smaller variances and more accurate posterior probabilities, can effectively minimize the uncertainty caused by omitted or wrongly assigned data, and is more flexible than WofE. It is an efficient and widely used method for mineral potential mapping. Random forests (RF) is a new and useful method for data-driven predictive mapping of mineral prospectivity that needs further scrutiny. The prospectivity results obtained with the FWofE and RF methods both reveal that the prediction model for the skarn-type Fe deposits in the SFMB is successful and efficient. Both methods suggest that GRANITE and FORMATION are the most valuable evidence maps, followed by SKARN, AEROMAGNETIC, and FAULT. This is consistent with the skarn-type Fe deposit mineral model in the SFMB. The unstable performance experienced when FORMATION was omitted might indicate that the highest uncertainty and risk in follow-up exploration is related to the sequences. In addition, the RF method performs better than FWofE for skarn-type Fe deposit prospectivity in the SFMB; therefore, it could be used to guide further exploration of skarn-type Fe prospects in the SFMB.
Purpose: Ensemble methods have been widely used in the field of pattern recognition due to the difficulty of finding a single classifier that performs well on a wide variety of problems. Despite the effectiveness of these techniques, studies have shown that ensemble methods generate a large number of hypotheses that in most cases contain redundant classifiers. Several state-of-the-art works attempt to reduce the set of hypotheses without affecting performance. Design/methodology/approach: In this work, the authors propose a pruning method that takes into consideration the correlation between classifiers and classes, and between each classifier and the rest of the set. The authors have used the random forest algorithm as the tree-based ensemble classifier, and the pruning was performed by a technique inspired by the CFS (correlation feature selection) algorithm. Findings: The proposed method, CES (correlation-based Ensemble Selection), was evaluated on ten datasets from the UCI machine learning repository, and its performance was compared to six ensemble pruning techniques. The results showed that the proposed pruning method selects a small ensemble in a smaller amount of time while improving classification rates compared to the state-of-the-art methods. Originality/value: CES is a new ordering-based method that uses the CFS algorithm. CES selects, in a short time, a small sub-ensemble that outperforms the results obtained from the whole forest and the other state-of-the-art techniques used in this study.
Many of the best predictors for complex problems are typically regarded as hard to interpret physically. These include kernel methods, Shtarkov solutions, and random forests. We show that, despite the inability to interpret these three predictors to infinite precision, they can be asymptotically approximated and admit conceptual interpretations in terms of their mathematical/statistical properties. The resulting expressions can be in terms of polynomials, basis elements, or other functions that an analyst may regard as interpretable.
Funding: Guangdong Innovation and Entrepreneurship Training Programme for Undergraduates, “Automatic Classification and Identification of Fraudulent Websites Based on Machine Learning” (Project No.: DC2023125).
Abstract: This paper explores the synergistic effect of a model combining Elastic Net and Random Forest in online fraud detection. The study selects a public network dataset containing 1781 data records, divides the dataset into 70% for training and 30% for validation, and analyses the correlation between features using a correlation matrix. The experimental results show that the Elastic Net feature selection method generally outperforms PCA across the models, especially when combined with the Random Forest and XGBoost models. The ElasticNet+Random Forest model achieves the highest accuracy of 0.968 and an AUC value of 0.983, while the Kappa and MCC reach 0.839 and 0.844 respectively, showing extremely high consistency and correlation. This indicates that combining Elastic Net feature selection with the Random Forest model has significant performance advantages in online fraud detection.
Funding: Funded by the Key Technologies Research and Development Program of China (2013BAC03B02, 2012BAC19B04), the International Science and Technology Cooperation Project of China (2012DFA31290), and the Earmarked Fund for Modern Agro-industry Technology Research System, China (CARS-35).
Abstract: Leaf area index (LAI) is a key parameter for describing vegetation structures and is closely associated with vegetative photosynthesis and energy balance. The accurate retrieval of LAI is important when modeling biophysical processes of vegetation and the productivity of earth systems. The Random Forests (RF) method aggregates an ensemble of decision trees to improve the prediction accuracy and demonstrates a more robust capacity than other regression methods. This study evaluated the RF method for predicting grassland LAI using ground measurements and remote sensing data. Parameter optimization and variable reduction were conducted before model prediction. Two variable reduction methods were examined: the Variable Importance Value method and the principal component analysis (PCA) method. Finally, the sensitivity of RF to highly correlated variables was tested. The results showed that the RF parameters have a small effect on the performance of RF, and a satisfactory prediction was acquired with a root mean square error (RMSE) of 0.1956. The two variable reduction methods for RF prediction produced different results: variable reduction based on the Variable Importance Value method achieved nearly the same prediction accuracy as prediction without variable reduction, whereas variable reduction using the PCA method produced an obviously degraded result that may have been caused by the loss of subtle variations and the fusion of noise information. After removing highly correlated variables, the relative variable importance remained steady, and the variables selected based on the best-performing vegetation indices performed better than the variables with all vegetation indices or those selected based on the most important one. The results in this study demonstrate the practical and powerful ability of the RF method in predicting grassland LAI, which can also be applied to the estimation of other vegetation traits as an alternative to conventional empirical regression models and to the selection of relevant variables used in ecological models.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61801222, in part by the Fundamental Research Funds for the Central Universities under Grant No. 30919011230, and in part by the Jiangsu Provincial Department of Education Degree and Graduate Education Research Fund under Grant No. JGZD18_012.
Abstract: Massive Open Online Courses (MOOCs) have become a popular way of learning online, used by millions of people across the world. Meanwhile, a vast amount of information has been collected from MOOC learners and institutions. Based on this educational data, many studies have investigated the prediction of MOOC learners' final grades. However, two problems remain in this research field. The first is how to select the most appropriate features to improve the prediction accuracy, and the second is how to use or modify data mining algorithms for a better analysis of MOOC data. To solve these two problems, an improved random forests method is proposed in this paper. First, a hybrid indicator is defined to measure the importance of the features, and a rule is established for feature selection; then, a Clustering-Synthetic Minority Over-sampling Technique (SMOTE) is embedded into the traditional random forests algorithm to solve the class imbalance problem. In the experiments, we verify the performance of the proposed method using the Canvas Network Person-Course (CNPC) dataset. Furthermore, four well-known prediction methods are applied for comparison, and the results demonstrate the superiority of our method.
Funding: Supported by the National Key R&D Program of China (No. 2018YFB1003905), the National Natural Science Foundation of China under Grant No. 61971032, and the Fundamental Research Funds for the Central Universities (No. FRF-TP-18-008A3).
Abstract: On-site programming big data refers to the massive data generated in the process of software development, characterized by real-time generation, complexity, and difficulty of processing. Data cleaning is therefore essential for on-site programming big data. Duplicate data detection is an important step in data cleaning, which can save storage resources and enhance data consistency. To address the shortcomings of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic and adaptive window size is proposed. The efficiency of the algorithm is raised by improving the key-selection method, reducing the dimension of the data set, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
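The baseline that this abstract improves on, the traditional Sorted Neighborhood Method, can be sketched as follows. This is a minimal fixed-window illustration only, not the paper's random-forest-based adaptive variant; the function names, the `window` parameter, and the toy records are all hypothetical.

```python
def sorted_neighborhood(records, key, is_duplicate, window=3):
    """Return index pairs flagged as duplicates by one fixed-window SNM pass."""
    # Sort record indices by the blocking key so likely duplicates become neighbours.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            if is_duplicate(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def same_name_ignoring_case(a, b):
    # Hypothetical matching rule used only for this illustration.
    return a["name"].lower() == b["name"].lower()


records = [
    {"name": "Alice Smith", "lang": "python"},
    {"name": "bob jones", "lang": "go"},
    {"name": "alice smith", "lang": "python"},
    {"name": "Carol White", "lang": "rust"},
]
duplicates = sorted_neighborhood(
    records,
    key=lambda r: r["name"].lower(),
    is_duplicate=same_name_ignoring_case,
    window=2,
)
print(duplicates)
```

A fixed window trades recall for speed: a window that is too small misses duplicates whose keys sort far apart, which is the kind of limitation an adaptive window size is meant to address.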
Abstract: In an era of abundant data, machine learning has emerged as a pivotal tool for deciphering and managing the excess of information. This paper presents a comprehensive analysis of machine learning algorithms, focusing on the structure and efficacy of random forests in mitigating overfitting, a prevalent issue in decision tree models. It also introduces a novel approach to enhancing decision tree performance through an optimized pruning method called Adaptive Cross-Validated Alpha CCP (ACV-CCP). This method refines traditional cost complexity pruning by streamlining the selection of the alpha parameter, leveraging cross-validation within the pruning process to achieve a reliable, computationally efficient alpha selection that generalizes well to unseen data. By enhancing computational efficiency and balancing model complexity, ACV-CCP allows decision trees to maintain predictive accuracy while minimizing overfitting, effectively narrowing the performance gap between decision trees and random forests. Our findings illustrate how ACV-CCP contributes to the robustness and applicability of decision trees, providing a valuable perspective on achieving computationally efficient and generalized machine learning models.
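For orientation, the traditional cost complexity pruning criterion that this abstract refers to can be written as below. This is the standard textbook formulation, not the ACV-CCP procedure itself, whose details the abstract does not give; the fold count K and the per-fold error Err_k are notation assumed here for illustration.

```latex
% Cost-complexity of a subtree T with |\tilde{T}| leaves and training error R(T)
R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|, \qquad \alpha \ge 0

% Effective alpha of an internal node t with branch T_t
% (weakest-link pruning collapses the node with the smallest g(t) first)
g(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}

% Generic K-fold cross-validated choice of alpha (assumed notation)
\hat{\alpha} = \arg\min_{\alpha} \frac{1}{K} \sum_{k=1}^{K} \mathrm{Err}_k\bigl(T_\alpha^{(-k)}\bigr)
```

Larger values of alpha penalize leaves more heavily and therefore yield smaller trees, so the choice of alpha directly controls the trade-off between fit and complexity.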
Funding: Supported by the National Natural Science Foundation of China (Nos. 21933006 and 21773124), the Fundamental Research Funds for the Central Universities of Nankai University (Nos. 63243091 and 63233001), and the Supercomputing Center of Nankai University (NKSC).
Abstract: In materials science, data-driven methods accelerate material discovery and optimization while reducing costs and improving success rates. Symbolic regression, in particular the Sure Independence Screening and Sparsifying Operator (SISSO) method, is key to extracting material descriptors from large datasets. However, SISSO needs to store the entire expression space, which imposes heavy memory demands and limits its performance on complex problems. To address this issue, we propose an RF-SISSO algorithm that combines Random Forests (RF) with SISSO. In this algorithm, Random Forests are used for prescreening, capturing non-linear relationships and improving feature selection, which may enhance the quality of the input data and boost accuracy and efficiency on regression and classification tasks. In a test on SISSO's verification problem for 299 materials, RF-SISSO demonstrates robust performance and high accuracy. RF-SISSO maintains a testing accuracy above 0.9 across all four training sample sizes and significantly enhances regression efficiency, especially in training subsets with smaller sample sizes. For the training subset with 45 samples, the efficiency of RF-SISSO was 265 times higher than that of the original SISSO. As collecting large datasets is both costly and time-consuming in practical experiments, RF-SISSO may benefit scientific research by efficiently offering high prediction accuracy with limited data.
Abstract: The medical industry generates vast amounts of data suitable for machine learning during patient-clinician interaction in hospitals. However, as a result of data protection regulations such as the General Data Protection Regulation (GDPR), patient data cannot be shared freely across institutions. In these cases, federated learning (FL) is a viable option in which a global model learns from multiple data sites without moving the data. In this paper, we focus on random forests (RFs), chosen for their effectiveness in classification tasks and widespread use throughout the medical industry, and compare two popular federated random forest aggregation algorithms on horizontally partitioned data. We first provide the necessary background on federated learning, the advantages of random forests in a medical context, and the two aggregation algorithms. A series of extensive experiments using four public binary medical datasets (an excerpt of MIMIC III, the Pima Indian diabetes dataset from Kaggle, and the diabetic retinopathy and heart failure datasets from the UCI machine learning repository) is then performed to systematically compare the two algorithms on equal-sized, unequal-sized, and class-imbalanced clients. A follow-up investigation of the effects of more clients is also conducted. Finally, we empirically analyze the advantages of federated learning and conclude that the weighted merge algorithm produces models with, on average, a 1.903% higher F1 score and a 1.406% higher AUCROC value.
Abstract: Slope units are divided according to the real topography and have clear geological characteristics, making them ideal units for evaluating susceptibility to geological disasters. Based on the results of automatic slope unit division and manually corrected hydrological slope unit division, the Longhua District, Shenzhen City, Guangdong Province, was selected as the study area. A total of 15 influencing factors, namely fluctuation, slope, slope aspect, curvature, topographic wetness index (TWI), stream power index (SPI), topographic roughness index (TRI), annual average rainfall, distance to water system, engineering rock group, distance to fault, land use, normalized difference vegetation index (NDVI), nighttime light, and distance to road, were selected as evaluation indicators. The information volume (IV) model and random points were used to select non-geological disaster units, and the random forest (RF) model was then used to evaluate the susceptibility to geological disasters. The automatic slope unit and the hydrological slope unit were compared and analyzed in the random forest and information volume random forest models. The results show that the area under the curve (AUC) values of the automatic slope unit evaluation results are 0.931 for the IV-RF model and 0.716 for the RF model, which are 0.6% (IV-RF model) and 1.9% (RF model) higher than those for the hydrological slope unit. A comparison of the evaluation methods based on the two types of slope units shows that the manually corrected hydrological slope unit method is highly subjective, complicated to operate, and low in evaluation accuracy, whereas the method based on automatic slope unit division is efficient and accurate, is suitable for large-scale geological disaster evaluation, and can better handle geological disaster susceptibility evaluation.
Funding: Supported by the Qingdao Natural Science Foundation (No. 23-2-1-224-zyyd-jch).
Abstract: Although the concentration of fine particulate matter (PM2.5) is decreasing continuously, the proportion of secondary organic aerosols (SOA) in PM2.5 and the O3 levels are increasing. This is causing severe complex atmospheric pollution in North China. It is essential to identify and quantify the driving factors of SOA and O3, including the various pollution sources and meteorological factors. PM2.5 and volatile organic compound (VOC) samples were collected simultaneously in three cities in Shandong Province during different pollution scenarios from 2021 to 2023. The carbonaceous aerosol and 99 VOC species were then analyzed. Random forest (RF) combined with positive matrix factorization and an observation-based model (OBM) were used to quantify the key drivers of SOA and O3. Aromatic hydrocarbons were the main contributors to the secondary organic aerosol potential (74.3%-89.9%), whereas alkenes contributed the most to the ozone formation potential (27.0%-62.3%). The RF modeling identified temperature and NOx as the dominant drivers of ozone formation, accounting for 47.8% and 17.4%, respectively. Temperature showed a positive correlation with O3 because an increase in temperature can promote ozone formation. NOx had a significant negative correlation with O3, which was consistent with the conclusions from the sensitivity analysis of the OBM. The dominant contributors to SOA were vehicle emissions, solvent use, and industrial emissions, accounting for 43.9%, 18.2%, and 10.5%, respectively. An evident positive correlation existed between these emission sources and SOA.
Funding: Supported by the National Natural Science Foundation of China (Grant 62573375), the Natural Science Foundation of Hebei Province (Grant F2024203038), the Science and Technology Research and Development Plan Project of Qinhuangdao City (Grant 202302B048), the Provincial Key Laboratory Performance Subsidy Project (Grant 22567612H), and the Shandong Provincial Natural Science Foundation Youth Project (ZR2023QF044).
Abstract: In response to the challenges of inadequate predictive accuracy and limited generalization capability in data-driven modeling of the mechanical properties of cold-rolled strip steel, a predictive modeling method named RFR-WOA is developed based on random forest regression (RFR) and the whale optimization algorithm (WOA). First, using Pearson and Spearman correlation analysis and Gini coefficient importance ranking on an actual production dataset containing 37,878 samples, 22 key variables are selected as model inputs from 112 variables that affect mechanical properties. Subsequently, an RFR-based predictive model for the mechanical properties of cold-rolled strip steel is constructed. Then, with the combination of the coefficient of determination (R²) and the root mean square error as the optimization objective, the hyperparameters of the RFR model are iteratively optimized using WOA, which yields better predictive performance. Finally, the mechanical properties prediction model based on RFR-WOA is compared with models established using deep neural networks, convolutional neural networks, and other methods. The test results on 9469 samples of actual production data show that the model developed here has better predictive accuracy and generalization capability.
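The correlation-based screening step mentioned in this abstract can be illustrated with a small sketch. It assumes a simple rule (keep a variable if the absolute Pearson or Spearman coefficient with the target reaches a threshold); the abstract does not state the actual rule or threshold, and the variable names, data, and `threshold` value below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def screen(features, target, threshold=0.5):
    """Keep features whose |Pearson| or |Spearman| with the target reaches the threshold."""
    kept = []
    for name, column in features.items():
        score = max(abs(pearson(column, target)), abs(spearman(column, target)))
        if score >= threshold:
            kept.append(name)
    return kept


target = [1, 2, 3, 4, 5]
features = {
    "x_linear": [2, 4, 6, 8, 10],       # linear in the target
    "x_monotone": [1, 8, 27, 64, 125],  # monotone but non-linear
    "x_noise": [3, 1, 4, 1, 5],         # weakly related
}
kept = screen(features, target)
print(kept)
```

Using both coefficients is a common design choice because Spearman captures monotone non-linear relationships that Pearson understates.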
Funding: Supported in part by the National Natural Science Foundation of China (No. 51677072).
Abstract: To address the problems of wind power abandonment and the stoppage of electricity transmission caused by a short circuit in a power line of a doubly-fed induction generator (DFIG) based wind farm, this paper proposes an intelligent location method for a single-phase grounding fault based on a multiple random forests (multi-RF) algorithm. First, the simulation model is built, and the fundamental amplitudes of the zero-sequence currents are extracted by a fast Fourier transform (FFT) to construct the feature set. Then, the random forest classification algorithm is applied to establish the fault section locator. The model is resampled on the basis of the bootstrap method to generate multiple sample subsets, which are used to establish multiple classification and regression tree (CART) classifiers. The CART classifiers use the mean decrease in node impurity as the feature importance, which is used to mine the relationship between features and fault sections. Subsequently, a fault section is identified by voting on the test results of each classifier. Finally, a multi-RF regression fault locator is built to output the predicted fault distance. Experimental results with PSCAD/EMTDC software show that the proposed method can overcome the shortcomings of a single RF and has the advantage of locating faults on a short hybrid overhead/cable line with multiple branches. Compared with support vector machines (SVMs) and previously reported methods, the proposed method better meets the location accuracy and efficiency requirements of a DFIG-based wind farm.
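The bootstrap-then-vote mechanism described in this abstract can be sketched as follows. To stay short, the sketch swaps the CART classifiers for a nearest-centroid stand-in on a single feature, so it illustrates only the resampling and majority-voting steps; the feature values, section labels, and function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap(samples, rng):
    """Draw len(samples) items with replacement."""
    return [rng.choice(samples) for _ in samples]


def fit_centroids(samples):
    """Stand-in base learner: mean feature value per class (not a CART tree)."""
    sums, counts = {}, Counter()
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_one(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def fit_ensemble(samples, n_models, seed=0):
    """Train one base learner per bootstrap subset."""
    rng = random.Random(seed)
    return [fit_centroids(bootstrap(samples, rng)) for _ in range(n_models)]


def vote(ensemble, x):
    """Majority vote over the base learners' predictions."""
    ballots = Counter(predict_one(model, x) for model in ensemble)
    return ballots.most_common(1)[0][0]


# Hypothetical (feature value, fault section) training pairs.
samples = [
    (0.9, "section_A"), (1.1, "section_A"), (1.0, "section_A"),
    (3.0, "section_B"), (3.2, "section_B"), (2.9, "section_B"),
]
ensemble = fit_ensemble(samples, n_models=25)
print(vote(ensemble, 1.05), vote(ensemble, 3.1))
```

Individual bootstrap subsets may miss a class entirely, which is one reason the final decision is taken by voting across many learners rather than trusting any single one.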
Abstract: Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly due to its rigid feature selection and the increased correlations between the trees of the forest. Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node to build up trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features. Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases. Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package "viRandomForests" based on the original R package "randomForest"; it can be freely downloaded from http://zhaocenter.org/software.
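The node-level sampling idea in the Methods sentence can be sketched as below. This is an illustration under assumptions, not the viRandomForests implementation: it assumes features are drawn without replacement with probability proportional to strictly positive importance scores, and the feature names and scores are hypothetical.

```python
import random
from collections import Counter


def sample_features(importances, mtry, rng):
    """Draw up to mtry distinct features, each draw proportional to remaining importance.

    Assumes every importance score is strictly positive.
    """
    remaining = dict(importances)
    chosen = []
    for _ in range(min(mtry, len(remaining))):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # without replacement: a feature is tried at most once per node
    return chosen


# Hypothetical variable importance scores from a preliminary forest.
importances = {"gene_a": 0.60, "gene_b": 0.30, "gene_c": 0.09, "gene_d": 0.01}

rng = random.Random(42)
candidates = sample_features(importances, mtry=2, rng=rng)
print(candidates)

# Over many nodes, informative features are tried far more often,
# but weak ones still have a non-zero chance of being considered.
counts = Counter()
for _ in range(2000):
    counts.update(sample_features(importances, mtry=2, rng=rng))
print(counts)
```

In a full tree builder, the best split would then be searched only among the returned candidates, in the same way standard Random Forests search among a uniformly sampled subset.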
Abstract: The random forests (RF) algorithm, which combines the predictions from an ensemble of random trees, has achieved significant improvements in terms of classification accuracy. In many real-world applications, however, ranking is often required in order to make optimal decisions. Thus, we focus our attention on the ranking performance of RF in this paper. Our experimental results, based on all 36 UC Irvine Machine Learning Repository (UCI) data sets published on the main website of the Weka platform, show that RF does not perform well in ranking and is even about the same as a single C4.4 tree. This fact raises the question of whether several improvements to RF can scale up its ranking performance. To answer this question, we single out an improved random forests (IRF) algorithm. Instead of the information gain measure and the maximum-likelihood estimate, the average gain measure and the similarity-weighted estimate are used in IRF. Our experiments show that IRF significantly outperforms all the other algorithms used for comparison in terms of ranking while maintaining the high classification accuracy that characterizes RF.
Funding: Supported by the National Natural Science Foundation of China (Nos. 2100230024 and 2100230023).
Abstract: Alzheimer's disease (AD) is a serious neurodegenerative disorder and its cause remains largely elusive. In past years, genome-wide association (GWA) studies have provided an effective means for AD research. However, the univariate method that is commonly used in GWA studies cannot effectively detect the biological mechanisms associated with this disease. In this study, we propose a new strategy for the GWA analysis of AD that combines random forests with enrichment analysis. First, backward feature selection using random forests was performed on a GWA dataset of AD patients carrying the apolipoprotein gene (APOEε4), and 1058 susceptible single nucleotide polymorphisms (SNPs) were detected, including several known AD-associated SNPs. Next, the susceptible SNPs were investigated by enrichment analysis, and significantly associated gene functional annotations, such as 'alternative splicing', 'glycoprotein', and 'neuron development', were successfully discovered, indicating that these biological mechanisms play important roles in the development of AD in APOEε4 carriers. These findings may provide insights into the pathogenesis of AD and helpful guidance for further studies. Furthermore, this strategy can easily be modified and applied to GWA studies of other complex diseases.
Abstract: Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases based on metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approaches for predicting disease status based on metagenomic samples. Using six human gut metagenomic datasets consisting of large numbers of colorectal cancer patients and healthy controls from different countries, we investigated different software packages for extracting relative abundances of known microbial genomes and for integrating mapping and assembly approaches to obtain the relative abundance profiles of both known and novel genomes. The random forests (RF) classification algorithm was then used to predict colorectal cancer status based on the microbial relative abundance profiles. Based on within-dataset cross-validation and cross-dataset prediction, we show that the RF prediction performance using the microbial relative abundance profiles estimated by Centrifuge is generally higher than that using the profiles estimated by MetaPhlAn2 and Bracken. We also develop a novel method to integrate the relative abundance profiles of both known and novel microbial organisms to further increase the prediction performance for colorectal cancer from metagenomes.
Funding: Supported by the key specialty of traditional Chinese medicine promotion project.
Abstract: Background: To create and validate nomograms for the personalized prediction of survival in octogenarians with newly diagnosed non-small-cell lung cancer (NSCLC) with sole brain metastases (BMs). Methods: Random forests (RF) were applied to identify independent prognostic factors for building nomogram models. The predictive accuracy of the model was evaluated based on the receiver operating characteristic (ROC) curve, C-index, and calibration plots. Results: The area under the curve (AUC) values for overall survival at 6, 12, and 18 months in the validation cohort were 0.837, 0.867, and 0.849, respectively; the AUC values for cancer-specific survival prediction were 0.819, 0.835, and 0.818, respectively. The calibration curves visualized the accuracy of the model. Conclusion: The new nomograms have good predictive power for survival among octogenarians with sole BMs related to NSCLC.
Funding: Joint financial support from a research project on "Quantitative models for prediction of strategic mineral resources in China" (Grant No. 201211022) by the China Geological Survey, the National Natural Science Foundation of China (Grant Nos. 41372007, 41430320 & 41522206), and the Program for New Century Excellent Talents in University (Grant No. NCET-13-1016).
Abstract: Recent studies have pointed out that the widespread iron deposits in the southwestern Fujian metallogenic belt (SFMB), China, are skarn-type deposits associated with the Yanshanian granites. There is still excellent potential for mineral exploration because large areas in this belt are covered by forest. This study develops a new predictive model for mapping skarn-type Fe deposit prospectivity in this belt, using five criteria as evidence: (1) the contact zones of Yanshanian granites (GRANITE); (2) the contact zones within the late Paleozoic marine sedimentary rocks and the carbonate formations (FORMATION); (3) the NE-NNE-trending faults (FAULT); (4) the zones of skarn alteration (SKARN); and (5) the aeromagnetic anomaly (AEROMAGNETIC). The fuzzy weights of evidence (FWofE) method, developed from the classical weights of evidence (WofE) method and based on fuzzy sets and fuzzy probabilities, can provide smaller variances and more accurate posterior probabilities, can effectively minimize the uncertainty caused by omitted or wrongly assigned data, and is more flexible than WofE. It is an efficient and widely used method for mineral potential mapping. Random forests (RF) is a new and useful method for data-driven predictive mapping of mineral prospectivity and needs further scrutiny. The prospectivity results obtained with the FWofE and RF methods both show that the prediction model for the skarn-type Fe deposits in the SFMB is successful and efficient. Both methods suggest that GRANITE and FORMATION are the most valuable evidence maps, followed by SKARN, AEROMAGNETIC, and FAULT. This is consistent with the skarn-type Fe deposit mineral model in the SFMB. The unstable performance experienced when FORMATION was omitted might indicate that the highest uncertainty and risk in follow-up exploration is related to the sequences.
In addition, the performance of the RF method for skarn-type Fe deposit prospectivity in the SFMB is better than that of the FWofE method; therefore, it could be used to guide further exploration of skarn-type Fe prospects in the SFMB.
Funding: The authors would like to thank the Directorate-General of Scientific Research and Technological Development (Direction Generale de la Recherche Scientifique et du Developpement Technologique, DGRSDT, URL: www.dgrsdt.dz, Algeria) for the financial assistance towards this research.
Abstract: Purpose: Ensemble methods have been widely used in the field of pattern recognition due to the difficulty of finding a single classifier that performs well on a wide variety of problems. Despite the effectiveness of these techniques, studies have shown that ensemble methods generate a large number of hypotheses that, in most cases, contain redundant classifiers. Several state-of-the-art works attempt to reduce the set of hypotheses without affecting performance. Design/methodology/approach: In this work, the authors propose a pruning method that takes into consideration the correlation between classifiers and classes, and between each classifier and the rest of the set. The authors use the random forest algorithm as the tree-based ensemble classifier, and the pruning is performed by a technique inspired by the CFS (correlation feature selection) algorithm. Findings: The proposed method, CES (correlation-based Ensemble Selection), was evaluated on ten datasets from the UCI machine learning repository, and its performance was compared to six ensemble pruning techniques. The results showed that the proposed pruning method selects a small ensemble in less time while improving classification rates compared to the state-of-the-art methods. Originality/value: CES is a new ordering-based method that uses the CFS algorithm. CES selects, in a short time, a small sub-ensemble that outperforms the results obtained from the whole forest and from the other state-of-the-art techniques used in this study.
Abstract: Many of the best predictors for complex problems are typically regarded as hard to interpret physically. These include kernel methods, Shtarkov solutions, and random forests. We show that, despite the inability to interpret these three predictors to infinite precision, they can be asymptotically approximated and admit conceptual interpretations in terms of their mathematical/statistical properties. The resulting expressions can be in terms of polynomials, basis elements, or other functions that an analyst may regard as interpretable.