In this investigation, the Gradient Boosting (GB), Linear Regression (LR), Decision Tree (DT), and Voting algorithms were applied to predict the distribution pattern of Au geochemical data. Trace and indicator elements, including Mo, Cu, Pb, Zn, Ag, Ni, Co, Mn, Fe, and As, were used with these machine learning algorithms (MLAs) to predict Au concentration values in the Doostbigloo porphyry Cu-Au-Mo mineralization area. The performance of the models was evaluated using the Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) metrics. The proposed ensemble Voting algorithm outperformed the other models, yielding more accurate predictions according to both metrics. The predicted data from the GB, LR, DT, and Voting MLAs were modeled using the Concentration-Area fractal method, and Au geochemical anomalies were mapped. To compare and validate the results, factors such as the location of the mineral deposits, their surface extent, and the mineralization trend were considered. The results indicate that integrating hybrid MLAs with fractal modeling significantly improves geochemical prospectivity mapping. Among the four models, three (DT, GB, Voting) accurately identified both mineral deposits. The LR model, however, identified only Deposit I (central), and its mineralization trend diverged from the field data. The GB and Voting models produced similar results, with their final maps derived from fractal modeling showing the same anomalous areas. The anomaly boundaries identified by these two models are consistent with the two known reserves in the region. The results and plots related to prediction indicators and error rates for these two models also show high similarity, with lower error rates than the other models. Notably, the Voting model demonstrated superior performance in accurately delineating mineral deposit locations and identifying realistic mineralization trends while minimizing false anomalies.
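The two error metrics named in the abstract above, together with one plausible reading of the Voting ensemble, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state how the Voting model combines its base learners, so unweighted averaging is an assumption, and all numeric values are hypothetical.

```python
import math

def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error, in the same units as the target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def voting_average(*model_predictions):
    """Assumed voting rule for regression: average the per-sample predictions."""
    return [sum(values) / len(values) for values in zip(*model_predictions)]

# Hypothetical Au concentrations and base-model outputs, for illustration only.
actual = [10.0, 20.0, 40.0]
gb = [12.0, 18.0, 44.0]
lr = [8.0, 26.0, 30.0]
dt = [10.0, 22.0, 40.0]

vote = voting_average(gb, lr, dt)
vote_mape = mape(actual, vote)
vote_rmse = rmse(actual, vote)
```

MAPE is scale-free, while RMSE penalizes large individual errors more heavily, which is one reason the two are often reported together.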
Using groundwater level monitoring data from the Shuping landslide in the Three Gorges Reservoir area, the factors influencing groundwater level were selected on the basis of the response relationship between candidate factors, such as rainfall and reservoir level, and the change in groundwater level. A classification and regression tree (CART) model was then constructed from this subset of factors and used to predict the groundwater level. In verification, the predictions for the test sample were consistent with the measured values, with a mean absolute error of 0.28 m and a relative error of 1.15%. By comparison, a support vector machine (SVM) model constructed using the same set of factors gave a mean absolute error of 1.53 m and a relative error of 6.11%. This indicates that the CART model not only has better fitting and generalization ability, but also offers strong advantages in analyzing the dynamic characteristics of landslide groundwater and in screening important variables. It is an effective method for predicting groundwater level in landslides.
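The core step of a CART regression tree, as used in the abstract above, is choosing the binary split that minimizes the summed squared error of the two child nodes. The sketch below shows that step for a single predictor. It is a simplified illustration; the variable names and values are hypothetical and are not taken from the Shuping landslide dataset.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the best split of the form x <= threshold."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # no split: error of the parent node
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical reservoir levels (m) and groundwater levels (m).
reservoir_level = [145.0, 150.0, 155.0, 165.0, 170.0, 175.0]
groundwater_level = [160.0, 160.5, 161.0, 170.0, 170.5, 171.0]
threshold, error = best_split(reservoir_level, groundwater_level)
```

A full tree applies this search recursively over all candidate predictors; the predictors chosen near the root are the ones a CART analysis would flag as most important.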
To address the poor generalization ability of the back-propagation (BP) neural network in model updating hybrid tests, a novel method called the AdaBoost regression tree algorithm is introduced into the model updating procedure. During the learning phase, the regression tree is selected as the weak regression model to be trained, and multiple trained weak regression models are then integrated into a strong regression model. Finally, the training results are generated through voting by all the selected regression models. A 2-DOF nonlinear structure was numerically simulated using the online AdaBoost regression tree algorithm, with the BP neural network algorithm as a comparison. The results show that the prediction accuracy of the online AdaBoost regression tree algorithm is 48.3% higher than that of the BP neural network algorithm, which verifies that the online AdaBoost regression tree algorithm has better generalization ability. Furthermore, it can effectively eliminate the influence of weight initialization and improve the prediction accuracy of the restoring force in hybrid tests.
Increased competition, economic recession, and financial crises have led to more business failures, and researchers have consequently attempted to develop new approaches that yield more accurate and more reliable results. The classification and regression tree (CART) is one of the modeling techniques developed for this purpose. In this study, the CART method is explained and its power to predict financial failure is tested. CART is applied to data from industrial companies traded on the Istanbul Stock Exchange (ISE) between 1997 and 2007. The study finds that CART has high predictive power for financial failure one, two, and three years prior to failure, and that profitability ratios are the most important ratios in predicting failure.
Urban grid power forecasting is one of the important tasks of power system operators and helps to analyze the development trend of a city. The demand for electricity in various industries is affected by many factors, and data on the relevant influencing factors are scarce, resulting in large deviations in prediction accuracy. To improve the prediction results, this paper proposes a model based on Multi-Target Tree Regression to predict the monthly electricity consumption of different industrial structures. Because the actual electricity consumption data for Shanghai from 2013 to the first half of 2017 contain few features, we collect data on GDP growth, weather conditions, and tourism season distribution for various industries in Shanghai, and model and train on the electricity consumption data of different industries in different months. The multi-target tree regression model was tested against actual values to verify its reliability and was used to predict the monthly electricity consumption of each industry in the second half of 2017. The experimental results show that the model can accurately predict the monthly electricity consumption of various industries.
Multi-target regression is concerned with the simultaneous prediction of multiple continuous target variables based on the same set of input variables. It has received relatively little attention from the machine learning community, yet it arises in many real-world applications. In this paper we conduct extensive experiments to investigate the performance of three representative multi-target regression learning algorithms, namely Multi-Target Stacking (MTS), Random Linear Target Combination (RLTC), and Multi-Objective Random Forest (MORF), comparing them with baseline single-target learning. Our experimental results show that all three multi-target regression learning algorithms improve on the performance of single-target learning. Among them, MTS performs best, followed by RLTC and then MORF. However, single-target learning sometimes still performs very well, and is occasionally the best. This analysis sheds light on multi-target regression learning and indicates that single-target learning is a competitive baseline on multi-target domains.
Understanding the underlying structure of phylogenetic trees is very important, as it informs the methods that should be employed during phylogenetic inference. The methods used under a structured population differ from those needed when a population is not structured. In this paper, we compared two supervised machine learning techniques, artificial neural network (ANN) and logistic regression models, for predicting the underlying structure of phylogenetic trees. We carried out parameter tuning to identify the optimal models, and then performed 10-fold cross-validation on the optimal logistic regression and ANN models. We also applied an unsupervised technique, clustering, to identify the number of clusters that could be found in simulated phylogenetic trees. The trees were from both structured and non-structured populations. Clustering and classification-based prediction were done using tree statistics such as the Colless, Sackin, and cophenetic indices, among others. Results from 10-fold cross-validation revealed that the logistic regression and ANN models performed comparably, with both having average accuracy rates of over 0.75. Most of the clustering indices used gave 2 or 3 as the optimal number of clusters.
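The 10-fold cross-validation procedure mentioned in the abstract above can be sketched as follows. This is a minimal sketch under stated assumptions: the folds are contiguous and unshuffled (practical use would normally shuffle or stratify first), and `majority_predictor` is a hypothetical stand-in classifier, not the logistic regression or ANN models of the study.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validated_accuracy(labels, predict_fn, k=10):
    """Average the per-fold accuracy of predict_fn(train, test) over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(labels), k):
        predictions = predict_fn(train, test)
        correct = sum(1 for i, p in zip(test, predictions) if labels[i] == p)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

def majority_predictor(labels):
    """Hypothetical baseline: always predict the majority class of the training fold."""
    def predict(train, test):
        ones = sum(labels[i] for i in train)
        majority = 1 if 2 * ones >= len(train) else 0
        return [majority] * len(test)
    return predict

# Hypothetical labels: 1 = structured population, 0 = non-structured.
labels = [1] * 16 + [0] * 4
cv_accuracy = cross_validated_accuracy(labels, majority_predictor(labels), k=10)
```

Each sample is used for testing exactly once, so the averaged accuracy is an estimate of out-of-sample performance rather than of training fit.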
Background: Vegetation distribution maps are of great significance for nature protection and management. In diverse tropical forests, accurate spatial mapping of vegetation types is challenging; the high species diversity and abundance of rare species challenge classification concepts, while remote sensing signals may not vary systematically with species composition, complicating the technical capability for delineating vegetation types in the landscape.
Methods: We used a combination of field-based compositional data and their relations to environmental variables to predict the distribution of forest types in the Wuzhishan National Natural Reserve (WNNR), Hainan Island, China, using multivariate regression trees (MRT). The MRT was based on arboreal vegetation composition in 132 plots of 20 m × 20 m with a regular spacing of 1 km. Apart from the MRT, non-metric multidimensional scaling (NMDS) was used to evaluate vegetation-environment relationships.
Results: The MRT model worked best when using 14 key environmental variables including topography, climate, latitude, and soil, although the difference from the simpler model including only topographical variables was small. The full model classified the 132 plots into 3 vegetation types, 6 formation groups, 20 formations, and 65 associations at different hierarchical syntaxonomic levels. This model was the basis for forest vegetation maps for the WNNR. MRT and NMDS showed that elevation was the main driving force for the distribution of vegetation types and formation groups. Climate, latitude, and soil (especially available P), together with topographic variables, all influenced the distribution of formations and associations.
Conclusions: While elevation determines forest-type distributions, lower-level syntaxonomic forest classes respond to the topographic diversity typical of mountains. Apart from providing the first detailed forest vegetation map for any part of the WNNR, we show how, in spite of limitations, MRT with existing environmental data can be a useful method for mapping diverse and remote tropical forests.
Challenges in Big Data analysis arise from the way the data are recorded, maintained, processed, and stored. We demonstrate that a hierarchical, multivariate, statistical machine learning algorithm, namely the Boosted Regression Tree (BRT), can address Big Data challenges to drive decision making. The challenge in this study is a lack of interoperability, since the data, a collection of GIS shapefiles, remotely sensed imagery, and aggregated and interpolated spatio-temporal information, are stored in monolithic hardware components. For the modelling process, it was necessary to create one common input file. Merging the data sources produced a structured but noisy input file showing inconsistencies and redundancies. Here, it is shown that BRT can process different data granularities, heterogeneous data, and missingness. In particular, BRT handles missing data by default by allowing a split on whether or not a value is missing as well as on what the value is. Most importantly, BRT offers a wide range of possibilities for interpreting results, and variable selection is performed automatically by considering how frequently a variable is used to define a split in the tree. A comparison with two similar regression models (Random Forests and the Least Absolute Shrinkage and Selection Operator, LASSO) shows that BRT outperforms these in this instance. BRT can also be a starting point for sophisticated hierarchical modelling in real-world scenarios. For example, a single or ensemble BRT approach could be tested against existing models in order to improve results for a wide range of data-driven decisions and applications.
A new point-tree data structure genetic programming (PTGP) method is proposed. For the discontinuous function regression problem, the proposed method is able to identify both the function structure and the discontinuity points simultaneously. It can also be readily applied to regression problems for continuous functions. The numerical experiment results demonstrate that the point-tree GP is an efficient alternative for complex function identification problems.
The Arctic region is experiencing accelerated sea ice melt and increased iceberg detachment from glaciers due to climate change. These drifting icebergs present a risk and an engineering challenge for subsea installations traversing shallow waters, where iceberg keels may reach the seabed, potentially damaging subsea structures. Consequently, costly and time-intensive iceberg management operations, such as towing and rerouting, are undertaken to safeguard subsea and offshore infrastructure. This study therefore explores the application of extra tree regression (ETR) as a robust solution for estimating iceberg draft, particularly in the preliminary phases of decision-making for iceberg management projects. Nine ETR models were developed using parameters influencing iceberg draft. Subsequent analyses identified the most effective models and significant input variables. Uncertainty analysis revealed that the superior ETR model tended to overestimate iceberg drafts; however, it achieved the highest precision, correlation, and simplicity in estimation. Comparison with decision tree regression, random forest regression, and empirical methods confirmed the superior performance of ETR in predicting iceberg drafts.
The feasibility of constructing shallow foundations on saturated sands remains uncertain. Seismic design standards simply stipulate that geotechnical investigations for a shallow foundation on such soils shall be conducted to mitigate the effects of the liquefaction hazard. This study investigates the seismic behavior of strip foundations on typical two-layered soil profiles: a natural loose sand layer supported by a dense sand layer. Coupled nonlinear dynamic analyses have been conducted to calculate response parameters, including seismic settlement, the acceleration response on the ground surface, and excess pore pressure beneath strip foundations. A novel liquefaction potential index (LPI_footing), based on excess pore pressure ratios across a given region of soil mass beneath footings, is introduced to classify liquefaction severity into three distinct levels: minor, moderate, and severe. To validate the proposed LPI_footing, the foundation settlement is evaluated for the different liquefaction potential classes. A classification tree model has been grown to predict liquefaction susceptibility, utilizing various input variables, including earthquake intensity on the ground surface, foundation pressure, sand permeability, and top layer thickness. Moreover, a nonlinear regression function has been established to map LPI_footing in relation to these input predictors. The models have been constructed using a substantial dataset comprising 13,824 excess pore pressure ratio time histories. The performance of the developed models has been examined using various methods, including 10-fold cross-validation. The predictive capability of the tree has also been validated against existing experimental studies. The results indicate that the classification tree is not only interpretable but also highly predictive, with a testing accuracy of 78.1%. The decision tree provides valuable insights for engineers assessing liquefaction potential beneath strip foundations.
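Rapid prediction of the various mechanical properties of metal cutting is critical to design optimization and productivity improvement in industrial manufacturing. Current prediction models typically require expensive and time-consuming experiments and analyses. A prediction model based on metal cutting simulation and decision tree regression (DTR) is constructed to obtain the mechanical properties under different cutting conditions. First, adaptive smoothed particle hydrodynamics (ASPH) is used to simulate the metal cutting process, capturing multiple mechanical properties under different simulation parameters and forming a simulation dataset of 2000 cutting conditions. Second, the DTR algorithm is used to learn from the simulation dataset to train and build the metal cutting prediction model, and the performance of the prediction model under different pruning strategies is evaluated through cross-validation and grid search. The results show that the established prediction model can rapidly predict multiple mechanical properties under different simulation parameters, and that an appropriate pruning strategy can improve the accuracy, generalization ability, and stability of the prediction model.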
Tree-based models have been widely applied in both academic and industrial settings due to their natural interpretability, good predictive accuracy, and high scalability. In this paper, we focus on improving the single-tree method and propose the segmented linear regression trees (SLRT) model, which replaces the traditional constant leaf model with linear ones. From the parametric view, SLRT can be employed as a recursive change-point detection procedure for segmented linear regression (SLR) models, which is much more efficient and flexible than the traditional grid search method. Along the way, we propose to use the conditional Kendall's τ correlation coefficient to select the underlying change points. From the non-parametric view, we propose an efficient greedy splitting method that selects the splits by analyzing the association between residuals and each candidate split variable. Further, with the SLRT as a single-tree predictor, we propose a linear random forest approach that aggregates the SLRTs by a weighted average. Both simulation and empirical studies showed significant improvements over CART trees and even the random forest.
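The idea of replacing constant leaves with linear ones, described in the abstract above, can be illustrated with a single split whose two leaves are ordinary least squares lines. Note that this sketch locates the change point by exhaustive search, which corresponds to the brute-force baseline the abstract contrasts with, not to the conditional Kendall's τ procedure proposed in the paper; the function names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx > 0 else 0.0
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, sse

def best_linear_split(x, y, min_leaf=2):
    """x must be sorted ascending. Returns (split_index, left_fit, right_fit)."""
    best = None
    for i in range(min_leaf, len(x) - min_leaf + 1):
        left = fit_line(x[:i], y[:i])
        right = fit_line(x[i:], y[i:])
        total = left[2] + right[2]  # combined residual error of both linear leaves
        if best is None or total < best[0]:
            best = (total, i, left, right)
    return best[1], best[2], best[3]

# Hypothetical piecewise-linear data: y = x for x < 4, y = 18 - 2x for x >= 4.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.0, 1.0, 2.0, 3.0, 10.0, 8.0, 6.0, 4.0]
split_index, left_fit, right_fit = best_linear_split(x, y)
```

A constant-leaf tree would need many splits to approximate either segment, whereas two linear leaves recover both segments with one split, which is the motivation for linear leaf models.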
Determining the causal effect of special education is a critical topic when making educational policy that focuses on student achievement. However, current special education research faces challenges from persistent selection bias and complex confounding. Bayesian Additive Regression Trees (BART) is employed in this study to provide a flexible estimation of academic performance. Targeted Maximum Likelihood Estimation (TMLE) is also integrated into the BART model, supporting doubly robust estimation of the special education effect. This study extracted survey data from the Early Childhood Longitudinal Study, Kindergarten Class (ECLS-K), to estimate the causal impact of special education status on students' combined mathematics and reading achievement scores. The results of the BART-TMLE model show that children receiving special education services scored approximately 9 points lower on average on combined math and reading than their peers who did not receive these services, even after adjusting for a considerable number of covariates. The estimated negative treatment effect persists after controlling for observed covariates that are closely correlated with the combined test score. The negative effect likely reflects unobserved factors, such as the underlying severity of learning disabilities, parent involvement, and other potential traits, which are the actual factors that determine placement in special education, rather than indicating that special education services are ineffective. The achievement gap in academic performance reflects the current observable status of special education. The estimated effect could be improved by future research incorporating educational domain knowledge, allowing the model to be constructed more accurately.
Boreal forests play an important role in global environment systems. Understanding boreal forest ecosystem structure and function requires accurate monitoring and estimation of forest canopy and biomass. We used partial least squares regression (PLSR) models to relate forest parameters, i.e. canopy closure density and above ground tree biomass, to Landsat ETM+ data. The established models were optimized according to the variable importance for projection (VIP) criterion and the bootstrap method, and their performance was compared using several statistical indices. All variables selected by the VIP criterion passed the bootstrap test (p < 0.05). The simplified models without insignificant variables (VIP < 1) performed as well as the full model but with less computation time. The relative root mean square error (RMSE%) was 29% for canopy closure density and 58% for above ground tree biomass. We conclude that PLSR can be an effective method for estimating canopy closure density and above ground biomass.
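For reference, the relative root mean square error quoted in the abstract above is commonly defined by normalizing the RMSE by the mean of the observed values. The abstract does not state which normalizer the authors used, so the form below is an assumption; here $y_i$ are observed values, $\hat{y}_i$ are model predictions, $\bar{y}$ is the observed mean, and $n$ is the number of samples.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{RMSE\%} = 100 \times \frac{\mathrm{RMSE}}{\bar{y}},
\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

Under this definition, an RMSE% of 29% means the typical prediction error is about 29% of the average observed canopy closure density.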
Funding: Supported by the China Earthquake Administration, Institute of Seismology Foundation (IS201526246).
Funding: The National Natural Science Foundation of China (No. 51708110).
Funding: This research has been supported by the US National Science Foundation under grant IIS-1115417, the National Natural Science Foundation of China under grants 61728205 and 61472267, and the Foundation of Key Laboratory in Science and Technology Development Project of Suzhou under grant SZS201609.
Funding: Financially supported by the National Key R&D Program of China (2021YFD220040403 and 2021YFD220040304) and the China Scholarship Council (202107565021).
Abstract: Background: Vegetation distribution maps are of great significance for nature protection and management. In diverse tropical forests, accurate spatial mapping of vegetation types is challenging; the high species diversity and abundance of rare species challenge classification concepts, while remote sensing signals may not vary systematically with species composition, complicating the technical capability for delineating vegetation types in the landscape. Methods: We used a combination of field-based compositional data and their relations to environmental variables to predict the distribution of forest types in the Wuzhishan National Natural Reserve (WNNR), Hainan Island, China, using multivariate regression trees (MRT). The MRT was based on arboreal vegetation composition in 132 plots of 20 m × 20 m with a regular spacing of 1 km. In addition to the MRT, non-metric multidimensional scaling (NMDS) was used to evaluate vegetation-environment relationships. Results: The MRT model worked best when using 14 key environmental variables including topography, climate, latitude and soil, although the difference from the simpler model including only topographical variables was small. The full model classified the 132 plots into 3 vegetation types, 6 formation groups, 20 formations and 65 associations at different hierarchical syntaxonomic levels. This model was the basis for forest vegetation maps for the WNNR. MRT and NMDS showed that elevation was the main driving force for the distribution of vegetation types and formation groups. Climate, latitude, and soil (especially available P), together with topographic variables, all influenced the distribution of formations and associations. Conclusions: While elevation determines forest-type distributions, lower-level syntaxonomic forest classes respond to the topographic diversity typical of mountains. Apart from providing the first detailed forest vegetation map for any part of the WNNR, we show how, in spite of limitations, MRT with existing environmental data can be a useful method for mapping diverse and remote tropical forests.
Abstract: Challenges in Big Data analysis arise from the way the data are recorded, maintained, processed and stored. We demonstrate that a hierarchical, multivariate, statistical machine learning algorithm, namely Boosted Regression Tree (BRT), can address Big Data challenges to drive decision making. The challenge in this study is a lack of interoperability, since the data, a collection of GIS shapefiles, remotely sensed imagery, and aggregated and interpolated spatio-temporal information, are stored in monolithic hardware components. For the modelling process, it was necessary to create one common input file. Merging the data sources produced a structured but noisy input file containing inconsistencies and redundancies. Here, it is shown that BRT can process different data granularities, heterogeneous data and missingness. In particular, BRT deals with missing data by default, by allowing a split on whether or not a value is missing as well as on what the value is. Most importantly, BRT offers a wide range of possibilities for interpreting results, and variable selection is performed automatically by considering how frequently a variable is used to define a split in the tree. A comparison with two similar regression models (Random Forests and Least Absolute Shrinkage and Selection Operator, LASSO) shows that BRT outperforms these in this instance. BRT can also be a starting point for sophisticated hierarchical modelling in real-world scenarios. For example, a single or ensemble BRT approach could be tested alongside existing models in order to improve results for a wide range of data-driven decisions and applications.
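The idea of splitting "on whether or not a value is missing as well as what the value is" can be sketched for a single feature and a single split. This is a simplified illustration under stated assumptions, not the BRT implementation used in the study: missing values are encoded as `None`, the loss is the sum of squared errors around each side's mean, and the names `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (loss, threshold, missing_goes_left) for one feature.

    x may contain None for missing values. A threshold of None means the
    split is purely "is the value missing?".
    """
    observed = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    missing_y = [yi for xi, yi in zip(x, y) if xi is None]
    observed_y = [yi for _, yi in observed]

    # Candidate 1: separate missing rows from observed rows.
    best = (sse(missing_y) + sse(observed_y), None, True)

    # Candidate 2: threshold splits, sending the missing rows to either side.
    thresholds = sorted({xi for xi, _ in observed})
    for t in thresholds[:-1]:
        left = [yi for xi, yi in observed if xi <= t]
        right = [yi for xi, yi in observed if xi > t]
        for missing_left in (True, False):
            left_side = left + missing_y if missing_left else left
            right_side = right if missing_left else right + missing_y
            loss = sse(left_side) + sse(right_side)
            if loss < best[0]:
                best = (loss, t, missing_left)
    return best


# Toy data: the rows with a missing x behave like the rows with a large x.
x = [1.0, 2.0, None, 8.0, 9.0, None]
y = [1.0, 1.2, 5.0, 5.1, 4.9, 5.2]
loss, threshold, missing_left = best_split(x, y)
print(loss, threshold, missing_left)
```

In this toy data the lowest loss comes from splitting at `x <= 2.0` and routing the missing rows to the right-hand side, so no imputation step is needed before fitting.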
Funding: Supported by the National Natural Science Foundation (60173046) and the Natural Science Foundation of Province (2002AB040).
Abstract: A new point-tree data structure genetic programming (PTGP) method is proposed. For the discontinuous function regression problem, the proposed method can identify both the function structure and the discontinuity points simultaneously. It can also be readily applied to continuous function regression problems. Numerical experiments demonstrate that point-tree GP is an efficient alternative for complex function identification problems.
Abstract: The Arctic region is experiencing accelerated sea ice melt and increased iceberg detachment from glaciers due to climate change. These drifting icebergs present a risk and engineering challenge for subsea installations traversing shallow waters, where iceberg keels may reach the seabed, potentially damaging subsea structures. Consequently, costly and time-intensive iceberg management operations, such as towing and rerouting, are undertaken to safeguard subsea and offshore infrastructure. This study, therefore, explores the application of extra tree regression (ETR) as a robust solution for estimating iceberg draft, particularly in the preliminary phases of decision-making for iceberg management projects. Nine ETR models were developed using parameters influencing iceberg draft. Subsequent analyses identified the most effective models and significant input variables. Uncertainty analysis revealed that the superior ETR model tended to overestimate iceberg drafts; however, it achieved the highest precision, correlation, and simplicity in estimation. Comparison with decision tree regression, random forest regression, and empirical methods confirmed the superior performance of ETR in predicting iceberg drafts.
Abstract: The feasibility of constructing shallow foundations on saturated sands remains uncertain. Seismic design standards simply stipulate that geotechnical investigations for a shallow foundation on such soils shall be conducted to mitigate the effects of the liquefaction hazard. This study investigates the seismic behavior of strip foundations on typical two-layered soil profiles: a natural loose sand layer supported by a dense sand layer. Coupled nonlinear dynamic analyses have been conducted to calculate response parameters, including seismic settlement, the acceleration response on the ground surface, and excess pore pressure beneath strip foundations. A novel liquefaction potential index (LPI_footing), based on excess pore pressure ratios across a given region of soil mass beneath footings, is introduced to classify liquefaction severity into three distinct levels: minor, moderate, and severe. To validate the proposed LPI_footing, the foundation settlement is evaluated for the different liquefaction potential classes. A classification tree model has been grown to predict liquefaction susceptibility, utilizing various input variables, including earthquake intensity on the ground surface, foundation pressure, sand permeability, and top layer thickness. Moreover, a nonlinear regression function has been established to map LPI_footing in relation to these input predictors. The models have been constructed using a substantial dataset comprising 13,824 excess pore pressure ratio time histories. The performance of the developed models has been examined using various methods, including 10-fold cross-validation. The predictive capability of the tree has also been validated against existing experimental studies. The results indicate that the classification tree is not only interpretable but also highly predictive, with a testing accuracy of 78.1%. The decision tree provides valuable insights for engineers assessing liquefaction potential beneath strip foundations.
Abstract: Tree-based models have been widely applied in both academic and industrial settings because of their natural interpretability, good predictive accuracy, and high scalability. In this paper, we focus on improving the single-tree method and propose the segmented linear regression trees (SLRT) model, which replaces the traditional constant leaf model with linear ones. From the parametric view, SLRT can be employed as a recursive change point detection procedure for segmented linear regression (SLR) models, which is much more efficient and flexible than the traditional grid search method. Along these lines, we propose using the conditional Kendall's τ correlation coefficient to select the underlying change points. From the non-parametric view, we propose an efficient greedy splitting method that selects the splits by analyzing the association between residuals and each candidate split variable. Further, with the SLRT as a single-tree predictor, we propose a linear random forest approach that aggregates the SLRTs by a weighted average. Both simulation and empirical studies showed significant improvements over CART trees and even the random forest.
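For context, the "traditional grid search method" that the abstract contrasts SLRT against can be sketched for a single change point and one predictor. This is a baseline illustration only, not the SLRT algorithm or its Kendall's τ criterion; it assumes a separate least-squares line on each side of the split, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def grid_search_change_point(xs, ys, min_size=2):
    """Try every split of the x-sorted data; keep the lowest total SSE.

    Returns (total_sse, change_point), where change_point is the largest
    x value assigned to the left segment.
    """
    pairs = sorted(zip(xs, ys))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for i in range(min_size, len(xs) - min_size + 1):
        _, _, sse_left = fit_line(xs[:i], ys[:i])
        _, _, sse_right = fit_line(xs[i:], ys[i:])
        total = sse_left + sse_right
        if best is None or total < best[0]:
            best = (total, xs[i - 1])
    return best


# Toy data: slope +1 up to x = 4, then a jump followed by slope -2.
xs = list(range(10))
ys = [float(x) if x < 5 else 20.0 - 2.0 * x for x in xs]
total_sse, change_point = grid_search_change_point(xs, ys)
print(total_sse, change_point)
```

Every candidate split requires two refits, so the cost grows quickly with the number of observations and change points, which is the inefficiency a recursive procedure aims to avoid.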
Abstract: Determining the causal effect of special education is a critical topic when making educational policy that focuses on student achievement. However, current special education research is facing challenges from persistent selection bias and complex confounding. Bayesian Additive Regression Trees (BART) is employed in this study to provide a flexible estimation of the academic performance. Targeted Maximum Likelihood Estimation (TMLE) is also integrated into the BART model, supporting doubly robust estimation of the special education effect. This study extracted survey data from the Early Childhood Longitudinal Study, Kindergarten Class (ECLS-K), to estimate the causal impact of special education status on students' combined mathematics and reading achievement scores. The analysis results of the BART-TMLE model show that children receiving special education services demonstrated approximately 9 points lower scores on average for combined math and reading scores, even adjusting for a considerable number of covariates, compared to their peers who did not receive these services. The estimated negative treatment effect persists after controlling for observed covariates that are closely correlated to the combined test score. The negative effect likely reflects unobserved factors, such as the underlying severity of learning disabilities, parent involvement and other potential traits, which are actual factors that determine the placement of special education status, rather than indicating the ineffectiveness of special education service. The achievement gap in academic performance reflects the current observable status of special education. The estimated effect could be improved by future research incorporating educational domain knowledge, allowing the model to be constructed more accurately.
Funding: Supported by the 948 Program of the State Forestry Administration (2009-4-43) and the National Natural Science Foundation of China (No. 30870420).
Abstract: Boreal forests play an important role in global environment systems. Understanding boreal forest ecosystem structure and function requires accurate monitoring and estimation of forest canopy and biomass. We used partial least squares regression (PLSR) models to relate forest parameters, i.e. canopy closure density and above ground tree biomass, to Landsat ETM+ data. The established models were optimized according to the variable importance for projection (VIP) criterion and the bootstrap method, and their performance was compared using several statistical indices. All variables selected by the VIP criterion passed the bootstrap test (p < 0.05). The simplified models without insignificant variables (VIP < 1) performed as well as the full model but with less computation time. The relative root mean square error (RMSE%) was 29% for canopy closure density and 58% for above ground tree biomass. We conclude that PLSR can be an effective method for estimating canopy closure density and above ground biomass.
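The abstract does not define RMSE%, so the following sketch assumes one common convention: the root mean square error divided by the mean of the observed values, expressed as a percentage. The function names and the toy numbers are illustrative only and are not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_rmse_percent(observed, predicted):
    """RMSE as a percentage of the mean observed value (assumed definition)."""
    mean_observed = sum(observed) / len(observed)
    return 100.0 * rmse(observed, predicted) / mean_observed


# Made-up values, e.g. biomass in arbitrary units.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(round(relative_rmse_percent(observed, predicted), 2))
```

Normalizing by the observed mean makes errors comparable across variables with different units, such as a closure fraction and a biomass quantity.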