A method for fast l-fold cross validation is proposed for the regularized extreme learning machine (RELM). The computational time of fast l-fold cross validation increases as the fold number decreases, which is the opposite of naive l-fold cross validation. Fast l-fold cross validation therefore holds the advantage in computational time, especially for large fold numbers such as l > 20. To corroborate the efficacy and feasibility of fast l-fold cross validation, experiments are conducted on five benchmark regression data sets.
Funding: supported by the National Natural Science Foundation of China (51006052) and the NUST Outstanding Scholar Supporting Program.
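As a baseline for the comparison above, the sketch below implements naive l-fold cross validation for an RELM-style model (a random hidden layer plus a ridge-regularized output layer); the paper's fast algorithm itself is not reproduced, and the hidden-layer width, activation, and data are illustrative assumptions.

```python
import numpy as np

def relm_fit(H, y, lam):
    """Ridge solution for the ELM output weights: beta = (H'H + lam*I)^{-1} H'y."""
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

def naive_lfold_cv(X, y, l=10, lam=1e-2, n_hidden=50, seed=0):
    """Naive l-fold CV for an RELM: retrains the output weights l times."""
    rng = np.random.default_rng(seed)
    # Random, fixed hidden layer (the defining trait of an ELM).
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)
    folds = np.array_split(rng.permutation(len(y)), l)
    mse = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        beta = relm_fit(H[train_idx], y[train_idx], lam)
        mse += np.sum((H[test_idx] @ beta - y[test_idx]) ** 2)
    return mse / len(y)

X = np.random.default_rng(1).standard_normal((200, 5))
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(2).standard_normal(200)
print(naive_lfold_cv(X, y, l=20))
```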
Background: A random multiple-regression model that simultaneously fits all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction using whole-genome data. Leave-one-out cross validation can be used to quantify the predictive ability of a statistical model. Methods: Naive application of leave-one-out cross validation is computationally intensive because the training and validation analyses need to be repeated n times, once for each observation. Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis. Results: The efficient leave-one-out cross validation strategies are 786 times faster than the naive application for a simulated dataset with 1,000 observations and 10,000 markers, and 99 times faster with 1,000 observations and 100 markers. These efficiencies relative to the naive approach using the same model will increase with the number of observations. Conclusions: Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.
Funding: supported by the US Department of Agriculture, Agriculture and Food Research Initiative, National Institute of Food and Agriculture Competitive Grant No. 2015-67015-22947.
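The efficiency in the Results rests on a standard identity for linear smoothers such as ridge/BLUP models: the leave-one-out residual equals the full-data residual divided by 1 - h_ii, where h_ii is the i-th diagonal of the hat matrix. A minimal sketch for ridge regression, verifying the shortcut against the naive n-refit loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 60, 8, 1.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.3 * rng.standard_normal(n)

# Single fit: hat matrix H = X (X'X + lam*I)^{-1} X'
A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
H = X @ A
e = y - H @ y                       # residuals from the single full-data fit
press_fast = np.sum((e / (1.0 - np.diag(H))) ** 2)

# Naive leave-one-out CV: n separate refits
press_naive = 0.0
for i in range(n):
    m = np.ones(n, dtype=bool); m[i] = False
    beta = np.linalg.solve(X[m].T @ X[m] + lam * np.eye(p), X[m].T @ y[m])
    press_naive += (y[i] - X[i] @ beta) ** 2

print(press_fast, press_naive)      # identical up to rounding
```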
In regression, despite both being aimed at estimating the Mean Squared Prediction Error (MSPE), Akaike's Final Prediction Error (FPE) and the Generalized Cross Validation (GCV) selection criteria are usually derived from two quite different perspectives. Here, settling on the most commonly accepted definition of the MSPE as the expectation of the squared prediction error loss, we provide theoretical expressions for it, valid for any linear model (LM) fitter, under random or non-random designs. Specializing these MSPE expressions for each design type, we derive closed formulas of the MSPE for some of the most popular LM fitters: Ordinary Least Squares (OLS), with or without a full column rank design matrix; and Ordinary and Generalized Ridge regression, the latter embedding smoothing spline fitting. For each of these LM fitters, we then deduce a computable estimate of the MSPE which turns out to coincide with Akaike's FPE. Using a slight variation, we similarly obtain a class of MSPE estimates coinciding with the classical GCV formula for those same LM fitters.
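For reference, the classical forms that these two criteria take (standard textbook expressions, not a reconstruction of the paper's derivations): for full-rank OLS with $p$ regressors and residual sum of squares $\mathrm{RSS}=\lVert y-\hat y\rVert^{2}$,

$$\mathrm{FPE}=\frac{\mathrm{RSS}}{n}\cdot\frac{n+p}{n-p},$$

while for any linear fitter $\hat y = Hy$,

$$\mathrm{GCV}=\frac{\mathrm{RSS}/n}{\bigl(1-\operatorname{tr}(H)/n\bigr)^{2}},$$

with $\operatorname{tr}(H)$ playing the role of the effective number of parameters ($\operatorname{tr}(H)=p$ for full-rank OLS).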
Frequentist model averaging has received much attention from econometricians and statisticians in recent years. A key problem with frequentist model average estimators is the choice of weights. This paper develops a new approach to choosing weights based on an approximation of generalized cross validation. The resultant least squares model average estimators are proved to be asymptotically optimal in the sense of achieving the lowest possible squared errors. In particular, the optimality is established under both discrete and continuous weight sets. Compared with the existing approach based on the Mallows criterion, the conditions required for the asymptotic optimality of the proposed method are more reasonable. Simulation studies and a real data application show the good performance of the proposed estimators.
Funding: supported by the National Key R&D Program of China (2020AAA0105200), the Ministry of Science and Technology of China (Grant No. 2016YFB0502301), the National Natural Science Foundation of China (Grant Nos. 11871294, 12031016, 11971323, 71925007, 72042019, 72091212 and 12001559), and a joint grant from the Academy for Multidisciplinary Studies, Capital Normal University.
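The abstract does not give the weight criterion in closed form; as a hedged illustration, the sketch below minimizes a natural GCV-type criterion over a discrete weight set, using the weighted hat matrix of two nested OLS candidates. The specific criterion, candidates, and grid are assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = rng.standard_normal((n, 4))
y = X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(n)

# Two nested candidate OLS models and their hat matrices.
def hat(Xm):
    return Xm @ np.linalg.solve(Xm.T @ Xm, Xm.T)

hats = [hat(X[:, :1]), hat(X[:, :3])]

def gcv(w):
    Hw = sum(wi * Hi for wi, Hi in zip(w, hats))   # weighted hat matrix
    rss = np.sum((y - Hw @ y) ** 2)
    return (rss / n) / (1.0 - np.trace(Hw) / n) ** 2

# Discrete weight set: grid on the simplex with step 0.05.
grid = np.arange(0.0, 1.0001, 0.05)
best = min(((w, 1.0 - w) for w in grid), key=gcv)
print("weights:", best, "GCV:", gcv(best))
```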
Statistical machine learning models should be evaluated and validated before being put to work. The conventional k-fold Monte Carlo cross-validation (MCCV) procedure uses a pseudo-random sequence to partition instances into k subsets, which usually causes subsampling bias, inflates generalization errors and jeopardizes the reliability and effectiveness of cross-validation. Based on ordered systematic sampling theory in statistics and low-discrepancy sequence theory in number theory, we propose a new k-fold cross-validation procedure that replaces the pseudo-random sequence with a best-discrepancy sequence, which ensures low subsampling bias and leads to more precise expected-prediction-error (EPE) estimates. Experiments with 156 benchmark datasets and three classifiers (logistic regression, decision tree and naïve Bayes) show that, in general, our cross-validation procedure can reduce the subsampling bias in the MCCV, lowering the EPE by around 7.18% and the variances by around 26.73%. In comparison, the stratified MCCV can reduce the EPE and variances of the MCCV by around 1.58% and 11.85%, respectively. Leave-one-out (LOO) can lower the EPE by around 2.50%, but its variances are much higher than those of any other cross-validation (CV) procedure. The computational time of our cross-validation procedure is just 8.64% of the MCCV, 8.67% of the stratified MCCV and 16.72% of the LOO. Experiments also show that our approach is more beneficial for datasets characterized by relatively small size and large aspect ratio, which makes it particularly pertinent for bioscience classification problems. The proposed systematic subsampling technique can be generalized to other machine learning algorithms that involve random subsampling mechanisms.
Funding: supported by the Qilu Youth Scholar Project of Shandong University, the National Natural Science Foundation of China (Grant No. 11531008), the Ministry of Education of China (Grant No. IRT16R43), and the Taishan Scholar Project of Shandong Province.
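The abstract does not specify the best-discrepancy construction; the sketch below uses the golden-ratio Kronecker sequence, a common low-discrepancy stand-in, to stripe ordered instances across folds. The ordering rule and sequence choice are assumptions.

```python
import numpy as np

def lowdisc_kfold_indices(n, k, order=None):
    """Assign fold labels via the golden-ratio Kronecker sequence {i*phi},
    a low-discrepancy stand-in for the paper's best-discrepancy sequence."""
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    u = np.mod(np.arange(1, n + 1) * phi, 1.0)   # low-discrepancy points in [0,1)
    folds = np.floor(u * k).astype(int)
    if order is not None:                        # e.g., instances sorted by a score
        labels = np.empty(n, dtype=int)
        labels[order] = folds
        return labels
    return folds

# Instances ordered by their target value (the systematic-sampling flavour),
# then striped across the k folds by the sequence.
y = np.random.default_rng(0).standard_normal(50)
labels = lowdisc_kfold_indices(len(y), k=5, order=np.argsort(y))
print(np.bincount(labels))   # near-equal fold sizes, low subsampling bias
```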
Jacket platforms constitute the foundational infrastructure of offshore oil and gas field exploitation. Efficiently and accurately monitoring the mechanical properties of jacket structures is one of the key problems to be solved to ensure the safe operation of the platform. To address the practical engineering problem that the stress response of the tubular joints of jacket platforms is difficult to monitor online, a digital twin reduced-order method for real-time prediction of the stress response of tubular joints is proposed. In the offline construction phase, multi-scale modeling and multi-parameter experimental design methods are used to obtain the stress response data set of the jacket structure. Proper orthogonal decomposition is employed to extract the main feature information from the snapshot matrix, resulting in a reduced-order basis. The leave-one-out cross-validation method is used to select the optimal modal order for constructing the reduced-order model (ROM). In the online prediction phase, a digital twin model of the tubular joint is established, and the prediction performance of the ROM is analyzed and verified using random environmental loads and field environmental monitoring data. The results indicate that, compared with traditional numerical simulations of tubular joints, the ROM based on the proposed reduced-order method predicts the stress response of tubular joints more efficiently while ensuring accuracy and robustness.
Funding: financially supported by the National Natural Science Foundation of China (Grant No. 11472076).
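One way to realize the leave-one-out selection of the modal order is to project each left-out snapshot onto the POD basis built from the remaining snapshots; a minimal sketch with synthetic snapshots follows. The selection tolerance is an assumption, since the LOO projection error decreases monotonically with the order and then plateaus at the noise floor.

```python
import numpy as np

def loo_pod_error(S, r):
    """Leave-one-snapshot-out reconstruction error of a rank-r POD basis.
    S: (n_dof, n_snap) snapshot matrix."""
    n_snap = S.shape[1]
    err = 0.0
    for j in range(n_snap):
        keep = np.delete(np.arange(n_snap), j)
        U, _, _ = np.linalg.svd(S[:, keep], full_matrices=False)
        Ur = U[:, :r]                      # reduced-order basis from remaining snapshots
        proj = Ur @ (Ur.T @ S[:, j])       # project the left-out snapshot
        err += np.linalg.norm(S[:, j] - proj) ** 2
    return err / n_snap

# Synthetic snapshots with an intrinsic rank of ~3.
rng = np.random.default_rng(0)
S = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 40))
S += 1e-3 * rng.standard_normal(S.shape)

errors = [loo_pod_error(S, r) for r in range(1, 9)]
# Pick the smallest order whose LOO error is within 5% of the plateau value.
best_r = next(r for r, e in enumerate(errors, start=1) if e < 1.05 * errors[-1])
print(best_r, errors)
```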
PM1.0, particulate matter with an aerodynamic diameter smaller than 1.0 μm, can adversely affect human health. However, fewer stations are capable of measuring PM1.0 concentrations than PM2.5 and PM10 concentrations in real time (only 9 locations for PM1.0 vs. 623 locations for PM2.5 or PM10) in South Korea, making it impossible to conduct a nationwide health risk analysis of PM1.0. Thus, this study aimed to develop a PM1.0 prediction model using a random forest algorithm based on PM1.0 data from the nine measurement stations and various environmental input factors. Cross validation, in which the model was trained on eight stations and tested on the remaining station, achieved an average R² of 0.913. The high R² value achieved under mutually exclusive training and test locations in the cross validation can be ascribed to the fact that all the locations had similar relationships between PM1.0 and the input factors, which were captured by our model. Moreover, feature importance analysis showed that PM2.5 and PM10 concentrations were the two most important input features in predicting PM1.0 concentration. Finally, the model was used to estimate the PM1.0 concentrations in 623 locations where input factors such as PM2.5 and PM10 can be obtained. Based on the augmented profile, we identified Seoul and Ansan as PM1.0 concentration hotspots; these regions are large cities or centers of anthropogenic and industrial activity. The proposed model and the augmented PM1.0 profiles can be used for large epidemiological studies to understand the health impacts of PM1.0.
Funding: supported by the Fine Particle Research Initiative in East Asia Considering National Differences Project through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (No. NRF-2023M3G1A1090660), and by a grant from the National Institute of Environmental Research (NIER), funded by the Ministry of Environment of the Republic of Korea (No. NIER-2023-04-02-056).
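The station-wise cross validation described above corresponds to scikit-learn's LeaveOneGroupOut with stations as groups; a hedged sketch with synthetic stand-ins for the stations and input factors:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n = 900
station = rng.integers(0, 9, n)                  # 9 measurement stations
pm25 = rng.uniform(5, 80, n)
pm10 = pm25 * rng.uniform(1.2, 2.0, n)
met = rng.standard_normal((n, 3))                # stand-ins for meteorological inputs
X = np.column_stack([pm25, pm10, met])
pm1 = 0.7 * pm25 + 0.05 * pm10 + rng.normal(0, 2, n)   # synthetic PM1.0 target

model = RandomForestRegressor(n_estimators=200, random_state=0)
r2 = cross_val_score(model, X, pm1, groups=station,
                     cv=LeaveOneGroupOut(), scoring="r2")
print(r2.mean())                                 # one R^2 per held-out station
```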
Background: Diabetes is one of the fastest-rising chronic illnesses worldwide, and early detection is crucial for reducing complications. Traditional machine learning models often struggle with imbalanced data and achieve only moderate accuracy. To overcome these limitations, we propose a SMOTE-based ensemble boosting strategy (SMOTEBEnDi) for more accurate diabetes classification. Methods: The framework uses the Pima Indians diabetes dataset (PIDD), consisting of eight clinical features. Preprocessing steps included normalization, feature relevance analysis, and handling of missing values. The class imbalance was corrected using the synthetic minority oversampling technique (SMOTE), and multiple classifiers such as K-nearest neighbor (KNN), decision tree (DT), random forest (RF), and support vector machine (SVM) were ensembled in a boosting architecture. Hyperparameter tuning with k-fold cross validation was applied to ensure robust performance. Results: Experimental analysis showed that the proposed SMOTEBEnDi model achieved 99.5% accuracy, 99.39% sensitivity, and 99.59% specificity, outperforming baseline classifiers and demonstrating near-perfect detection. The improvements in performance metrics such as area under the curve (AUC), precision, and specificity confirm the effectiveness of addressing class imbalance. Conclusion: The study demonstrates that combining SMOTE with ensemble boosting greatly enhances early diabetes detection. This reduces diagnostic errors, supports clinicians in timely intervention, and can serve as a strong base for computer-aided diagnostic tools. Future work should extend this framework to real-time prediction systems, integrate it with IoT health devices, and adapt it across diverse clinical datasets to improve generalization and trust in real healthcare settings.
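A minimal sketch of the key methodological point: oversampling with SMOTE inside the cross-validation loop, so that synthetic minority samples never leak into the validation folds. The dataset, boosting ensemble, and hyperparameters below are illustrative stand-ins, not the authors' SMOTEBEnDi configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Imbalanced stand-in for the 8-feature PIDD data (labels are synthetic).
X, y = make_classification(n_samples=768, n_features=8, weights=[0.65, 0.35],
                           random_state=0)

# SMOTE inside the pipeline so oversampling touches only the training folds.
clf = Pipeline([("smote", SMOTE(random_state=0)),
                ("boost", AdaBoostClassifier(n_estimators=200, random_state=0))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```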
Machine learning (ML) algorithms are frequently used in landslide susceptibility modeling. Different data handling strategies may generate variations in landslide susceptibility modeling, even when using the same ML algorithm. This research compares combinations of inventory data handling, cross validation (CV), and hyperparameter tuning strategies for generating landslide susceptibility maps; the results are expected to provide a general strategy for landslide susceptibility modeling using ML techniques. The authors employed eight landslide inventory data handling scenarios to convert a landslide polygon into a landslide point: the landslide point is located on the toe (minimum height), on the scarp (maximum height), at the center of the landslide, randomly inside the polygon (1 point), randomly inside the polygon (3 points), randomly inside the polygon (5 points), randomly inside the polygon (10 points), or by 15 m grid sampling. Random forest models using CV with non-spatial hyperparameter tuning, spatial CV with spatial hyperparameter tuning, and spatial CV with forward feature selection and no hyperparameter tuning were applied for each data handling strategy. The combinations generated 24 random forest ML workflows, which were applied using a complete inventory of 743 landslides triggered by Tropical Cyclone Cempaka (2017) in Pacitan Regency, Indonesia, and 11 landslide controlling factors. The results show that grid sampling with spatial CV and spatial hyperparameter tuning is favorable because this strategy can minimize overfitting, generate a relatively high-performance predictive model, and reduce the appearance of susceptibility artifacts in the landslide area. Careful data inventory handling, CV, and hyperparameter tuning strategies should be considered in landslide susceptibility modeling to increase the applicability of landslide susceptibility maps in practical applications.
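A hedged sketch of the spatial-CV idea: fold membership is blocked by spatial clusters of the sample coordinates rather than by random rows, so test points are not near neighbours of training points. The clustering rule, factors, and labels below are synthetic assumptions, not the paper's workflow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 743 * 2                                   # landslide + non-landslide points
coords = rng.uniform(0, 50, (n, 2))           # easting/northing (synthetic)
X = rng.standard_normal((n, 11))              # 11 controlling factors (stand-ins)
y = rng.integers(0, 2, n)

# Spatial CV: block the folds by spatial clusters instead of random rows.
blocks = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
auc = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                      groups=blocks, cv=GroupKFold(n_splits=5),
                      scoring="roc_auc")
print(auc)
```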
A statistical-dynamical model for forecasting Chinese landfalling tropical cyclones (CLTCs) was developed based on the empirical relationship between the observed CLTC variability and the hindcast atmospheric circulations from the Pusan National University coupled general circulation model (PNU-CGCM). Over the last 31 years, CLTCs have shown strong year-to-year variability, with a maximum frequency in 1994 and a minimum frequency in 1987; these features were well forecast by the model. A cross-validation test showed a high correlation between the observed index and the forecast CLTC index, with a coefficient of 0.71, while the relative error percentage (16.3%) and root-mean-square error (1.07) were low. The coupled model therefore performs well in forecasting CLTCs and has potential for the dynamical forecasting of tropical cyclone landfall.
Funding: supported by the Chinese Academy of Sciences key program (Grant No. KZCX2-YW-Q03-3), the Korea Meteorological Administration Research and Development Program (Grant No. CATER 2009-1147), the Korea Rural Development Administration Research and Development Program, and the National Basic Research Program of China (Grant No. 2009CB421406).
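A cross-validation test of this kind is typically a leave-one-year-out hindcast; a toy sketch with a synthetic predictor and index (the regression form and data are assumptions) shows how correlation and RMSE skill scores like those reported above are computed:

```python
import numpy as np

rng = np.random.default_rng(0)
years = 31
circ = rng.standard_normal(years)                 # hindcast circulation predictor
cltc = 2.0 * circ + rng.normal(0, 1.5, years)     # observed landfall index

pred = np.empty(years)
for t in range(years):
    m = np.ones(years, dtype=bool); m[t] = False
    a, b = np.polyfit(circ[m], cltc[m], 1)        # refit regression without year t
    pred[t] = a * circ[t] + b                     # forecast the withheld year

r = np.corrcoef(pred, cltc)[0, 1]
rmse = np.sqrt(np.mean((pred - cltc) ** 2))
print(r, rmse)
```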
With the rapid development of urban rail transit, maglev trains, with their comfortable, energy-saving and environmentally friendly merits, have gradually entered people's horizons. In this paper, aiming at improving the aerodynamic performance of an urban maglev train, an aerodynamic optimization design is performed. An improved two-point infill criterion is adopted to construct the cross-validated Kriging model. Meanwhile, a multi-objective genetic algorithm and a complex three-dimensional geometric parametrization method are used to optimize the streamlined head of the train, and several optimal shapes are obtained. Results reveal that the optimization strategy used in this paper is sufficiently accurate and time-efficient for the optimization of the urban maglev train and can be applied in practical engineering. Compared with the prototype of the train, the optimal shape benefits from higher lift of the leading car and smaller drag of the whole train. Sensitivity analysis reveals that the length and height of the streamlined head have a great influence on the aerodynamic performance of the train, and strong nonlinear relationships exist between these design variables and the aerodynamic performance. The conclusions drawn in this study offer critical reference values for the optimization of the aerodynamic characteristics of urban maglev trains.
Funding: supported by the Advanced Rail Transportation Special Plan in the National Key Research and Development Project (Grants 2016YFB1200601-B13 and 2016YFB1200602-09) and the Youth Innovation Promotion Association CAS (2019020).
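Cross-validating the Kriging surrogate before coupling it to the genetic algorithm can be sketched with a Gaussian-process regressor standing in for the Kriging model; the design variables, response, and kernel below are assumptions, not the paper's CFD data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Surrogate for one aerodynamic objective over two head-shape variables
# (length, height); the response below is a synthetic stand-in for CFD runs.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (30, 2))                # normalized design variables
drag = 1.0 + 0.8 * X[:, 0] ** 2 - 0.5 * X[:, 1] + 0.02 * rng.standard_normal(30)

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF([0.3, 0.3]),
                              normalize_y=True, alpha=1e-6)
# Leave-one-out cross validation of the surrogate before handing it to the
# multi-objective genetic algorithm (MSE scoring, since R^2 needs >1 test point).
mse = cross_val_score(gp, X, drag, cv=LeaveOneOut(),
                      scoring="neg_mean_squared_error")
print(-mse.mean())
```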
Slope stability prediction is a complex nonlinear system problem, and slope stability prediction work often encounters low prediction-model accuracy and blind data preprocessing. Based on 77 field cases, 5 quantitative indicators are selected to improve the accuracy of prediction models for slope stability: slope angle, slope height, internal friction angle, cohesion and unit weight of rock and soil. Potential data aggregation in the prediction of slope stability is analyzed and visualized using six dimension-reduction methods, namely principal component analysis (PCA), kernel PCA, factor analysis (FA), independent component analysis (ICA), non-negative matrix factorization (NMF) and t-SNE (t-distributed stochastic neighbor embedding). Combined with classic machine learning methods, 7 prediction models for slope stability are established, and their reliabilities are examined by random cross validation. In addition, the significance of each indicator in the prediction of slope stability is discussed using the coefficient of variation method. The results show that dimension reduction is unnecessary for the data processing of the slope stability prediction models established in this paper. Random forest (RF), support vector machine (SVM) and k-nearest neighbour (KNN) achieve the best prediction accuracy, which is higher than 90%; the decision tree (DT) achieves an accuracy of 86%. The most important factor influencing slope stability is slope height, while the unit weight of rock and soil is the least significant. RF and SVM models have the best accuracy and superiority in slope stability prediction. The results provide a new approach to slope stability prediction in geotechnical engineering.
Funding: supported by the National Natural Science Foundation of China (No. 52174114) and the State Key Laboratory of Hydroscience and Engineering of Tsinghua University (No. 61010101218).
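A compact way to reproduce the comparison logic, cross-validated accuracy with and without each dimension-reduction step, is sketched below. t-SNE is omitted because it has no out-of-sample transform, and the data are synthetic stand-ins for the 77 field cases.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF, FactorAnalysis, KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (77, 5))              # 5 quantitative slope indicators
y = rng.integers(0, 2, 77)                  # stable / unstable labels (synthetic)

reducers = {"raw": None, "PCA": PCA(n_components=3),
            "KernelPCA": KernelPCA(n_components=3, kernel="rbf"),
            "FA": FactorAnalysis(n_components=3),
            "ICA": FastICA(n_components=3, random_state=0),
            "NMF": NMF(n_components=3, max_iter=1000)}

for name, red in reducers.items():
    # MinMaxScaler keeps inputs non-negative, as NMF requires.
    steps = [MinMaxScaler()] + ([red] if red else []) + \
            [RandomForestClassifier(random_state=0)]
    acc = cross_val_score(make_pipeline(*steps), X, y, cv=5).mean()
    print(f"{name}: {acc:.2f}")             # compare CV accuracy per reducer
```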
Predicting neuron growth is valuable for understanding the morphology of neurons and is thus helpful in the research of neuron classification. This study proposes a new method of predicting the growth of human neurons using 1,907 sets of data on human brain pyramidal neurons obtained from the NeuroMorpho.Org website. First, we analyzed the neurons in the morphology field and used an expectation-maximization algorithm to partition the neurons into six clusters. Second, a naive Bayes classifier was used to verify the accuracy of the expectation-maximization algorithm; experimental results showed that the cluster groups were efficient and feasible. Finally, a new method of ranking the six expectation-maximization clustered classes was used to predict the growth of human pyramidal neurons.
Funding: supported by the National Natural Science Foundation of China, No. 10872069.
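The two-step procedure maps naturally onto scikit-learn: Gaussian-mixture EM clustering followed by a naive Bayes check of the cluster labels. The features below are synthetic stand-ins for the morphological data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-ins for morphological features of pyramidal neurons.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=mu, scale=0.5, size=(300, 4))
               for mu in rng.uniform(-4, 4, (6, 4))])

# Step 1: expectation-maximization clustering into six groups.
labels = GaussianMixture(n_components=6, random_state=0).fit_predict(X)

# Step 2: verify cluster coherence with a naive Bayes classifier; high CV
# accuracy on the EM labels indicates well-separated clusters.
acc = cross_val_score(GaussianNB(), X, labels, cv=5).mean()
print(acc)
```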
To improve the anti-noise performance of the time-domain Bregman iterative algorithm, an adaptive frequency-domain Bregman sparse-spike deconvolution algorithm is proposed. By solving the Bregman algorithm in the frequency domain, the influence of Gaussian as well as outlier noise on the convergence of the algorithm is effectively avoided; in other words, the proposed algorithm avoids data noise effects by implementing the calculations in the frequency domain. Moreover, the computational efficiency is greatly improved compared with the conventional method. Generalized cross validation is introduced in the solving process to optimize the regularization parameter, equipping the algorithm with strong self-adaptation. Different theoretical models are built and solved using the algorithms in both the time and frequency domains. Finally, the proposed and conventional methods are both used to process actual seismic data. The comparison of the results confirms the superiority of the proposed algorithm owing to its noise resistance and self-adaptation capability.
Funding: supported by the National Natural Science Foundation of China (No. NSFC 41204101), the Open Projects Fund of the State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation (No. PLN201733), the Youth Innovation Promotion Association of the Chinese Academy of Sciences (No. 2015051), and the Open Projects Fund of the Natural Gas and Geology Key Laboratory of Sichuan Province (No. 2015trqdz03).
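The GCV step can be illustrated on the frequency-domain deconvolution problem itself: for a circulant, Tikhonov-regularized solver, the filter factors are |W_k|^2/(|W_k|^2+λ) and GCV has a closed form over the spectrum. The sketch below selects λ by GCV for a Ricker-wavelet toy model; it shows only this baseline, not the Bregman iteration or the paper's adaptive scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
r = np.zeros(n); r[rng.choice(n, 8, replace=False)] = rng.uniform(-1, 1, 8)  # spikes
t = np.arange(-32, 32) * 0.004
w = (1 - 2 * (np.pi * 25 * t) ** 2) * np.exp(-(np.pi * 25 * t) ** 2)  # 25 Hz Ricker
W = np.fft.fft(np.roll(np.pad(w, (0, n - w.size)), -32))              # zero-phase operator
d = np.fft.ifft(W * np.fft.fft(r)).real + 0.02 * rng.standard_normal(n)
D = np.fft.fft(d)

def gcv(lam):
    """GCV for circulant Tikhonov deconvolution; residual spectrum D*lam/(|W|^2+lam)."""
    g = np.abs(W) ** 2
    resid = np.sum(np.abs(D * lam / (g + lam)) ** 2) / n
    return resid / np.mean(lam / (g + lam)) ** 2

lams = np.logspace(-4, 2, 60)
lam = lams[np.argmin([gcv(v) for v in lams])]         # self-adaptive parameter choice
r_hat = np.fft.ifft(np.conj(W) * D / (np.abs(W) ** 2 + lam)).real
print(lam, np.linalg.norm(r_hat - r) / np.linalg.norm(r))
```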
Pattern classification is an important field in machine learning, and the least squares support vector machine (LSSVM) is a powerful tool for pattern classification. A new version of LSSVM, SVD-LSSVM, is proposed to save the time spent selecting hyperparameters for LSSVM. SVD-LSSVM is trained through singular value decomposition (SVD) of the kernel matrix. The cross validation time for selecting hyperparameters can be saved because a new hyperparameter, the singular value contribution rate (SVCR), replaces the penalty factor of LSSVM. Several UCI benchmark data sets and the olive classification problem were used to test SVD-LSSVM. The results showed that SVD-LSSVM performs well in classification and saves time on cross validation.
Funding: Project (No. 20276063) supported by the National Natural Science Foundation of China.
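A minimal sketch of the SVD-trained idea, assuming an RBF kernel and omitting the LSSVM bias term for brevity: keep the smallest number of singular triplets of the kernel matrix whose cumulative contribution reaches the SVCR threshold, then solve on that subspace.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svd_lssvm_fit(X, y, svcr=0.99, gamma=0.5):
    """Train via SVD of the kernel matrix: the SVCR threshold replaces the
    LSSVM penalty factor, so no penalty-factor cross validation is needed."""
    K = rbf_kernel(X, X, gamma)
    U, s, Vt = np.linalg.svd(K)
    r = np.searchsorted(np.cumsum(s) / s.sum(), svcr) + 1
    alpha = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])     # truncated pseudo-inverse solve
    return alpha, X, gamma

def predict(model, Xnew):
    alpha, Xtr, gamma = model
    return np.sign(rbf_kernel(Xnew, Xtr, gamma) @ alpha)

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 2))
y = np.sign(X[:, 0] * X[:, 1])                        # XOR-style labels
model = svd_lssvm_fit(X, y, svcr=0.98)
print((predict(model, X) == y).mean())
```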
For practical engineering structures, it is usually difficult to measure the external load distribution directly, which makes inverse load identification important. Load identification is a typical inverse problem, for which the models (e.g., the response matrix) are often ill-posed, resulting in degraded accuracy and impaired noise immunity. This study identifies external loads on a stiffened plate structure by comparing the effectiveness of different methods for parameter selection in regularization problems, including the Generalized Cross Validation (GCV) method, the Ordinary Cross Validation method and the truncated singular value decomposition method. With demonstrated high accuracy, the GCV method is used to identify concentrated loads in three different directions (vertical, lateral and longitudinal) exerted on a stiffened plate. The results show that the GCV method can effectively identify multi-source static loads, with relative errors of less than 5%. Moreover, under swept-frequency excitation, when the excitation frequency is near the natural frequency of the structure, the GCV method achieves much higher accuracy than direct inversion; at other excitation frequencies, the average recognition error of GCV-based load identification is less than 10%.
Funding: National Key R&D Program of China (2018YFA0702800), National Natural Science Foundation of China (12072056), the Fundamental Research Funds for the Central Universities (DUT19LK49), and the Nantong Science and Technology Plan Project (No. MS22019016).
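The role of regularization can be seen on a toy ill-conditioned response matrix (synthetic, not the stiffened plate model): direct inversion amplifies measurement noise through the small singular values, while a truncated SVD solve, one of the methods compared above, suppresses it.

```python
import numpy as np

# Ill-posed load identification toy: recover loads z from responses f = G z + noise.
rng = np.random.default_rng(0)
n = 40
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -8, n)                      # rapidly decaying singular values
G = U @ np.diag(s) @ V.T                       # synthetic response matrix
z_true = np.sin(np.linspace(0, np.pi, n))
f = G @ z_true + 1e-4 * rng.standard_normal(n)

z_direct = np.linalg.solve(G, f)               # noise blown up by the small sigma_i
k = np.sum(s > 1e-3)                           # keep only well-conditioned modes
z_tsvd = V[:, :k] @ ((U[:, :k].T @ f) / s[:k])

rel = lambda z: np.linalg.norm(z - z_true) / np.linalg.norm(z_true)
print(rel(z_direct), rel(z_tsvd))              # TSVD error is orders of magnitude smaller
```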
The water quality grades of phosphate (PO4-P) and dissolved inorganic nitrogen (DIN) are integrated by spatial partitioning to fit the global and local semi-variograms of these nutrients. Leave-one-out cross validation is used to determine the statistical inference method: to minimize absolute average errors and error mean squares, stratified Kriging (SK) interpolation is applied to DIN and ordinary Kriging (OK) interpolation is applied to PO4-P. Ten percent of the sites are adjusted by considering their impact on the change in deviations in DIN and PO4-P interpolation and the resultant effect on areas with different water quality grades; thus, seven redundant historical sites are removed. These seven historical sites are distributed in areas with water quality poorer than Grade IV at the north and south branches of the Changjiang (Yangtze River) Estuary and at the coastal region north of Hangzhou Bay. Numerous sites are installed in these regions, the contents of various elements in the waters are not remarkably changed, and the waters are well mixed. The seven sites that have been optimized out are set to water with quality Grades III and IV. Optimization and adjustment of unrestricted areas show that the optimized and adjusted sites are mainly distributed in regions where the water quality grade undergoes transition. Therefore, key sites for adjustment and optimization are located at the boundaries between seawater areas of different water quality grades.
Funding: the National Natural Science Foundation of China under contract Nos 41376190, 41271404, 41531179, 41421001 and 41601425; the Open Funds of the Key Laboratory of Integrated Monitoring and Applied Technologies for Marine Harmful Algal Blooms, SOA, under contract No. MATHA201120204; the Scientific Research Project of Shanghai Marine Bureau under contract No. Hu Hai Ke 2016-05; and the Ocean Public Welfare Scientific Research Project, State Oceanic Administration of the People's Republic of China, under contract Nos 201305027 and 201505008.
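Leave-one-out checking of the interpolator can be sketched with a Gaussian-process regressor standing in for ordinary Kriging (closely related machinery, since Kriging is GP prediction with a fitted covariance model); the stations, field, and kernel below are synthetic assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import LeaveOneOut

# Leave-one-out check of an interpolated nutrient field over monitoring sites.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, (40, 2))                     # site coordinates (km)
din = 0.05 * coords[:, 0] + rng.normal(0, 0.3, 40)        # DIN-like gradient field

abs_err = []
for tr, te in LeaveOneOut().split(coords):
    gp = GaussianProcessRegressor(kernel=Matern(length_scale=30.0, nu=1.5),
                                  normalize_y=True, alpha=1e-2)
    gp.fit(coords[tr], din[tr])
    abs_err.append(abs(gp.predict(coords[te])[0] - din[te][0]))

print(np.mean(abs_err))   # small per-site errors flag redundant, removable sites
```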