In deriving a regression model, analysts often have to use variable selection, despite the problems introduced by data-dependent model building. Resampling approaches are proposed to handle some of the critical issues. In order to assess and compare several strategies, we conduct a simulation study with 15 predictors and a complex correlation structure in the linear regression model. Using sample sizes of 100 and 400 and estimates of the residual variance corresponding to R² of 0.50 and 0.71, we consider four scenarios with varying amounts of information. We also consider two examples with 24 and 13 predictors, respectively. We discuss the value of cross-validation, shrinkage, and backward elimination (BE) with varying significance levels. We assess whether two-step approaches using global or parameterwise shrinkage (PWSF) can improve selected models, and we compare the results to models derived with the LASSO procedure. Besides MSE, we use model sparsity and further criteria for model assessment. The amount of information in the data influences the selected models and the comparison of the procedures. None of the approaches was best in all scenarios. The performance of backward elimination with a suitably chosen significance level was not worse than that of the LASSO, and the BE models selected were much sparser, an important advantage for interpretation and transportability. Compared to global shrinkage, PWSF had better performance. Provided that the amount of information is not too small, we conclude that BE followed by PWSF is a suitable approach when variable selection is a key part of data analysis.
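To make the two-step strategy concrete, here is a minimal sketch, assuming a pandas DataFrame X of candidate predictors and a Series y: backward elimination at a chosen significance level, followed by cross-validation-estimated parameterwise shrinkage factors. The significance level, fold count, and data layout are illustrative assumptions, not the exact protocol of the study.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

def backward_elimination(X, y, alpha=0.157):
    """Drop the least significant predictor until all p-values are below
    alpha (0.157 roughly corresponds to AIC-based selection).
    X: pandas DataFrame of candidate predictors, y: pandas Series response."""
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")
        if pvals.max() < alpha:
            return cols
        cols.remove(pvals.idxmax())
    return cols

def parameterwise_shrinkage(X, y, cols, n_splits=5):
    """One shrinkage factor per selected coefficient: regress y on the
    cross-validated componentwise contributions x_j * beta_j."""
    contrib = np.zeros((len(y), len(cols)))
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        fit = sm.OLS(y.iloc[train], sm.add_constant(X[cols].iloc[train])).fit()
        contrib[test] = X[cols].iloc[test].values * fit.params[cols].values
    # Coefficients of the held-out contributions are the shrinkage factors
    return sm.OLS(y, sm.add_constant(contrib)).fit().params.iloc[1:]
```

A selected coefficient whose estimated factor is well below 1 is flagged as unstable by the cross-validation and shrunk accordingly, which is the intended corrective effect of the second step.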
Aviation accidents are currently one of the leading causes of significant injuries and deaths worldwide. This entices researchers to investigate aircraft safety using data analysis approaches based on advanced machine learning algorithms. To assess aviation safety and identify the causes of incidents, a classification model with a light gradient boosting machine (LGBM) based on the aviation safety reporting system (ASRS) has been developed. It is improved by k-fold cross-validation with a hybrid sampling model (HSCV), which may boost classification performance and maintain data balance. The results show that employing the LGBM-HSCV model can significantly improve accuracy while alleviating data imbalance. The comparative approach comprises vertical comparison with other cross-validation (CV) methods and lateral comparison with different numbers of folds. Aside from this comparison, two further CV approaches based on the improved method are discussed: one with a different sampling and folding order, and the other with additional rounds of CV. According to the assessment indices obtained with the different methods, the LGBM-HSCV model proposed here is effective at detecting incident causes. The improved model for imbalanced data categorization may serve as a point of reference for similar data processing, and the model's accurate identification of civil aviation incident causes can help to improve civil aviation safety.
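A hedged sketch of the general idea behind cross-validation with hybrid sampling: each training fold is rebalanced before fitting LightGBM, while validation folds keep their natural class distribution. The specific hybrid sampler (SMOTEENN, which combines SMOTE oversampling with edited-nearest-neighbour undersampling, from the imbalanced-learn package) is an assumption; the abstract notes that changing the sampling and folding order is itself a variant worth comparing.

```python
# Minimal sketch of k-fold CV with per-fold hybrid sampling. X, y are numpy
# arrays; only each training fold is resampled, so validation scores reflect
# the true class distribution.
import numpy as np
from imblearn.combine import SMOTEENN
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def hybrid_sampling_cv(X, y, n_splits=5, seed=0):
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, val in skf.split(X, y):
        # Rebalance the training fold only (over- plus under-sampling)
        X_tr, y_tr = SMOTEENN(random_state=seed).fit_resample(X[train], y[train])
        model = LGBMClassifier(random_state=seed).fit(X_tr, y_tr)
        scores.append(f1_score(y[val], model.predict(X[val]), average="macro"))
    return float(np.mean(scores))
```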
For the nonparametric regression model $Y_{ni} = g(x_{ni}) + \varepsilon_{ni}$, $i = 1, \ldots, n$, with regularly spaced nonrandom design, the authors study the behavior of the nonlinear wavelet estimator of $g(x)$. When the threshold and truncation parameters are chosen by cross-validation on the average squared error, strong consistency for the case of dyadic sample size and moment consistency for arbitrary sample size are established under some regularity conditions.
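In the spirit of the cross-validation described above, a simplified sketch for choosing the wavelet threshold on a regularly spaced design: split the observations into even- and odd-indexed halves, smooth one half by soft-thresholding its wavelet coefficients, and score against the other half on average squared error. The wavelet family, threshold grid, and even/odd splitting scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
import pywt

def wavelet_smooth(y, threshold, wavelet="db4"):
    """Soft-threshold the detail coefficients and reconstruct."""
    coeffs = pywt.wavedec(y, wavelet)
    coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft")
                            for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(y)]

def cv_threshold(y, thresholds):
    """Pick the threshold minimizing squared error of the smoothed
    even-indexed half against the odd-indexed half."""
    even, odd = y[0::2], y[1::2]
    n = min(len(even), len(odd))
    score = lambda t: np.mean((wavelet_smooth(even[:n], t) - odd[:n]) ** 2)
    return min(thresholds, key=score)

# Toy example on a regularly spaced design
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 512)
y = np.sin(4 * np.pi * x) + 0.3 * rng.standard_normal(512)
best_t = cv_threshold(y, np.linspace(0.0, 1.0, 21))
```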
Background: Cardiovascular diseases are closely linked to atherosclerotic plaque development and rupture. Plaque progression prediction is of fundamental significance to cardiovascular research and disease diagnosis, prevention, and treatment. Generalized linear mixed models (GLMM) are an extension of linear models to categorical responses that accounts for the correlation among observations. Methods: Magnetic resonance image (MRI) data of carotid atherosclerotic plaques were acquired from 20 patients with consent obtained, and 3D thin-layer models were constructed to calculate plaque stress and strain for plaque progression prediction. Data for ten morphological and biomechanical risk factors, including wall thickness (WT), lipid percent (LP), minimum cap thickness (MinCT), plaque area (PA), plaque burden (PB), lumen area (LA), maximum plaque wall stress (MPWS), maximum plaque wall strain (MPWSn), average plaque wall stress (APWS), and average plaque wall strain (APWSn), were extracted from all slices for analysis. Wall thickness increase (WTI), plaque burden increase (PBI), and plaque area increase (PAI) were chosen as three measures of plaque progression. GLMM with a 5-fold cross-validation strategy were used to calculate the prediction accuracy of each predictor and identify the optimal predictor, with prediction accuracy defined as the sum of sensitivity and specificity. All 201 MRI slices were randomly divided into 4 training subgroups and 1 verification subgroup. The training subgroups were used for model fitting, and the verification subgroup was used to evaluate the model. All combinations (1023 in total) of the 10 risk factors were fed to the GLMM, and the prediction accuracy of each predictor was taken from the point on the ROC (receiver operating characteristic) curve with the highest sum of specificity and sensitivity. Results: LA was the best single predictor for PBI with the highest prediction accuracy (1.3601) and an area under the ROC curve (AUC) of 0.6540, followed by APWSn (1.3363) with AUC = 0.6342. The optimal predictor among all possible combinations for PBI was the combination of LA, PA, LP, WT, MPWS and MPWSn, with prediction accuracy = 1.4146 (AUC = 0.7158). LA was once again the best single predictor for PAI with the highest prediction accuracy (1.1846) with AUC = 0.6064, followed by MPWSn (1.1832) with AUC = 0.6084. The combination of PA, PB, WT, MPWS, MPWSn and APWSn gave the best prediction accuracy (1.3025) for PAI, with an AUC value of 0.6657. PA was the best single predictor for WTI with the highest prediction accuracy (1.2887) with AUC = 0.6415, followed by WT (1.2540) with AUC = 0.6097. The combination of PA, PB, WT, LP, MinCT, MPWS and MPWSn was the best predictor for WTI, with a prediction accuracy of 1.3140 and AUC = 0.6552. This indicated that PBI was a more predictable measure than WTI and PAI. The combinational predictors improved prediction accuracy by 9.95%, 4.01% and 1.96% over the best single predictors for PAI, PBI and WTI (AUC values improved by 9.78%, 9.45%, and 2.14%), respectively. Conclusions: The use of GLMM with a 5-fold cross-validation strategy combining both morphological and biomechanical risk factors could potentially improve the accuracy of carotid plaque progression prediction. This study suggests that a linear combination of multiple predictors can provide potential improvement over existing plaque assessment schemes.
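The evaluation metric above (prediction accuracy as the maximum of sensitivity plus specificity over the ROC curve) can be sketched as follows; a plain logistic regression stands in for the GLMM, so within-patient correlation is ignored in this simplification, and the data layout is assumed.

```python
# Sketch: 5-fold cross-validated probabilities, then the ROC point with the
# highest sensitivity + specificity (the "prediction accuracy" above).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

def prediction_accuracy(X, y):
    """X: (n_slices, n_risk_factors) array, y: binary progression labels."""
    prob = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                             cv=5, method="predict_proba")[:, 1]
    fpr, tpr, _ = roc_curve(y, prob)
    # sensitivity = tpr, specificity = 1 - fpr
    return (tpr + (1.0 - fpr)).max(), roc_auc_score(y, prob)
```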
Heart disease remains a leading cause of mortality worldwide, emphasizing the urgent need for reliable and interpretable predictive models to support early diagnosis and timely intervention. However, existing Deep Learning (DL) approaches often face several limitations, including inefficient feature extraction, class imbalance, suboptimal classification performance, and limited interpretability, which collectively hinder their deployment in clinical settings. To address these challenges, we propose a novel DL framework for heart disease prediction that integrates a comprehensive preprocessing pipeline with an advanced classification architecture. The preprocessing stage involves label encoding and feature scaling. To address the class imbalance inherent in the personal key indicators of the heart disease dataset, the localized random affine shadowsampling technique is employed, which enhances minority class representation while minimizing overfitting. At the core of the framework lies the Deep Residual Network (DeepResNet), which employs hierarchical residual transformations to facilitate efficient feature extraction and capture complex, non-linear relationships in the data. Experimental results demonstrate that the proposed model significantly outperforms existing techniques, achieving improvements of 3.26% in accuracy, 3.16% in area under the receiver operating characteristic curve, 1.09% in recall, and 1.07% in F1-score. Furthermore, robustness is validated using 10-fold cross-validation, confirming the model's generalizability across diverse data distributions. Moreover, model interpretability is ensured through the integration of Shapley additive explanations and local interpretable model-agnostic explanations, offering valuable insights into the contribution of individual features to model predictions. Overall, the proposed DL framework presents a robust, interpretable, and clinically applicable solution for heart disease prediction.
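As a rough illustration of the validation and interpretability steps only (not the DeepResNet architecture itself, and with synthetic placeholder data), the sketch below runs stratified 10-fold cross-validation on a stand-in neural network and then computes SHAP attributions:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Placeholder imbalanced data standing in for the heart disease indicators
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)

# 10-fold stratified cross-validation as a robustness check
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold F1:", cross_val_score(model, X, y, cv=cv, scoring="f1").mean())

# SHAP attributions for the fitted model (permutation-based, model-agnostic)
model.fit(X, y)
explainer = shap.Explainer(lambda Z: model.predict_proba(Z)[:, 1], X[:100])
shap_values = explainer(X[:200])
```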
Sustainable forecasting of home energy demand (SFHED) is crucial for promoting energy efficiency, minimizing environmental impact, and optimizing resource allocation. Machine learning (ML) supports SFHED by identifying patterns and forecasting demand. However, conventional hyperparameter tuning methods often rely solely on minimizing average prediction errors, typically through fixed k-fold cross-validation, which overlooks error variability and limits model robustness. To address this limitation, we propose the Optimized Robust Hyperparameter Tuning for Machine Learning with Enhanced Multi-fold Cross-Validation (ORHT-ML-EMCV) framework. This method integrates statistical analysis of k-fold validation errors by incorporating their mean and variance into the optimization objective, enhancing robustness and generalizability. A weighting factor is introduced to balance accuracy and robustness, and its impact is evaluated across a range of values. A novel Enhanced Multi-Fold Cross-Validation (EMCV) technique is employed to automatically evaluate model performance across varying fold configurations without requiring a predefined k value, thereby reducing sensitivity to data splits. Using three evolutionary algorithms, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Differential Evolution (DE), we optimize two ensemble models: XGBoost and LightGBM. The optimization process minimizes both the mean error and the variance, with robustness assessed through cumulative distribution function (CDF) analyses. Experiments on three real-world residential datasets show the proposed method reduces worst-case Root Mean Square Error (RMSE) by up to 19.8% and narrows confidence intervals by up to 25%. Cross-household validations confirm strong generalization, achieving coefficients of determination (R²) of 0.946 and 0.972 on unseen homes. The framework offers a statistically grounded and efficient solution for robust energy forecasting.
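The core of the robust objective can be sketched in a few lines, assuming numpy arrays and any sklearn-style regressor: score a hyperparameter setting by the mean k-fold RMSE plus a weighted variance term, so settings with erratic fold errors are penalized. The weighting factor w is the balance knob whose impact the paper evaluates; the fold count and the exact functional form here are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def robust_cv_objective(model, X, y, w=0.5, n_splits=5, seed=0):
    """Mean k-fold RMSE plus w times its variance; lower is better."""
    errors = []
    for train, val in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model.fit(X[train], y[train])
        errors.append(np.sqrt(mean_squared_error(y[val],
                                                 model.predict(X[val]))))
    errors = np.asarray(errors)
    return errors.mean() + w * errors.var()
```

An evolutionary optimizer such as scipy.optimize.differential_evolution could then minimize this objective over the hyperparameter space, in the spirit of the DE variant evaluated above.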
A total of 91 measured values of the development height of the water-conducting fracture zone (WCFZ) in deep and thick coal seam mining faces under thick loose layer conditions were collected. Five key characteristic variables influencing the WCFZ height were identified. After removing outliers from the dataset, a Random Forest (RF) regression model optimized by the Sparrow Search Algorithm (SSA) was constructed. The hyperparameters of the RF model were iteratively optimized by minimizing the Out-of-Bag (OOB) error, resulting in the rapid determination of optimal parameters. Specifically, the SSA-RF model achieved an OOB error of 0.148, with 20 decision trees, a maximum depth of 8, a minimum split sample size of 2, and a minimum leaf node sample size of 1. Cross-validation experiments were performed using the trained optimal model and compared against other prediction methods. The results showed that the mining height had the most significant correlation with the development height of the WCFZ. The SSA-RF model outperformed all other models, with R² values exceeding 0.9 across the training, validation, and test datasets. Compared to other models, the SSA-RF model demonstrates a simpler structure, stronger fitting capacity, higher predictive accuracy, and superior stability and generalization ability. It also exhibits the smallest variation in relative error across datasets, indicating excellent adaptability to different data conditions. Furthermore, a numerical model was developed using the hydrogeological data from the 1305 working face at Wanfukou Coal Mine, Shandong Province, China, to simulate the dynamic development of the WCFZ during mining. The SSA-RF model predicted the WCFZ height to be 69.7 m, closely aligning with the PFC2D simulation result of 65 m, with an error of less than 5%. Compared to traditional methods and numerical simulations, the SSA-RF model provides more accurate predictions, showing only a 7.23% deviation from the PFC2D simulation, while traditional empirical formulas yield deviations as large as 19.97%. These results demonstrate the SSA-RF model's superior predictive capability, reinforcing its reliability and engineering applicability for real-world mining operations. This model holds significant potential for enhancing mining safety and optimizing planning processes, offering a more accurate and efficient approach for WCFZ height prediction.
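A minimal sketch of hyperparameter selection by minimizing OOB error: synthetic data stands in for the 91 observations and 5 characteristic variables, and a plain random search stands in for the Sparrow Search Algorithm, which standard libraries do not provide.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder data standing in for the 91 WCFZ observations, 5 variables
X, y = make_regression(n_samples=91, n_features=5, noise=10.0, random_state=0)

def oob_error(params):
    rf = RandomForestRegressor(**params, oob_score=True, bootstrap=True,
                               random_state=0).fit(X, y)
    return 1.0 - rf.oob_score_          # oob_score_ is the out-of-bag R^2

# Random search over the same four hyperparameters tuned by SSA above
rng = np.random.default_rng(0)
candidates = [dict(n_estimators=int(rng.integers(20, 200)),
                   max_depth=int(rng.integers(2, 15)),
                   min_samples_split=int(rng.integers(2, 10)),
                   min_samples_leaf=int(rng.integers(1, 5)))
              for _ in range(30)]
best = min(candidates, key=oob_error)
print("best hyperparameters:", best, "| OOB error:", round(oob_error(best), 3))
```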
Spartina alterniflora is now listed among the world's 100 most dangerous invasive species, severely affecting the ecological balance of coastal wetlands. Remote sensing technologies based on deep learning enable large-scale monitoring of Spartina alterniflora, but they require large datasets and have poor interpretability. A new method is proposed to detect Spartina alterniflora from Sentinel-2 imagery. Firstly, to capture the high canopy cover and dense community characteristics of Spartina alterniflora, multi-dimensional shallow features are extracted from the imagery. Secondly, to detect different objects from satellite imagery, index features are extracted, and the statistical features of the Gray-Level Co-occurrence Matrix (GLCM) are derived using principal component analysis. Then, ensemble learning methods, including random forest, extreme gradient boosting, and light gradient boosting machine models, are employed for image classification. Meanwhile, Recursive Feature Elimination with Cross-Validation (RFECV) is used to select the best feature subset. Finally, to enhance the interpretability of the models, the best features are utilized to classify multi-temporal images, and SHapley Additive exPlanations (SHAP) is combined with these classifications to explain the model prediction process. The method is validated using Sentinel-2 imagery and previous observations of Spartina alterniflora on Chongming Island; the model combining image texture features such as GLCM variance improves the detection accuracy of Spartina alterniflora by about 8% compared with the model without image texture features. Through multiple model comparisons and feature selection via RFECV, the selected model and eight features demonstrated good classification accuracy when applied to data from different time periods, proving that feature reduction can effectively enhance model generalization. Additionally, visualizing model decisions using SHAP revealed that the image texture feature component_1_GLCMVariance is particularly important for identifying each land cover type.
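The RFECV step is available directly in scikit-learn; the sketch below shows it on placeholder data standing in for the Sentinel-2 feature stack (index features, GLCM statistics, and other shallow features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Placeholder feature matrix standing in for the extracted image features
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)

# Recursively drop the weakest feature; keep the subset with the best CV score
selector = RFECV(RandomForestClassifier(n_estimators=200, random_state=0),
                 step=1, cv=5, scoring="accuracy").fit(X, y)
print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
```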
The purpose of this article is to investigate approaches for modeling individual patient count/rate data over time, accounting for temporal correlation and non-constant dispersions while requiring reasonable amounts of time to search over alternative models for those data. This research addresses formulations for two approaches for extending generalized estimating equations (GEE) modeling. These approaches use a likelihood-like function based on the multivariate normal density. The first approach augments standard GEE equations to include equations for estimation of dispersion parameters. The second approach is based on estimating equations determined by partial derivatives of the likelihood-like function with respect to all model parameters and so extends linear mixed modeling. Three correlation structures are considered: independent, exchangeable, and spatial autoregressive of order 1. The likelihood-like function is used to formulate a likelihood-like cross-validation (LCV) score for use in evaluating models. Example analyses are presented using these two modeling approaches applied to three data sets of counts/rates over time for individual cancer patients: pain flares per day, as-needed pain medications taken per day, and around-the-clock pain medications taken per day per dose. Means and dispersions are modeled as possibly nonlinear functions of time using adaptive regression modeling methods to search through alternative models compared using LCV scores. The results of these analyses demonstrate that extended linear mixed modeling is preferable for modeling individual patient count/rate data over time, because in example analyses it either generates better LCV scores or more parsimonious models, and it requires substantially less time.
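A hedged sketch of an LCV score of this general kind, under strong simplifying assumptions (independent correlation structure, a linear mean model in time, constant dispersion): estimate the model on each training fold, evaluate the normal log-density of the held-out fold, and report the per-observation geometric-mean likelihood. The paper's adaptive nonlinear mean/dispersion models are far richer than this stand-in.

```python
import numpy as np
from scipy.stats import norm
from sklearn.model_selection import KFold

def lcv_score(t, y, n_splits=5):
    """t, y: 1D arrays of observation times and counts/rates.
    Returns the cross-validated per-observation geometric-mean likelihood."""
    loglik = np.zeros(len(y))
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(t):
        beta = np.polyfit(t[train], y[train], deg=1)       # mean model in time
        resid = y[train] - np.polyval(beta, t[train])
        sigma = resid.std(ddof=2)                          # constant dispersion
        loglik[test] = norm.logpdf(y[test], np.polyval(beta, t[test]), sigma)
    return np.exp(loglik.mean())   # larger LCV scores indicate better models
```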
Purpose: To formulate and demonstrate methods for regression modeling of probabilities and dispersions for individual-patient longitudinal outcomes taking on discrete numeric values. Methods: Three alternatives for modeling of outcome probabilities are considered. Multinomial probabilities are based on different intercepts and slopes for probabilities of different outcome values. Ordinal probabilities are based on different intercepts and the same slope for probabilities of different outcome values. Censored Poisson probabilities are based on the same intercept and slope for probabilities of different outcome values. Parameters are estimated with extended linear mixed modeling maximizing a likelihood-like function based on the multivariate normal density that accounts for within-patient correlation. Formulas are provided for gradient vectors and Hessian matrices for estimating model parameters. The likelihood-like function is also used to compute cross-validation scores for alternative models and to control an adaptive modeling process for identifying possibly nonlinear functional relationships in predictors for probabilities and dispersions. Example analyses are provided of daily pain ratings for a cancer patient over a period of 97 days. Results: The censored Poisson approach is preferable for modeling these data, and presumably other data sets of this kind, because it generates a competitive model with fewer parameters in less time than the other two approaches. The generated probabilities for this model are distinctly nonlinear in time while the dispersions are distinctly nonconstant over time, demonstrating the need for adaptive modeling of such data. The analyses also address the dependence of these daily pain ratings on time and the daily numbers of pain flares. Probabilities and dispersions change differently over time for different numbers of pain flares. Conclusions: Adaptive modeling of daily pain ratings for individual cancer patients is an effective way to identify nonlinear relationships in time as well as in other predictors such as the number of pain flares.
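For concreteness, the censored Poisson construction can be written out directly: outcome values below the cap keep their Poisson mass, and the top value absorbs the remaining upper tail, which is why a single intercept and slope (acting through the rate) drive the probabilities of all categories. The rate and cap below are illustrative values, not estimates from the study.

```python
import numpy as np
from scipy.stats import poisson

def censored_poisson_pmf(k, rate, cap):
    """P(Y = k) for k < cap; P(Y >= cap) collapsed onto k = cap."""
    k = np.asarray(k)
    return np.where(k < cap, poisson.pmf(k, rate), poisson.sf(cap - 1, rate))

# Example: pain ratings 0-10, with the top rating absorbing the upper tail
probs = censored_poisson_pmf(np.arange(11), rate=3.2, cap=10)
assert abs(probs.sum() - 1.0) < 1e-12   # the censored pmf sums to one
```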
Identification and counting of rice light-trap pests are important for monitoring rice pest population dynamics and making pest forecasts. Identifying and counting rice light-trap pests manually is time-consuming and leads to fatigue and an increased error rate. A rice light-trap insect imaging system is developed to automate rice pest identification. This system can capture the top and bottom images of each insect with two cameras to obtain more image features. A method is proposed for removing the background by the color difference of two images with pests and non-pests. For each pest, 156 features, including color, shape, and texture features, are extracted and fed into a support vector machine (SVM) classifier with a radial basis kernel function. Seven-fold cross-validation is used to improve the accuracy of pest identification. Four species of Lepidoptera rice pests are tested, achieving a 97.5% average accuracy.
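The classification step maps directly onto standard tooling; here is a sketch with placeholder data standing in for the 156 extracted features (the imaging and feature-extraction pipeline is not reproduced, and the SVM hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 4 pest species, 156 color/shape/texture features
X, y = make_classification(n_samples=400, n_features=156, n_classes=4,
                           n_informative=20, random_state=0)

# RBF-kernel SVM with feature scaling, scored by 7-fold cross-validation
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
scores = cross_val_score(clf, X, y,
                         cv=StratifiedKFold(7, shuffle=True, random_state=0))
print("mean 7-fold accuracy:", scores.mean())
```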
Funding (aviation incident classification study): supported by the National Natural Science Foundation of China Civil Aviation Joint Fund (U1833110) and by Research on the Dual Prevention Mechanism and Intelligent Management Technology for Civil Aviation Safety Risks (YK23-03-05).
Funding (carotid plaque progression study): supported in part by National Natural Science Foundation of China grant 11672001, Jiangsu Province Science and Technology Agency grant BE2016785, and Postgraduate Research & Practice Innovation Program of Jiangsu Province grant KYCX18_0156.
Funding (heart disease prediction study): funded by the Ongoing Research Funding Program, project number ORF-2025-648, King Saud University, Riyadh, Saudi Arabia.
Funding (WCFZ height prediction study): supported by the National Natural Science Foundation of China (51774199) and a project of the Educational Department of Liaoning Province (No. LJKMZ20220825).
Funding (Spartina alterniflora detection study): supported by the National Key Research and Development Program of China under contract No. 2023YFC3008204 and the National Natural Science Foundation of China under contract Nos. 41977302 and 42476217.
Funding (rice light-trap pest identification study): supported by the National Natural Science Foundation of China (31071678), the Major Scientific and Technological Special Project of Zhejiang Province, China (2010C12026), the Ningbo Science and Technology Project, China (201002C1011001), and the Xiangshan Science and Technology Project, China (2010C0001).