Abstract: In the era of advanced machine learning techniques, the development of accurate predictive models for complex medical conditions, such as thyroid cancer, has shown remarkable progress. Accurate predictive models for thyroid cancer enhance early detection, improve resource allocation, and reduce overtreatment. However, the widespread adoption of these models in clinical practice demands predictive performance along with interpretability and transparency. This paper proposes a novel association-rule based feature-integrated machine learning model which shows better classification and prediction accuracy than present state-of-the-art models. Our study also focuses on the application of SHapley Additive exPlanations (SHAP) values as a powerful tool for explaining thyroid cancer prediction models. In the proposed method, the association-rule based feature integration framework identifies frequently occurring attribute combinations in the dataset. The original dataset is used in training machine learning models and in generating SHAP values from these models. In the next phase, the dataset is integrated with the dominant feature sets identified through association-rule based analysis, and this integrated dataset is used to re-train the machine learning models. The new SHAP values generated from these models help validate the contributions of the feature sets in predicting malignancy. Conventional machine learning models lack interpretability, which can hinder their integration into clinical decision-making systems. In this study, SHAP values are introduced along with association-rule based feature integration as a comprehensive framework for understanding the contributions of feature sets in modelling the predictions. The study discusses the importance of reliable predictive models for early diagnosis of thyroid cancer and provides a validation framework for explainability. The proposed model shows an accuracy of 93.48%. Performance metrics such as precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUROC) are also higher than those of the baseline models. The results of the proposed model help identify the dominant feature sets that impact thyroid cancer classification and prediction. The features {calcification} and {shape} consistently emerged as the top-ranked features associated with thyroid malignancy, in both the association-rule based interestingness metrics and the SHAP methods. The paper highlights the potential of rule-based integrated models with SHAP in bridging the gap between machine learning predictions and the interpretability required for real-world medical applications.
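As a rough illustration of the kind of pipeline this abstract describes (not the authors' code), the sketch below mines frequent attribute combinations with mlxtend, appends the dominant itemsets as conjunction features, retrains a classifier, and recomputes SHAP values. The file name, the "malignant" label column, and all thresholds are assumptions for a binary-encoded thyroid dataset.

```python
# Illustrative sketch only: association-rule based feature integration + SHAP,
# assuming a one-hot/binary thyroid dataset with a "malignant" label column.
import pandas as pd
import shap
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("thyroid_binary.csv")                     # assumed file name
X, y = df.drop(columns=["malignant"]), df["malignant"]

# Step 1: mine frequently co-occurring attribute combinations.
itemsets = apriori(X.astype(bool), min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="lift", min_threshold=1.2)

# Step 2: integrate dominant feature sets as new conjunction features.
X_int = X.copy()
for fs in itemsets.sort_values("support", ascending=False)["itemsets"].head(5):
    cols = list(fs)
    if len(cols) > 1:
        X_int["&".join(cols)] = X[cols].all(axis=1).astype(int)

# Step 3: re-train on the integrated dataset and inspect SHAP attributions.
X_tr, X_te, y_tr, y_te = train_test_split(X_int, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
sv = shap.TreeExplainer(model).shap_values(X_te)
sv = sv[1] if isinstance(sv, list) else (sv[:, :, 1] if sv.ndim == 3 else sv)  # malignant class
shap.summary_plot(sv, X_te)
```

Comparing the summary plots before and after Step 2 is the kind of cross-check the abstract describes between rule interestingness metrics and SHAP rankings.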
Funding: Funded by the National Social Science Fund of China (No. 22AGJ006).
Abstract: This study proposed a cutting-edge, multistep workflow and upgraded it by addressing its flaw of not considering how to determine the index system objectively. It then used the updated workflow to identify the probability of China's systemic financial crisis and analyzed the impact of macroeconomic indicators on the crisis. The final workflow comprises four steps: selecting rational indicators, modeling using supervised learning, decomposing the model's internal function, and conducting non-linear, non-parametric statistical inference, with the advantages of objective index selection, accurate prediction, and high model transparency. In addition, since China's international influence is progressively increasing, and the report of the 19th National Congress of the Communist Party of China has demonstrated that China is facing severe risk-control challenges and stressed that the government should ensure that no systemic risks emerge, this study selected China's systemic financial crisis as an example. Specifically, one global trade factor and 11 country-level macroeconomic indicators were selected to build the machine learning models. The prediction models captured six risk-rising periods in China's financial system from 1990 to 2020, which is consistent with reality. The interpretation techniques show the non-linearities of the risk drivers, expressed as threshold and interval effects. Furthermore, Shapley regression validates the alignment of the indicators. The final workflow is suitable for classification and regression analyses in several areas. These methods can also be used independently or in combination, depending on the research requirements. Researchers can switch to other suitable shallow machine learning models or deep neural networks for modeling. The results regarding crises could provide specific references for bank regulators and policymakers to develop critical measures to maintain macroeconomic and financial stability.
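Purely as a sketch of the "decompose, then test" logic behind steps 2-4 of the workflow, the snippet below fits a gradient-boosted classifier to hypothetical crisis data, decomposes it into per-indicator SHAP contributions, and regresses the label on those contributions as a simplified stand-in for the paper's Shapley-regression style inference. The file name, the "crisis" column, and the hyperparameters are assumptions.

```python
# Illustrative sketch only: supervised model -> SHAP decomposition -> simple
# significance check (a stand-in for the non-linear, non-parametric inference).
import pandas as pd
import shap
import statsmodels.api as sm
from xgboost import XGBClassifier

df = pd.read_csv("china_macro_indicators.csv")   # assumed: 12 indicators + 0/1 crisis flag
X, y = df.drop(columns=["crisis"]), df["crisis"]

# Step 2: supervised learning on the selected indicators.
model = XGBClassifier(n_estimators=400, max_depth=3).fit(X, y)

# Step 3: decompose the fitted function into per-indicator SHAP contributions.
phi = pd.DataFrame(shap.TreeExplainer(model).shap_values(X), columns=X.columns)

# Step 4 (simplified): regress the crisis label on the SHAP contributions and
# check that each indicator's attribution aligns with the outcome.
ols = sm.OLS(y, sm.add_constant(phi)).fit(cov_type="HC1")
print(ols.summary())
```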
Funding: Supported by the German Federal Ministry for Economic Affairs and Climate Action in the framework of the research program EnOB: ML-EBESR 03EN1076B.
Abstract: The role of heating load forecasts in the energy transition is significant, given the considerable increase in the number of heat pumps and the growing prevalence of fluctuating electricity generation. While machine learning methods offer promising forecasting capabilities, their black-box nature makes them difficult to interpret and explain. The deployment of explainable artificial intelligence methodologies enables the actions of these machine learning models to be made transparent. In this study, a multi-step forecast was produced using an Encoder-Decoder model to forecast the hourly heating load for a multifamily residential building and a district heating system over a forecast horizon of 24 h. By using 24 instead of 48 lagged hours, the simulation time was reduced from 92.75 s to 45.80 s and the forecast accuracy was increased. Feature selection was conducted with four distinct methods, of which the Tree SHAP and Deep SHAP methods yielded superior results. Applying feature selection according to the Deep SHAP values resulted in a 3.98% reduction in training time and an 8.11% reduction in the NRMSE. The use of local Deep SHAP values enables the visualisation of the influence of past input hours and individual features. By mapping temporal attention, it was possible to demonstrate the importance of the most recent time steps in an intrinsic way. The combination of explainable methods enables plant operators to gain further insight into, and trust in, the purely data-driven forecast model, and to identify the importance of individual features and time steps.
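The sketch below shows one way an Encoder-Decoder forecaster with 24 lagged hours and a 24-hour horizon can be wired up in Keras and attributed to its inputs. It is not the study's implementation: layer sizes, feature count, and the placeholder data are assumptions, and shap.GradientExplainer is used as a more version-tolerant stand-in for the Deep SHAP attribution reported in the abstract.

```python
# Minimal sketch, assumed architecture: seq2seq heating-load forecaster with
# 24 lagged hours in, 24 forecast hours out, plus SHAP attributions per lag hour.
import numpy as np
import shap
import tensorflow as tf

LAGS, HORIZON, N_FEATURES = 24, 24, 6      # e.g. load, outdoor temperature, hour-of-day, ...

inputs = tf.keras.Input(shape=(LAGS, N_FEATURES))
encoded = tf.keras.layers.LSTM(64)(inputs)                        # encoder
repeated = tf.keras.layers.RepeatVector(HORIZON)(encoded)         # bridge to decoder
decoded = tf.keras.layers.LSTM(64, return_sequences=True)(repeated)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))(decoded)
outputs = tf.keras.layers.Flatten()(outputs)                      # (batch, HORIZON)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

X_train = np.random.rand(1000, LAGS, N_FEATURES).astype("float32")  # placeholder data
y_train = np.random.rand(1000, HORIZON).astype("float32")
model.fit(X_train, y_train, epochs=2, verbose=0)

# Attribute the forecast to lagged input hours and features; the result container
# holds one attribution per forecast hour (exact shape depends on the shap version).
background = X_train[np.random.choice(len(X_train), 100, replace=False)]
explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(X_train[:10])
```

Averaging the absolute attributions over the lag axis gives the per-feature ranking used for feature selection; averaging over the feature axis shows how strongly the most recent hours dominate.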
Funding: Supported by the Key Scientific and Technological Project Plan of Hebei Iron and Steel Group (No. HG2023235).
Abstract: Machine learning is employed to comprehensively analyze and predict the hardenability of 20CrMo steel. The hardenability dataset includes J9 and J15 hardenability values, chemical composition, and heat treatment parameters. Various machine learning models, including linear regression (LR), k-nearest neighbors (KNN), random forest (RF), and eXtreme Gradient Boosting (XGBoost), are employed to develop predictive models for the hardenability of 20CrMo steel. Among these models, the XGBoost model achieves the best performance, with coefficients of determination (R²) of 0.941 and 0.946 for predicting the J9 and J15 values, respectively. The predictions fall within a ±2 HRC bandwidth for 98% of J9 cases and 99% of J15 cases. Additionally, SHapley Additive exPlanations (SHAP) analysis is used to identify the key elements that significantly influence the hardenability of 20CrMo steel. The analysis reveals that alloying elements such as Si, Cr, C, N, and Mo play significant roles in hardenability. The strengths and weaknesses of the various machine learning models in predicting hardenability are also discussed.
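A minimal sketch of the J9 branch of such a model, with assumed file and column names: an XGBoost regressor, the R² and ±2 HRC hit-rate metrics quoted above, and a SHAP ranking of the composition and heat-treatment inputs.

```python
# Illustrative sketch only (assumed data layout): XGBoost regression for the J9
# Jominy position, +/-2 HRC hit-rate, and SHAP-based ranking of input elements.
import numpy as np
import pandas as pd
import shap
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("20crmo_hardenability.csv")     # assumed: composition + heat treatment + J9/J15
X, y = df.drop(columns=["J9", "J15"]), df["J9"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("R2:", r2_score(y_te, pred))
print("within +/-2 HRC:", np.mean(np.abs(pred - y_te) <= 2.0))   # bandwidth hit-rate

# Rank elements (e.g. Si, Cr, C, Mo) by mean |SHAP| contribution to J9.
shap_values = shap.TreeExplainer(model).shap_values(X_te)
ranking = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns).sort_values(ascending=False)
print(ranking.head())
```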
Funding: Supported by the Natural Science Foundation of Jiangsu Province, China (BK20240937), the Belt and Road Special Foundation of the National Key Laboratory of Water Disaster Prevention (2022491411, 2021491811), and the Basal Research Fund of the Central Public Welfare Scientific Institution of Nanjing Hydraulic Research Institute (Y223006).
Abstract: Understanding spatial heterogeneity in groundwater responses to multiple factors is critical for water resource management in coastal cities. Daily groundwater depth (GWD) data from 43 wells (2018-2022) were collected in three coastal cities in Jiangsu Province, China. Seasonal and Trend decomposition using Loess (STL), together with wavelet analysis and empirical mode decomposition, was applied to identify tide-influenced wells, while the remaining wells were grouped by hierarchical clustering analysis (HCA). Machine learning models were developed to predict GWD, and their responses to natural conditions and human activities were assessed with the SHapley Additive exPlanations (SHAP) method. Results showed that eXtreme Gradient Boosting (XGB) was superior to the other models in terms of prediction performance and computational efficiency (R² > 0.95). GWD values in Yancheng and southern Lianyungang were greater than those in Nantong and exhibited larger fluctuations. Groundwater within 5 km of the coastline was affected by tides, with more pronounced effects in agricultural areas than in urban areas. Shallow groundwater (3-7 m depth) responded immediately (0-1 day) to rainfall and was primarily influenced by farmland and topography (slope and distance from rivers). Rainfall recharge to groundwater peaked at 50% farmland coverage, but this effect was suppressed by high temperatures (>30 °C), and the suppression intensified with distance from rivers, especially in forest and grassland. Deep groundwater (>10 m) showed delayed responses to rainfall (1-4 days) and temperature (10-15 days), with GDP as the primary influence, followed by agricultural irrigation and population density. Farmland helped to maintain stable GWD in regions with low population density, while excessive farmland coverage (>90%) led to overexploitation. In the early stages of GDP development, increased industrial and agricultural water demand led to GWD decline, but as GDP levels improved significantly, groundwater consumption pressure gradually eased. This methodological framework is applicable not only to coastal cities in China but could also be extended to coastal regions worldwide.
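As an illustration of the preprocessing half of this framework only, the sketch below runs an STL decomposition per well and then groups wells by Ward-linkage hierarchical clustering. The file layout, the annual period, and the cluster count are assumptions; the wavelet analysis and EMD used to flag tide-influenced wells, and the XGB/SHAP modelling, are not reproduced here.

```python
# Sketch under assumed data layout: per-well STL decomposition, then hierarchical
# clustering of the standardised daily groundwater-depth series (cf. HCA step).
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from statsmodels.tsa.seasonal import STL

gwd = pd.read_csv("gwd_daily.csv", index_col="date", parse_dates=True)  # columns = well IDs

# STL splits each series into trend / seasonal / residual components; here the
# strength of the (annual) seasonal component is kept as a simple descriptor.
seasonal_strength = {}
for well in gwd.columns:
    res = STL(gwd[well].interpolate(), period=365, robust=True).fit()
    seasonal_strength[well] = 1 - res.resid.var() / (res.resid + res.seasonal).var()

# Ward-linkage hierarchical clustering of the wells on their z-scored series.
z_scored = (gwd - gwd.mean()) / gwd.std()
labels = fcluster(linkage(z_scored.T.fillna(0), method="ward"), t=4, criterion="maxclust")
print(dict(zip(gwd.columns, labels)))
```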
Abstract: Credit card fraud remains a significant challenge, with financial losses and consumer protection at stake. This study addresses the need for practical, real-time fraud detection methodologies. Using a Kaggle credit card dataset, I tackle class imbalance with the Synthetic Minority Oversampling Technique (SMOTE) to enhance modeling efficiency. I compare several machine learning algorithms, including Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, Classification and Regression Tree, Naive Bayes, Support Vector Machine, Random Forest, XGBoost, and Light Gradient-Boosting Machine, to classify transactions as fraudulent or genuine. Rigorous evaluation metrics, such as AUC, PR-AUC, F1, KS, Recall, and Precision, identify the Random Forest as the best performer in detecting fraudulent activities. The Random Forest model successfully identifies approximately 92% of transactions scoring 90 and above as fraudulent, equating to a detection rate of over 70% for all fraudulent transactions in the test dataset. Moreover, the model captures more than half of the fraud in each bin of the test dataset. SHAP values provide model explainability, with the SHAP summary plot highlighting the global importance of individual features, such as "V12" and "V14". SHAP force plots offer local interpretability, revealing the impact of specific features on individual predictions. This study demonstrates the potential of machine learning, particularly the Random Forest model, for real-time credit card fraud detection, offering a promising approach to mitigating financial losses and protecting consumers.
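To make the pipeline concrete, here is a minimal sketch under the usual layout of the Kaggle creditcard.csv dataset (columns V1-V28, Amount, Class). The split ratio, forest size, and sample sizes are arbitrary choices for illustration, not the study's settings.

```python
# Illustrative sketch: SMOTE on the training split only, Random Forest scoring,
# ranking metrics, and global/local SHAP explanations.
import numpy as np
import pandas as pd
import shap
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")                         # assumed Kaggle layout
X, y = df.drop(columns=["Class"]), df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

# Oversample the minority (fraud) class in the training data only, to avoid leakage.
X_res, y_res = SMOTE(random_state=1).fit_resample(X_tr, y_tr)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=1).fit(X_res, y_res)
scores = rf.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, scores), "PR-AUC:", average_precision_score(y_te, scores))

# Global importance (summary plot) and a local force plot for one transaction.
explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X_te.iloc[:500])
fraud_sv = sv[1] if isinstance(sv, list) else (sv[:, :, 1] if np.ndim(sv) == 3 else sv)
shap.summary_plot(fraud_sv, X_te.iloc[:500])               # global view (e.g. V12, V14)

base = explainer.expected_value
base = base[1] if np.ndim(base) else base
shap.force_plot(base, fraud_sv[0], X_te.iloc[0], matplotlib=True)   # local explanation
```

Bucketing the scaled scores (e.g. 0-100) and tabulating the fraud rate per bucket reproduces the "score ≥ 90" style analysis described above.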