BACKGROUND To investigate the preoperative factors influencing textbook outcomes(TO)in Intrahepatic cholangiocarcinoma(ICC)patients and evaluate the feasibility of an interpretable machine learning model for preoperat...BACKGROUND To investigate the preoperative factors influencing textbook outcomes(TO)in Intrahepatic cholangiocarcinoma(ICC)patients and evaluate the feasibility of an interpretable machine learning model for preoperative prediction of TO,we developed a machine learning model for preoperative prediction of TO and used the SHapley Additive exPlanations(SHAP)technique to illustrate the prediction process.AIM To analyze the factors influencing textbook outcomes before surgery and to establish interpretable machine learning models for preoperative prediction.METHODS A total of 376 patients diagnosed with ICC were retrospectively collected from four major medical institutions in China,covering the period from 2011 to 2017.Logistic regression analysis was conducted to identify preoperative variables associated with achieving TO.Based on these variables,an EXtreme Gradient Boosting(XGBoost)machine learning prediction model was constructed using the XGBoost package.The SHAP(package:Shapviz)algorithm was employed to visualize each variable's contribution to the model's predictions.Kaplan-Meier survival analysis was performed to compare the prognostic differences between the TO-achieving and non-TO-achieving groups.RESULTS Among 376 patients,287 were included in the training group and 89 in the validation group.Logistic regression identified the following preoperative variables influencing TO:Child-Pugh classification,Eastern Cooperative Oncology Group(ECOG)score,hepatitis B,and tumor size.The XGBoost prediction model demonstrated high accuracy in internal validation(AUC=0.8825)and external validation(AUC=0.8346).Survival analysis revealed that the disease-free survival rates for patients achieving TO at 1,2,and 3 years were 64.2%,56.8%,and 43.4%,respectively.CONCLUSION Child-Pugh classification,ECOG score,hepatitis B,and tumor size are preoperative predictors of TO.In both the training group and the validation group,the machine learning model had certain effectiveness in predicting TO before surgery.The SHAP algorithm provided intuitive visualization of the machine learning prediction process,enhancing its interpretability.展开更多
The application of machine learning in alloy design is increasingly widespread,yet traditional models still face challenges when dealing with limited datasets and complex nonlinear relationships.This work proposes an ...The application of machine learning in alloy design is increasingly widespread,yet traditional models still face challenges when dealing with limited datasets and complex nonlinear relationships.This work proposes an interpretable machine learning method based on data augmentation and reconstruction,excavating high-performance low-alloyed magnesium(Mg)alloys.The data augmentation technique expands the original dataset through Gaussian noise.The data reconstruction method reorganizes and transforms the original data to extract more representative features,significantly improving the model's generalization ability and prediction accuracy,with a coefficient of determination(R^(2))of 95.9%for the ultimate tensile strength(UTS)model and a R^(2)of 95.3%for the elongation-to-failure(EL)model.The correlation coefficient assisted screening(CCAS)method is proposed to filter low-alloyed target alloys.A new Mg-2.2Mn-0.4Zn-0.2Al-0.2Ca(MZAX2000,wt%)alloy is designed and extruded into bar at given processing parameters,achieving room-temperature strength-ductility synergy showing an excellent UTS of 395 MPa and a high EL of 17.9%.This is closely related to its hetero-structured characteristic in the as-extruded MZAX2000 alloy consisting of coarse grains(16%),fine grains(75%),and fiber regions(9%).Therefore,this work offers new insights into optimizing alloy compositions and processing parameters for attaining new high strong and ductile low-alloyed Mg alloys.展开更多
The widespread adoption of tunnel boring machines(TBMs)has led to an increased focus on disc cutter wear,including both normal and abnormal types,for efficient and safe TBM excavation.However,abnormal wear has yet to ...The widespread adoption of tunnel boring machines(TBMs)has led to an increased focus on disc cutter wear,including both normal and abnormal types,for efficient and safe TBM excavation.However,abnormal wear has yet to be thoroughly investigated,primarily due to the complexity of considering mixed ground conditions and the imbalance in the number of instances between the two types of wear.This study developed a prediction model for abnormal TBM disc cutter wear,considering mixed ground conditions,by employing interpretable machine learning with data augmentation.An equivalent elastic modulus was used to consider the characteristics of mixed ground conditions,and wear data was obtained from 65 cutterhead intervention(CHI)reports covering both mixed ground and hard rock sections.With a balanced training dataset obtained by data augmentation,an extreme gradient boosting(XGB)model delivered acceptable results with an accuracy of 0.94,an F1-score of 0.808,and a recall of 0.8.In addition,the accuracy for each individual disc cutter exhibited low variability.When employing data augmentation,a significant improvement in recall was observed compared to when it was not used,although the difference in accuracy and F1-score was marginal.The subsequent model interpretation revealed the chamber pressure,cutter installation radius,and torque as significant contributors.Specifically,a threshold in chamber pressure was observed,which could induce abnormal wear.The study also explored how elevated values of these influential contributors correlate with abnormal wear.The proposed model offers a valuable tool for planning the replacement of abnormally worn disc cutters,enhancing the safety and efficiency of TBM operations.展开更多
Purpose:This study aims to integrate large language models(LLMs)with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud,addressing the limitat...Purpose:This study aims to integrate large language models(LLMs)with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud,addressing the limitations of traditional approaches in long-text semantic parsing,model interpretability,and multisource data fusion,thereby providing regulatory agencies with intelligent auditing tools.Design/methodology/approach:Analyzing 5,304 Chinese listed firms’annual reports(2015-2020)from the CSMAD database,this study leverages the Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors,developing textual semantic features.It integrates 19 financial indicators,11 governance metrics,and linguistic characteristics(tone,readability)with fraud prediction models optimized through a group of Gradient Boosted Decision Tree(GBDT)algorithms.SHAP value analysis in the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial,governance,and textual features on fraud likelihood.Findings:The study found that LLMs effectively distill lengthy annual reports into semantic summaries,while GBDT algorithms(AUC>0.850)outperform the traditional Logistic Regression model in fraud detection.Multimodal fusion improved performance by 7.4%,with financial,governance,and textual features providing complementary signals.SHAP analysis revealed financial distress,governance conflicts,and narrative patterns(e.g.,tone anchoring,semantic thresholds)as key fraud indicators,highlighting managerial intent in report language.Research limitations:This study identifies three key limitations:1)lack of interpretability for semantic features,2)absence of granular fraud-type differentiation,and 3)unexplored comparative validation with other deep learning methods.Future research will address these gaps to enhance fraud detection precision and model transparency.Practical implications:The developed semantic-enhanced evaluation model provides a quantitative tool for assessing listed companies’information disclosure quality and enables practical implementation through its derivative real-time monitoring system.This advancement significantly strengthens capital market risk early warning capabilities,offering actionable insights for securities regulation.Originality/value:This study presents three key innovations:1)A novel“chunking-summarizationembedding”framework for efficient semantic compression of lengthy annual reports(30,000 words);2)Demonstration of LLMs’superior performance in financial text analysis,outperforming traditional methods by 19.3%;3)A novel“language-psychology-behavior”triad model for analyzing managerial fraud motives.展开更多
AIM:To investigate the associations between urinary dialkyl phosphate(DAP)metabolites of organophosphorus pesticides(OPPs)exposure and age-related macular degeneration(AMD)risk.METHODS:Participants were drawn from the...AIM:To investigate the associations between urinary dialkyl phosphate(DAP)metabolites of organophosphorus pesticides(OPPs)exposure and age-related macular degeneration(AMD)risk.METHODS:Participants were drawn from the National Health and Nutrition Examination Survey(NHANES)between 2005 and 2008.Urinary DAP metabolites were used to construct a machine learning(ML)model for AMD prediction.Several interpretability pipelines,including permutation feature importance(PFI),partial dependence plot(PDP),and SHapley Additive exPlanations(SHAP)analyses were employed to analyze the influence from exposure features to prediction outcomes.RESULTS:A total of 1845 participants were included and 137 were diagnosed with AMD.Receiver operating characteristic curve(ROC)analysis evaluated Random Forests(RF)as the best ML model with its optimal predictive performance among eleven models.PFI and SHAP analyses illustrated that DAP metabolites were of significant contribution weights in AMD risk prediction,higher than most of the socio-demographic covariates.Shapley values and waterfall plots of randomly selected AMD individuals emphasized the predictive capacity of ML with high accuracy and sensitivity in each case.The relationships and interactions visualized by graphical plots and supported by statistical measures demonstrated the indispensable impacts from six DAP metabolites to the prediction of AMD risk.CONCLUSION:Urinary DAP metabolites of OPPs exposure are associated with AMD risk and ML algorithms show the excellent generalizability and differentiability in the course of AMD risk prediction.展开更多
The optimization of reaction processes is crucial for the green, efficient, and sustainable development of the chemical industry. However, how to address the problems posed by multiple variables, nonlinearities, and u...The optimization of reaction processes is crucial for the green, efficient, and sustainable development of the chemical industry. However, how to address the problems posed by multiple variables, nonlinearities, and uncertainties during optimization remains a formidable challenge. In this study, a strategy combining interpretable machine learning with metaheuristic optimization algorithms is employed to optimize the reaction process. First, experimental data from a biodiesel production process are collected to establish a database. These data are then used to construct a predictive model based on artificial neural network (ANN) models. Subsequently, interpretable machine learning techniques are applied for quantitative analysis and verification of the model. Finally, four metaheuristic optimization algorithms are coupled with the ANN model to achieve the desired optimization. The research results show that the methanol: palm fatty acid distillate (PFAD) molar ratio contributes the most to the reaction outcome, accounting for 41%. The ANN-simulated annealing (SA) hybrid method is more suitable for this optimization, and the optimal process parameters are a catalyst concentration of 3.00% (mass), a methanol: PFAD molar ratio of 8.67, and a reaction time of 30 min. This study provides deeper insights into reaction process optimization, which will facilitate future applications in various reaction optimization processes.展开更多
An algorithm named InterOpt for optimizing operational parameters is proposed based on interpretable machine learning,and is demonstrated via optimization of shale gas development.InterOpt consists of three parts:a ne...An algorithm named InterOpt for optimizing operational parameters is proposed based on interpretable machine learning,and is demonstrated via optimization of shale gas development.InterOpt consists of three parts:a neural network is used to construct an emulator of the actual drilling and hydraulic fracturing process in the vector space(i.e.,virtual environment);:the Sharpley value method in inter-pretable machine learning is applied to analyzing the impact of geological and operational parameters in each well(i.e.,single well feature impact analysis):and ensemble randomized maximum likelihood(EnRML)is conducted to optimize the operational parameters to comprehensively improve the efficiency of shale gas development and reduce the average cost.In the experiment,InterOpt provides different drilling and fracturing plans for each well according to its specific geological conditions,and finally achieves an average cost reduction of 9.7%for a case study with 104 wells.展开更多
Electrocatalytic nitrogen reduction to ammonia has garnered significant attention with the blooming of single-atom catalysts(SACs),showcasing their potential for sustainable and energy-efficient ammonia production.How...Electrocatalytic nitrogen reduction to ammonia has garnered significant attention with the blooming of single-atom catalysts(SACs),showcasing their potential for sustainable and energy-efficient ammonia production.However,cost-effectively designing and screening efficient electrocatalysts remains a challenge.In this study,we have successfully established interpretable machine learning(ML)models to evaluate the catalytic activity of SACs by directly and accurately predicting reaction Gibbs free energy.Our models were trained using non-density functional theory(DFT)calculated features from a dataset comprising 90 graphene-supported SACs.Our results underscore the superior prediction accuracy of the gradient boosting regression(GBR)model for bothΔg(N_(2)→NNH)andΔG(NH_(2)→NH_(3)),boasting coefficient of determination(R^(2))score of 0.972 and 0.984,along with root mean square error(RMSE)of 0.051 and 0.085 eV,respectively.Moreover,feature importance analysis elucidates that the high accuracy of GBR model stems from its adept capture of characteristics pertinent to the active center and coordination environment,unveilling the significance of elementary descriptors,with the colvalent radius playing a dominant role.Additionally,Shapley additive explanations(SHAP)analysis provides global and local interpretation of the working mechanism of the GBR model.Our analysis identifies that a pyrrole-type coordination(flag=0),d-orbitals with a moderate occupation(N_(d)=5),and a moderate difference in covalent radius(r_(TM-ave)near 140 pm)are conducive to achieving high activity.Furthermore,we extend the prediction of activity to more catalysts without additional DFT calculations,validating the reliability of our feature engineering,model training,and design strategy.These findings not only highlight new opportunity for accelerating catalyst design using non-DFT calculated features,but also shed light on the working mechanism of"black box"ML model.Moreover,the model provides valuable guidance for catalytic material design in multiple proton-electron coupling reactions,particularly in driving sustainable CO_(2),O_(2),and N_(2) conversion.展开更多
Thermoelectric and thermal materials are essential in achieving carbon neutrality. However, the high cost of lattice thermal conductivity calculations and the limited applicability of classical physical models have le...Thermoelectric and thermal materials are essential in achieving carbon neutrality. However, the high cost of lattice thermal conductivity calculations and the limited applicability of classical physical models have led to the inefficient development of thermoelectric materials. In this study, we proposed a two-stage machine learning framework with physical interpretability incorporating domain knowledge to calculate high/low thermal conductivity rapidly. Specifically, crystal graph convolutional neural network(CGCNN) is constructed to predict the fundamental physical parameters related to lattice thermal conductivity. Based on the above physical parameters, an interpretable machine learning model–sure independence screening and sparsifying operator(SISSO), is trained to predict the lattice thermal conductivity. We have predicted the lattice thermal conductivity of all available materials in the open quantum materials database(OQMD)(https://www.oqmd.org/). The proposed approach guides the next step of searching for materials with ultra-high or ultralow lattice thermal conductivity and promotes the development of new thermal insulation materials and thermoelectric materials.展开更多
Understanding the relationship between attribute performance(AP)and customer satisfaction(CS)is crucial for the hospitality industry.However,accurately modeling this relationship remains challenging.To address this is...Understanding the relationship between attribute performance(AP)and customer satisfaction(CS)is crucial for the hospitality industry.However,accurately modeling this relationship remains challenging.To address this issue,we propose an interpretable machine learning-based dynamic asymmetric analysis(IML-DAA)approach that leverages interpretable machine learning(IML)to improve traditional relationship analysis methods.The IML-DAA employs extreme gradient boosting(XGBoost)and SHapley Additive exPlanations(SHAP)to construct relationships and explain the significance of each attribute.Following this,an improved version of penalty-reward contrast analysis(PRCA)is used to classify attributes,whereas asymmetric impact-performance analysis(AIPA)is employed to determine the attribute improvement priority order.A total of 29,724 user ratings in New York City collected from TripAdvisor were investigated.The results suggest that IML-DAA can effectively capture non-linear relationships and that there is a dynamic asymmetric effect between AP and CS,as identified by the dynamic AIPA model.This study enhances our understanding of the relationship between AP and CS and contributes to the literature on the hotel service industry.展开更多
Most of the existing machine learning studies in logs interpretation do not consider the data distribution discrepancy issue,so the trained model cannot well generalize to the unseen data without calibrating the logs....Most of the existing machine learning studies in logs interpretation do not consider the data distribution discrepancy issue,so the trained model cannot well generalize to the unseen data without calibrating the logs.In this paper,we formulated the geophysical logs calibration problem and give its statistical explanation,and then exhibited an interpretable machine learning method,i.e.,Unilateral Alignment,which could align the logs from one well to another without losing the physical meanings.The involved UA method is an unsupervised feature domain adaptation method,so it does not rely on any labels from cores.The experiments in 3 wells and 6 tasks showed the effectiveness and interpretability from multiple views.展开更多
Hydraulic fracturing stimulation technology is essential in the oil and gas industry.However,current techniques for predicting rock fracture pressure in hydraulic fracturing face significant challenges in precision an...Hydraulic fracturing stimulation technology is essential in the oil and gas industry.However,current techniques for predicting rock fracture pressure in hydraulic fracturing face significant challenges in precision and reliability.Traditional approaches often result in inadequate accuracy due to the complex and diverse nature of underground formations.However,recent advances in computational power and optimization techniques have enabled the application of machine learning in mining operations,resulting in improved prediction and feedback.In this study,various machine learning techniques are employed to predict hydraulic fracturing pressure based on the concept of mechanical specific energy.Additionally,the study interprets the models through feature importance analysis.Thefindings suggest that most machine learning models deliver highly accurate predictions.Feature importance analysis indicates that for an approximate assessment of fracture pressure,the characteristics of well depth and torque are sufficient.For more precise predictions,incorporating additional characteristics from the mechanical specific energy framework into the machine learning model is essential.The study emphasizes the feasibility of employing machine learning methods to predict fracture pressure and their usefulness in determining optimal engineering sites.展开更多
BACKGROUND The early diagnosis rate of pancreatic ductal adenocarcinoma(PDAC)is low and the prognosis is poor.It is important to develop an interpretable noninvasive early diagnostic model in clinical practice.AIM To ...BACKGROUND The early diagnosis rate of pancreatic ductal adenocarcinoma(PDAC)is low and the prognosis is poor.It is important to develop an interpretable noninvasive early diagnostic model in clinical practice.AIM To develop an interpretable noninvasive early diagnostic model for PDAC using plasma extracellular vesicle long RNA(EvlRNA).METHODS The diagnostic model was constructed based on plasma EvlRNA data.During the process of establishing the model,EvlRNA-index was introduced,and four algorithms were adopted to calculate EvlRNA-index.After the model was successfully constructed,performance evaluation was conducted.A series of bioinformatics methods were adopted to explore the potential mechanism of EvlRNA-index as the input feature of the model.And the relationship between key characteristics and PDAC were explored at the single-cell level.RESULTS A novel interpretable machine learning framework was developed based on plasma EvlRNA.In this framework,a two-layer classifier was established.A new concept was proposed:EvlRNA-index.Based on EvlRNA-index,a cancer diagnostic model was established,and a good diagnostic effect was achieved.The accuracy of PDACandCPvsHealth-Probabilistic PCA Index-SVM(PDAC and chronic pancreatitis vs health-probabilistic principal component analysis index-support vector machine)(1-18)was 91.51%,with Mathew’s correlation coefficient 0.7760 and area under the curve 0.9560.In the second layer of the model,the accuracy of PDACvsCP-Probabilistic PCA Index-RF(PDAC vs chronic pancreatitis-probabilistic principal component analysis index-random forest)(2-17)was 93.83%,with Mathew’s correlation coefficient 0.8422 and area under the curve 0.9698.Forty-nine PDAC-related genes were identified,among which 16 were known,inferring that the remaining ones were also PDAC-related genes.CONCLUSION An interpretable two-layer machine learning framework was proposed for early diagnosis and prediction of PDAC based on plasma EvlRNA,providing new insights into the clinical value of EvlRNA.展开更多
Forecasting landslide deformation is challenging due to influence of various internal and external factors on the occurrence of systemic and localized heterogeneities.Despite the potential to improve landslide predict...Forecasting landslide deformation is challenging due to influence of various internal and external factors on the occurrence of systemic and localized heterogeneities.Despite the potential to improve landslide predictability,deep learning has yet to be sufficiently explored for complex deformation patterns associated with landslides and is inherently opaque.Herein,we developed a holistic landslide deformation forecasting method that considers spatiotemporal correlations of landslide deformation by integrating domain knowledge into interpretable deep learning.By spatially capturing the interconnections between multiple deformations from different observation points,our method contributes to the understanding and forecasting of landslide systematic behavior.By integrating specific domain knowledge relevant to each observation point and merging internal properties with external variables,the local heterogeneity is considered in our method,identifying deformation temporal patterns in different landslide zones.Case studies involving reservoir-induced landslides and creeping landslides demonstrated that our approach(1)enhances the accuracy of landslide deformation forecasting,(2)identifies significant contributing factors and their influence on spatiotemporal deformation characteristics,and(3)demonstrates how identifying these factors and patterns facilitates landslide forecasting.Our research offers a promising and pragmatic pathway toward a deeper understanding and forecasting of complex landslide behaviors.展开更多
The present study extracts human-understandable insights from machine learning(ML)-based mesoscale closure in fluid-particle flows via several novel data-driven analysis approaches,i.e.,maximal information coefficient...The present study extracts human-understandable insights from machine learning(ML)-based mesoscale closure in fluid-particle flows via several novel data-driven analysis approaches,i.e.,maximal information coefficient(MIC),interpretable ML,and automated ML.It is previously shown that the solidvolume fraction has the greatest effect on the drag force.The present study aims to quantitativelyinvestigate the influence of flow properties on mesoscale drag correction(H_(d)).The MIC results showstrong correlations between the features(i.e.,slip velocity(u^(*)_(sy))and particle volume fraction(εs))and thelabel H_(d).The interpretable ML analysis confirms this conclusion,and quantifies the contribution of u^(*)_(sy),εs and gas pressure gradient to the model as 71.9%,27.2%and 0.9%,respectively.Automated ML without theneed to select the model structure and hyperparameters is used for modeling,improving the predictionaccuracy over our previous model(Zhu et al.,2020;Ouyang,Zhu,Su,&Luo,2021).展开更多
Artificial intelligence and machine learning have been increasingly applied for prediction in agricultural science.However,many models are typically black boxes,meaning we cannot explain what the models learned from t...Artificial intelligence and machine learning have been increasingly applied for prediction in agricultural science.However,many models are typically black boxes,meaning we cannot explain what the models learned from the data and the reasons behind predictions.To address this issue,I introduce an emerging subdomain of artificial intelligence,explainable artificial intelligence(XAI),and associated toolkits,interpretable machine learning.This study demonstrates the usefulness of several methods by applying them to an openly available dataset.The dataset includes the no-tillage effect on crop yield relative to conventional tillage and soil,climate,and management variables.Data analysis discovered that no-tillage management can increase maize crop yield where yield in conventional tillage is<5000 kg/ha and the maximum temperature is higher than 32°.These methods are useful to answer(i)which variables are important for prediction in regression/classification,(ii)which variable interactions are important for prediction,(iii)how important variables and their interactions are associated with the response variable,(iv)what are the reasons underlying a predicted value for a certain instance,and(v)whether different machine learning algorithms offer the same answer to these questions.I argue that the goodness of model fit is overly evaluated with model performance measures in the current practice,while these questions are unanswered.XAI and interpretable machine learning can enhance trust and explainability in AI.展开更多
We examine how machine learning models predict stock returns in the Korean market.By analyzing various firm characteristics and macroeconomic variables,we find that tree-based models outperform other machine learning ...We examine how machine learning models predict stock returns in the Korean market.By analyzing various firm characteristics and macroeconomic variables,we find that tree-based models outperform other machine learning approaches.This finding suggests that,in data-constrained contexts,moderately complex models outperform advanced methods that require extensive datasets.Using PFI,SHAP,and LIME,we consistently identify the 36-month momentum as the key predictor.PDP,ICE,and ALE analyses reveal threshold effects of 36-month momentum that diminish at higher return levels.Our findings underscore the value of ensemble-based methods in settings characterized by short data histories and heightened volatility.This study illustrates how multimethod interpretability can yield deeper economic insights,ultimately guiding more effective investment strategies and policy decisions.展开更多
CONSPECTUS:Finding catalytic materials with optimal properties for sustainable chemical and energy transformations is one of the pressing challenges facing our society today.Traditionally,the discovery of catalysts or...CONSPECTUS:Finding catalytic materials with optimal properties for sustainable chemical and energy transformations is one of the pressing challenges facing our society today.Traditionally,the discovery of catalysts or the philosopher’s stone of alchemists relies on a trial-and-error approach with physicochemical intuition.Decades-long advances in science and engineering,particularly in quantum chemistry and computing infrastructures,popularize a paradigm of computational science for materials discovery.However,the brute-force search through a vast chemical space is hampered by its formidable cost.In recent years,machine learning(ML)has emerged as a promising approach to streamline the design of active sites by learning from data.As ML is increasingly employed to make predictions in practical settings,the demand for domain interpretability is surging.Therefore,it is of great importance to provide an in-depth review of our efforts in tackling this challenging issue in computational heterogeneous catalysis.In this Account,we present an interpretable ML framework for accelerating catalytic materials design,particularly in driving sustainable carbon,nitrogen,and oxygen cycles.By leveraging the linear adsorption-energy scaling and Bronsted−Evans−Polanyi(BEP)relationships,catalytic outcomes(i.e.,activity,selectivity,and stability)of a multistep reaction can often be mapped onto one or two kinetics-informed descriptors.One type of descriptor of great importance is the adsorption energies of representative species at active site motifs that can be computed from quantum-chemical simulations.To complement such a descriptor-based design strategy,we delineate our endeavors in incorporating domain knowledge into a datadriven ML workflow.We demonstrate that the major drawbacks of black-box ML algorithms,e.g.,poor explainability,can be largely circumvented by employing(1)physics-inspired feature engineering,(2)Bayesian statistical learning,and(3)theory-infused deep neural networks.The framework drastically facilitates the design of heterogeneous metal-based catalysts,some of which have been experimentally verified for an array of sustainable chemistries.We offer some remarks on the existing challenges,opportunities,and future directions of interpretable ML in predicting catalytic materials and,more importantly,on advancing catalysis theory beyond conventional wisdom.We envision that this Account will attract more researchers’attention to develop highly accurate,easily explainable,and trustworthy materials design strategies,facilitating the transition to the data science paradigm for sustainability through catalysis.展开更多
The identification of factors that may be forcing ecological observations to approach the upper boundary provides insight into potential mechanisms affecting driver-response relationships,and can help inform ecosystem...The identification of factors that may be forcing ecological observations to approach the upper boundary provides insight into potential mechanisms affecting driver-response relationships,and can help inform ecosystem management,but has rarely been explored.In this study,we propose a novel framework integrating quantile regression with interpretable machine learning.In the first stage of the framework,we estimate the upper boundary of a driver-response relationship using quantile regression.Next,we calculate“potentials”of the response variable depending on the driver,which are defined as vertical distances from the estimated upper boundary of the relationship to observations in the driver-response variable scatter plot.Finally,we identify key factors impacting the potential using a machine learning model.We illustrate the necessary steps to implement the framework using the total phosphorus(TP)-Chlorophyll a(CHL)relationship in lakes across the continental US.We found that the nitrogen to phosphorus ratio(N:P),annual average precipitation,total nitrogen(TN),and summer average air temperature were key factors impacting the potential of CHL depending on TP.We further revealed important implications of our findings for lake eutrophication management.The important role of N:P and TN on the potential highlights the co-limitation of phosphorus and nitrogen and indicates the need for dual nutrient criteria.Future wetter and/or warmer climate scenarios can decrease the potential which may reduce the efficacy of lake eutrophication management.The novel framework advances the application of quantile regression to identify factors driving observations to approach the upper boundary of driver-response relationships.展开更多
Machine learning(ML)techniques have made enormous progress in the field of materials science.However,many conventional ML algorithms operate as“blackboxes”,lacking transparency in revealing explicit relationships be...Machine learning(ML)techniques have made enormous progress in the field of materials science.However,many conventional ML algorithms operate as“blackboxes”,lacking transparency in revealing explicit relationships between material features and target properties.To address this,the development of interpretable ML models is essential to drive further advancements in AI-driven materials discovery.In this study,we present an interpretable framework that combines traditional machine learning with symbolic regression,using Janus III–VI vdW heterostructures as a case study.This approach enables fast and accurate predictions of stability and electronic structure.Our results demonstrate that the prediction accuracy using the classification model for stability,based on formation energy,reaches 0.960.On the other hand,the R2,MAE,and RMSE value using the regression model for electronic structure prediction,based on band gap,achieves 0.927,0.113,and 0.141 on the testing set,respectively.Additionally,we identify a universal interpretable descriptor comprising five simple parameters that reveals the underlying physical relationships between the candidate heterostructures and their band gaps.This descriptor not only delivers high accuracy in band gap prediction but also provides explicit physical insight into the material properties.展开更多
基金Supported by National Key Research and Development Program,No.2022YFC2407304Major Research Project for Middle-Aged and Young Scientists of Fujian Provincial Health Commission,No.2021ZQNZD013+2 种基金The National Natural Science Foundation of China,No.62275050Fujian Province Science and Technology Innovation Joint Fund Project,No.2019Y9108Major Science and Technology Projects of Fujian Province,No.2021YZ036017.
文摘BACKGROUND To investigate the preoperative factors influencing textbook outcomes(TO)in Intrahepatic cholangiocarcinoma(ICC)patients and evaluate the feasibility of an interpretable machine learning model for preoperative prediction of TO,we developed a machine learning model for preoperative prediction of TO and used the SHapley Additive exPlanations(SHAP)technique to illustrate the prediction process.AIM To analyze the factors influencing textbook outcomes before surgery and to establish interpretable machine learning models for preoperative prediction.METHODS A total of 376 patients diagnosed with ICC were retrospectively collected from four major medical institutions in China,covering the period from 2011 to 2017.Logistic regression analysis was conducted to identify preoperative variables associated with achieving TO.Based on these variables,an EXtreme Gradient Boosting(XGBoost)machine learning prediction model was constructed using the XGBoost package.The SHAP(package:Shapviz)algorithm was employed to visualize each variable's contribution to the model's predictions.Kaplan-Meier survival analysis was performed to compare the prognostic differences between the TO-achieving and non-TO-achieving groups.RESULTS Among 376 patients,287 were included in the training group and 89 in the validation group.Logistic regression identified the following preoperative variables influencing TO:Child-Pugh classification,Eastern Cooperative Oncology Group(ECOG)score,hepatitis B,and tumor size.The XGBoost prediction model demonstrated high accuracy in internal validation(AUC=0.8825)and external validation(AUC=0.8346).Survival analysis revealed that the disease-free survival rates for patients achieving TO at 1,2,and 3 years were 64.2%,56.8%,and 43.4%,respectively.CONCLUSION Child-Pugh classification,ECOG score,hepatitis B,and tumor size are preoperative predictors of TO.In both the training group and the validation group,the machine learning model had certain effectiveness in predicting TO before surgery.The SHAP algorithm provided intuitive visualization of the machine learning prediction process,enhancing its interpretability.
基金funded by the National Natural Science Foundation of China(No.52204407)the Natural Science Foundation of Jiangsu Province(No.BK20220595)+1 种基金the China Postdoctoral Science Foundation(No.2022M723689)the Industrial Collaborative Innovation Project of Shanghai(No.XTCX-KJ-2022-2-11)。
文摘The application of machine learning in alloy design is increasingly widespread,yet traditional models still face challenges when dealing with limited datasets and complex nonlinear relationships.This work proposes an interpretable machine learning method based on data augmentation and reconstruction,excavating high-performance low-alloyed magnesium(Mg)alloys.The data augmentation technique expands the original dataset through Gaussian noise.The data reconstruction method reorganizes and transforms the original data to extract more representative features,significantly improving the model's generalization ability and prediction accuracy,with a coefficient of determination(R^(2))of 95.9%for the ultimate tensile strength(UTS)model and a R^(2)of 95.3%for the elongation-to-failure(EL)model.The correlation coefficient assisted screening(CCAS)method is proposed to filter low-alloyed target alloys.A new Mg-2.2Mn-0.4Zn-0.2Al-0.2Ca(MZAX2000,wt%)alloy is designed and extruded into bar at given processing parameters,achieving room-temperature strength-ductility synergy showing an excellent UTS of 395 MPa and a high EL of 17.9%.This is closely related to its hetero-structured characteristic in the as-extruded MZAX2000 alloy consisting of coarse grains(16%),fine grains(75%),and fiber regions(9%).Therefore,this work offers new insights into optimizing alloy compositions and processing parameters for attaining new high strong and ductile low-alloyed Mg alloys.
基金support of the“National R&D Project for Smart Construction Technology (Grant No.RS-2020-KA157074)”funded by the Korea Agency for Infrastructure Technology Advancement under the Ministry of Land,Infrastructure and Transport,and managed by the Korea Expressway Corporation.
文摘The widespread adoption of tunnel boring machines(TBMs)has led to an increased focus on disc cutter wear,including both normal and abnormal types,for efficient and safe TBM excavation.However,abnormal wear has yet to be thoroughly investigated,primarily due to the complexity of considering mixed ground conditions and the imbalance in the number of instances between the two types of wear.This study developed a prediction model for abnormal TBM disc cutter wear,considering mixed ground conditions,by employing interpretable machine learning with data augmentation.An equivalent elastic modulus was used to consider the characteristics of mixed ground conditions,and wear data was obtained from 65 cutterhead intervention(CHI)reports covering both mixed ground and hard rock sections.With a balanced training dataset obtained by data augmentation,an extreme gradient boosting(XGB)model delivered acceptable results with an accuracy of 0.94,an F1-score of 0.808,and a recall of 0.8.In addition,the accuracy for each individual disc cutter exhibited low variability.When employing data augmentation,a significant improvement in recall was observed compared to when it was not used,although the difference in accuracy and F1-score was marginal.The subsequent model interpretation revealed the chamber pressure,cutter installation radius,and torque as significant contributors.Specifically,a threshold in chamber pressure was observed,which could induce abnormal wear.The study also explored how elevated values of these influential contributors correlate with abnormal wear.The proposed model offers a valuable tool for planning the replacement of abnormally worn disc cutters,enhancing the safety and efficiency of TBM operations.
基金supported by the 2021 Guangdong Province(China)Science and Technology Plan Project“Research and Application of Key Technologies for Multi-level Knowledge Retrieval Based on Big Data Intelligence”(Project No.2021B0101420004)the 2022 commissioned project“Cross-border E-commerce Taxation and Related Research”from the State Taxation Administration Guangdong Provincial Taxation Bureau,China.
文摘Purpose:This study aims to integrate large language models(LLMs)with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud,addressing the limitations of traditional approaches in long-text semantic parsing,model interpretability,and multisource data fusion,thereby providing regulatory agencies with intelligent auditing tools.Design/methodology/approach:Analyzing 5,304 Chinese listed firms’annual reports(2015-2020)from the CSMAD database,this study leverages the Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors,developing textual semantic features.It integrates 19 financial indicators,11 governance metrics,and linguistic characteristics(tone,readability)with fraud prediction models optimized through a group of Gradient Boosted Decision Tree(GBDT)algorithms.SHAP value analysis in the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial,governance,and textual features on fraud likelihood.Findings:The study found that LLMs effectively distill lengthy annual reports into semantic summaries,while GBDT algorithms(AUC>0.850)outperform the traditional Logistic Regression model in fraud detection.Multimodal fusion improved performance by 7.4%,with financial,governance,and textual features providing complementary signals.SHAP analysis revealed financial distress,governance conflicts,and narrative patterns(e.g.,tone anchoring,semantic thresholds)as key fraud indicators,highlighting managerial intent in report language.Research limitations:This study identifies three key limitations:1)lack of interpretability for semantic features,2)absence of granular fraud-type differentiation,and 3)unexplored comparative validation with other deep learning methods.Future research will address these gaps to enhance fraud detection precision and model transparency.Practical implications:The developed semantic-enhanced evaluation model provides a quantitative tool for assessing listed companies’information disclosure quality and enables practical implementation through its derivative real-time monitoring system.This advancement significantly strengthens capital market risk early warning capabilities,offering actionable insights for securities regulation.Originality/value:This study presents three key innovations:1)A novel“chunking-summarizationembedding”framework for efficient semantic compression of lengthy annual reports(30,000 words);2)Demonstration of LLMs’superior performance in financial text analysis,outperforming traditional methods by 19.3%;3)A novel“language-psychology-behavior”triad model for analyzing managerial fraud motives.
基金Supported by the National Key Research and Development Program of China(No.2022YFC2502800)the National Natural Science Foundation of China(No.82171076)the Shanghai Municipal Education Commission(No.2023KJ05-67).
文摘AIM:To investigate the associations between urinary dialkyl phosphate(DAP)metabolites of organophosphorus pesticides(OPPs)exposure and age-related macular degeneration(AMD)risk.METHODS:Participants were drawn from the National Health and Nutrition Examination Survey(NHANES)between 2005 and 2008.Urinary DAP metabolites were used to construct a machine learning(ML)model for AMD prediction.Several interpretability pipelines,including permutation feature importance(PFI),partial dependence plot(PDP),and SHapley Additive exPlanations(SHAP)analyses were employed to analyze the influence from exposure features to prediction outcomes.RESULTS:A total of 1845 participants were included and 137 were diagnosed with AMD.Receiver operating characteristic curve(ROC)analysis evaluated Random Forests(RF)as the best ML model with its optimal predictive performance among eleven models.PFI and SHAP analyses illustrated that DAP metabolites were of significant contribution weights in AMD risk prediction,higher than most of the socio-demographic covariates.Shapley values and waterfall plots of randomly selected AMD individuals emphasized the predictive capacity of ML with high accuracy and sensitivity in each case.The relationships and interactions visualized by graphical plots and supported by statistical measures demonstrated the indispensable impacts from six DAP metabolites to the prediction of AMD risk.CONCLUSION:Urinary DAP metabolites of OPPs exposure are associated with AMD risk and ML algorithms show the excellent generalizability and differentiability in the course of AMD risk prediction.
基金supported by the National Natural Science Foundation of China(22408227,22238005)the Postdoctoral Research Foundation of China(GZC20231576).
文摘The optimization of reaction processes is crucial for the green, efficient, and sustainable development of the chemical industry. However, how to address the problems posed by multiple variables, nonlinearities, and uncertainties during optimization remains a formidable challenge. In this study, a strategy combining interpretable machine learning with metaheuristic optimization algorithms is employed to optimize the reaction process. First, experimental data from a biodiesel production process are collected to establish a database. These data are then used to construct a predictive model based on artificial neural network (ANN) models. Subsequently, interpretable machine learning techniques are applied for quantitative analysis and verification of the model. Finally, four metaheuristic optimization algorithms are coupled with the ANN model to achieve the desired optimization. The research results show that the methanol: palm fatty acid distillate (PFAD) molar ratio contributes the most to the reaction outcome, accounting for 41%. The ANN-simulated annealing (SA) hybrid method is more suitable for this optimization, and the optimal process parameters are a catalyst concentration of 3.00% (mass), a methanol: PFAD molar ratio of 8.67, and a reaction time of 30 min. This study provides deeper insights into reaction process optimization, which will facilitate future applications in various reaction optimization processes.
文摘An algorithm named InterOpt for optimizing operational parameters is proposed based on interpretable machine learning,and is demonstrated via optimization of shale gas development.InterOpt consists of three parts:a neural network is used to construct an emulator of the actual drilling and hydraulic fracturing process in the vector space(i.e.,virtual environment);:the Sharpley value method in inter-pretable machine learning is applied to analyzing the impact of geological and operational parameters in each well(i.e.,single well feature impact analysis):and ensemble randomized maximum likelihood(EnRML)is conducted to optimize the operational parameters to comprehensively improve the efficiency of shale gas development and reduce the average cost.In the experiment,InterOpt provides different drilling and fracturing plans for each well according to its specific geological conditions,and finally achieves an average cost reduction of 9.7%for a case study with 104 wells.
基金supported by the Research Grants Council of Hong Kong (City U 11305919 and 11308620)the NSFC/RGC Joint Research Scheme N_City U104/19The Hong Kong Research Grant Council Collaborative Research Fund:C1002-21G and C1017-22G。
文摘Electrocatalytic nitrogen reduction to ammonia has garnered significant attention with the blooming of single-atom catalysts(SACs),showcasing their potential for sustainable and energy-efficient ammonia production.However,cost-effectively designing and screening efficient electrocatalysts remains a challenge.In this study,we have successfully established interpretable machine learning(ML)models to evaluate the catalytic activity of SACs by directly and accurately predicting reaction Gibbs free energy.Our models were trained using non-density functional theory(DFT)calculated features from a dataset comprising 90 graphene-supported SACs.Our results underscore the superior prediction accuracy of the gradient boosting regression(GBR)model for bothΔg(N_(2)→NNH)andΔG(NH_(2)→NH_(3)),boasting coefficient of determination(R^(2))score of 0.972 and 0.984,along with root mean square error(RMSE)of 0.051 and 0.085 eV,respectively.Moreover,feature importance analysis elucidates that the high accuracy of GBR model stems from its adept capture of characteristics pertinent to the active center and coordination environment,unveilling the significance of elementary descriptors,with the colvalent radius playing a dominant role.Additionally,Shapley additive explanations(SHAP)analysis provides global and local interpretation of the working mechanism of the GBR model.Our analysis identifies that a pyrrole-type coordination(flag=0),d-orbitals with a moderate occupation(N_(d)=5),and a moderate difference in covalent radius(r_(TM-ave)near 140 pm)are conducive to achieving high activity.Furthermore,we extend the prediction of activity to more catalysts without additional DFT calculations,validating the reliability of our feature engineering,model training,and design strategy.These findings not only highlight new opportunity for accelerating catalyst design using non-DFT calculated features,but also shed light on the working mechanism of"black box"ML model.Moreover,the model provides valuable guidance for catalytic material design in multiple proton-electron coupling reactions,particularly in driving sustainable CO_(2),O_(2),and N_(2) conversion.
基金support of the National Natural Science Foundation of China(Grant Nos.12104356 and52250191)China Postdoctoral Science Foundation(Grant No.2022M712552)+2 种基金the Opening Project of Shanghai Key Laboratory of Special Artificial Microstructure Materials and Technology(Grant No.Ammt2022B-1)the Fundamental Research Funds for the Central Universitiessupport by HPC Platform,Xi’an Jiaotong University。
文摘Thermoelectric and thermal materials are essential in achieving carbon neutrality. However, the high cost of lattice thermal conductivity calculations and the limited applicability of classical physical models have led to the inefficient development of thermoelectric materials. In this study, we proposed a two-stage machine learning framework with physical interpretability incorporating domain knowledge to calculate high/low thermal conductivity rapidly. Specifically, crystal graph convolutional neural network(CGCNN) is constructed to predict the fundamental physical parameters related to lattice thermal conductivity. Based on the above physical parameters, an interpretable machine learning model–sure independence screening and sparsifying operator(SISSO), is trained to predict the lattice thermal conductivity. We have predicted the lattice thermal conductivity of all available materials in the open quantum materials database(OQMD)(https://www.oqmd.org/). The proposed approach guides the next step of searching for materials with ultra-high or ultralow lattice thermal conductivity and promotes the development of new thermal insulation materials and thermoelectric materials.
基金National Key R&D Program of China(Grant No.:2022YFF0903000)National Natural Science Foundation of China(Grant Nos.:72101197 and 71988101).
文摘Understanding the relationship between attribute performance(AP)and customer satisfaction(CS)is crucial for the hospitality industry.However,accurately modeling this relationship remains challenging.To address this issue,we propose an interpretable machine learning-based dynamic asymmetric analysis(IML-DAA)approach that leverages interpretable machine learning(IML)to improve traditional relationship analysis methods.The IML-DAA employs extreme gradient boosting(XGBoost)and SHapley Additive exPlanations(SHAP)to construct relationships and explain the significance of each attribute.Following this,an improved version of penalty-reward contrast analysis(PRCA)is used to classify attributes,whereas asymmetric impact-performance analysis(AIPA)is employed to determine the attribute improvement priority order.A total of 29,724 user ratings in New York City collected from TripAdvisor were investigated.The results suggest that IML-DAA can effectively capture non-linear relationships and that there is a dynamic asymmetric effect between AP and CS,as identified by the dynamic AIPA model.This study enhances our understanding of the relationship between AP and CS and contributes to the literature on the hotel service industry.
基金Supported in part by the National Natural Science Foundation of China under Grant 61903353in part by the SINOPEC Programmes for Science and Technology Development under Grant PE19008-8.
文摘Most of the existing machine learning studies in logs interpretation do not consider the data distribution discrepancy issue,so the trained model cannot well generalize to the unseen data without calibrating the logs.In this paper,we formulated the geophysical logs calibration problem and give its statistical explanation,and then exhibited an interpretable machine learning method,i.e.,Unilateral Alignment,which could align the logs from one well to another without losing the physical meanings.The involved UA method is an unsupervised feature domain adaptation method,so it does not rely on any labels from cores.The experiments in 3 wells and 6 tasks showed the effectiveness and interpretability from multiple views.
文摘Hydraulic fracturing stimulation technology is essential in the oil and gas industry.However,current techniques for predicting rock fracture pressure in hydraulic fracturing face significant challenges in precision and reliability.Traditional approaches often result in inadequate accuracy due to the complex and diverse nature of underground formations.However,recent advances in computational power and optimization techniques have enabled the application of machine learning in mining operations,resulting in improved prediction and feedback.In this study,various machine learning techniques are employed to predict hydraulic fracturing pressure based on the concept of mechanical specific energy.Additionally,the study interprets the models through feature importance analysis.Thefindings suggest that most machine learning models deliver highly accurate predictions.Feature importance analysis indicates that for an approximate assessment of fracture pressure,the characteristics of well depth and torque are sufficient.For more precise predictions,incorporating additional characteristics from the mechanical specific energy framework into the machine learning model is essential.The study emphasizes the feasibility of employing machine learning methods to predict fracture pressure and their usefulness in determining optimal engineering sites.
基金Supported by Talent Scientific Research Start-up Foundation of Wannan Medical College,No.WYRCQD2023045.
文摘BACKGROUND The early diagnosis rate of pancreatic ductal adenocarcinoma(PDAC)is low and the prognosis is poor.It is important to develop an interpretable noninvasive early diagnostic model in clinical practice.AIM To develop an interpretable noninvasive early diagnostic model for PDAC using plasma extracellular vesicle long RNA(EvlRNA).METHODS The diagnostic model was constructed based on plasma EvlRNA data.During the process of establishing the model,EvlRNA-index was introduced,and four algorithms were adopted to calculate EvlRNA-index.After the model was successfully constructed,performance evaluation was conducted.A series of bioinformatics methods were adopted to explore the potential mechanism of EvlRNA-index as the input feature of the model.And the relationship between key characteristics and PDAC were explored at the single-cell level.RESULTS A novel interpretable machine learning framework was developed based on plasma EvlRNA.In this framework,a two-layer classifier was established.A new concept was proposed:EvlRNA-index.Based on EvlRNA-index,a cancer diagnostic model was established,and a good diagnostic effect was achieved.The accuracy of PDACandCPvsHealth-Probabilistic PCA Index-SVM(PDAC and chronic pancreatitis vs health-probabilistic principal component analysis index-support vector machine)(1-18)was 91.51%,with Mathew’s correlation coefficient 0.7760 and area under the curve 0.9560.In the second layer of the model,the accuracy of PDACvsCP-Probabilistic PCA Index-RF(PDAC vs chronic pancreatitis-probabilistic principal component analysis index-random forest)(2-17)was 93.83%,with Mathew’s correlation coefficient 0.8422 and area under the curve 0.9698.Forty-nine PDAC-related genes were identified,among which 16 were known,inferring that the remaining ones were also PDAC-related genes.CONCLUSION An interpretable two-layer machine learning framework was proposed for early diagnosis and prediction of PDAC based on plasma EvlRNA,providing new insights into the clinical value of EvlRNA.
基金supported by the Postdoctoral Fellowship Program of CPSF(Grant No.GZB20230685)the National Science Foundation of China(Grant No.42277161).
文摘Forecasting landslide deformation is challenging due to influence of various internal and external factors on the occurrence of systemic and localized heterogeneities.Despite the potential to improve landslide predictability,deep learning has yet to be sufficiently explored for complex deformation patterns associated with landslides and is inherently opaque.Herein,we developed a holistic landslide deformation forecasting method that considers spatiotemporal correlations of landslide deformation by integrating domain knowledge into interpretable deep learning.By spatially capturing the interconnections between multiple deformations from different observation points,our method contributes to the understanding and forecasting of landslide systematic behavior.By integrating specific domain knowledge relevant to each observation point and merging internal properties with external variables,the local heterogeneity is considered in our method,identifying deformation temporal patterns in different landslide zones.Case studies involving reservoir-induced landslides and creeping landslides demonstrated that our approach(1)enhances the accuracy of landslide deformation forecasting,(2)identifies significant contributing factors and their influence on spatiotemporal deformation characteristics,and(3)demonstrates how identifying these factors and patterns facilitates landslide forecasting.Our research offers a promising and pragmatic pathway toward a deeper understanding and forecasting of complex landslide behaviors.
基金This work was supported by the National Natural ScienceFoundation of China(No.U1862201,91834303 and 22208208)the China Postdoctoral Science Foundation(No.2022M712056)the China National Postdoctoral Program for Innovative Talents(No.BX20220205).
文摘The present study extracts human-understandable insights from machine learning(ML)-based mesoscale closure in fluid-particle flows via several novel data-driven analysis approaches,i.e.,maximal information coefficient(MIC),interpretable ML,and automated ML.It is previously shown that the solidvolume fraction has the greatest effect on the drag force.The present study aims to quantitativelyinvestigate the influence of flow properties on mesoscale drag correction(H_(d)).The MIC results showstrong correlations between the features(i.e.,slip velocity(u^(*)_(sy))and particle volume fraction(εs))and thelabel H_(d).The interpretable ML analysis confirms this conclusion,and quantifies the contribution of u^(*)_(sy),εs and gas pressure gradient to the model as 71.9%,27.2%and 0.9%,respectively.Automated ML without theneed to select the model structure and hyperparameters is used for modeling,improving the predictionaccuracy over our previous model(Zhu et al.,2020;Ouyang,Zhu,Su,&Luo,2021).
基金supported by ZALF Integrated Priority Project(IPP2022)“Co-designing smart,resilient,sustainable agricultural landscapes with cross-scale diversification”,Bundesministerium für Bildung und Forschung(BMBF)Land-Innovation-Lausitz project“Landschaftsinnovationen in der Lausitz für eine klimaangepasste Bioökonomie und naturnahen Bioökonomie-Tourismus”(03WIR3017A)BMBF project“Multi-modale Datenintegration,domänenspezifische Methoden und KI zur Stärkung der Datenkompetenz in der Agrarforschung”(16DKWN089)Brandenburgische Technische Universität Cottbus-Senftenberg GRS cluster project“Integrated analysis of Multifunctional Fruit production landscapes to promote ecosystem services and sustainable land-use under climate change”(GRS2018/19).
文摘Artificial intelligence and machine learning have been increasingly applied for prediction in agricultural science.However,many models are typically black boxes,meaning we cannot explain what the models learned from the data and the reasons behind predictions.To address this issue,I introduce an emerging subdomain of artificial intelligence,explainable artificial intelligence(XAI),and associated toolkits,interpretable machine learning.This study demonstrates the usefulness of several methods by applying them to an openly available dataset.The dataset includes the no-tillage effect on crop yield relative to conventional tillage and soil,climate,and management variables.Data analysis discovered that no-tillage management can increase maize crop yield where yield in conventional tillage is<5000 kg/ha and the maximum temperature is higher than 32°.These methods are useful to answer(i)which variables are important for prediction in regression/classification,(ii)which variable interactions are important for prediction,(iii)how important variables and their interactions are associated with the response variable,(iv)what are the reasons underlying a predicted value for a certain instance,and(v)whether different machine learning algorithms offer the same answer to these questions.I argue that the goodness of model fit is overly evaluated with model performance measures in the current practice,while these questions are unanswered.XAI and interpretable machine learning can enhance trust and explainability in AI.
基金supported by the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT,Ministry of Science and ICT)[RS-2025-005183388]supported by the"Regional Innovation System&Education(RISE)"through the Seoul RISE Center,funded by the Ministry of Education(MOE)and the Seoul Metropolitan Government(2025-RISE-01-018-01).
文摘We examine how machine learning models predict stock returns in the Korean market.By analyzing various firm characteristics and macroeconomic variables,we find that tree-based models outperform other machine learning approaches.This finding suggests that,in data-constrained contexts,moderately complex models outperform advanced methods that require extensive datasets.Using PFI,SHAP,and LIME,we consistently identify the 36-month momentum as the key predictor.PDP,ICE,and ALE analyses reveal threshold effects of 36-month momentum that diminish at higher return levels.Our findings underscore the value of ensemble-based methods in settings characterized by short data histories and heightened volatility.This study illustrates how multimethod interpretability can yield deeper economic insights,ultimately guiding more effective investment strategies and policy decisions.
基金funding support from the NSF Chemical Catalysis program(CHE-2102363)support from the NSF CBET Catalysis program(CBET-2245402)the US Department of Energy,Office of Basic Energy Sciences under contract no.DESC0023323.
文摘CONSPECTUS:Finding catalytic materials with optimal properties for sustainable chemical and energy transformations is one of the pressing challenges facing our society today.Traditionally,the discovery of catalysts or the philosopher’s stone of alchemists relies on a trial-and-error approach with physicochemical intuition.Decades-long advances in science and engineering,particularly in quantum chemistry and computing infrastructures,popularize a paradigm of computational science for materials discovery.However,the brute-force search through a vast chemical space is hampered by its formidable cost.In recent years,machine learning(ML)has emerged as a promising approach to streamline the design of active sites by learning from data.As ML is increasingly employed to make predictions in practical settings,the demand for domain interpretability is surging.Therefore,it is of great importance to provide an in-depth review of our efforts in tackling this challenging issue in computational heterogeneous catalysis.In this Account,we present an interpretable ML framework for accelerating catalytic materials design,particularly in driving sustainable carbon,nitrogen,and oxygen cycles.By leveraging the linear adsorption-energy scaling and Bronsted−Evans−Polanyi(BEP)relationships,catalytic outcomes(i.e.,activity,selectivity,and stability)of a multistep reaction can often be mapped onto one or two kinetics-informed descriptors.One type of descriptor of great importance is the adsorption energies of representative species at active site motifs that can be computed from quantum-chemical simulations.To complement such a descriptor-based design strategy,we delineate our endeavors in incorporating domain knowledge into a datadriven ML workflow.We demonstrate that the major drawbacks of black-box ML algorithms,e.g.,poor explainability,can be largely circumvented by employing(1)physics-inspired feature engineering,(2)Bayesian statistical learning,and(3)theory-infused deep neural networks.The framework drastically facilitates the design of heterogeneous metal-based catalysts,some of which have been experimentally verified for an array of sustainable chemistries.We offer some remarks on the existing challenges,opportunities,and future directions of interpretable ML in predicting catalytic materials and,more importantly,on advancing catalysis theory beyond conventional wisdom.We envision that this Account will attract more researchers’attention to develop highly accurate,easily explainable,and trustworthy materials design strategies,facilitating the transition to the data science paradigm for sustainability through catalysis.
基金This research was funded by the National Natural Science Foundation of China(Nos.71761147001 and 42030707)the International Partnership Program by the Chinese Academy of Sciences(No.121311KYSB20190029)+2 种基金the Fundamental Research Fund for the Central Universities(No.20720210083)the National Science Foundation(Nos.EF-1638679,EF-1638554,EF-1638539,and EF-1638550)Any use of trade,firm,or product names is for descriptive purposes only and does not imply endorsement by the US Government.
文摘The identification of factors that may be forcing ecological observations to approach the upper boundary provides insight into potential mechanisms affecting driver-response relationships,and can help inform ecosystem management,but has rarely been explored.In this study,we propose a novel framework integrating quantile regression with interpretable machine learning.In the first stage of the framework,we estimate the upper boundary of a driver-response relationship using quantile regression.Next,we calculate“potentials”of the response variable depending on the driver,which are defined as vertical distances from the estimated upper boundary of the relationship to observations in the driver-response variable scatter plot.Finally,we identify key factors impacting the potential using a machine learning model.We illustrate the necessary steps to implement the framework using the total phosphorus(TP)-Chlorophyll a(CHL)relationship in lakes across the continental US.We found that the nitrogen to phosphorus ratio(N:P),annual average precipitation,total nitrogen(TN),and summer average air temperature were key factors impacting the potential of CHL depending on TP.We further revealed important implications of our findings for lake eutrophication management.The important role of N:P and TN on the potential highlights the co-limitation of phosphorus and nitrogen and indicates the need for dual nutrient criteria.Future wetter and/or warmer climate scenarios can decrease the potential which may reduce the efficacy of lake eutrophication management.The novel framework advances the application of quantile regression to identify factors driving observations to approach the upper boundary of driver-response relationships.
基金supported by the National Key Research and Development Program of China(No.2022YFB3807200)the Natural Science Foundation of Fujian Province(No.2024J01262)the National Natural Science Foundation of China(No.52201022).
文摘Machine learning(ML)techniques have made enormous progress in the field of materials science.However,many conventional ML algorithms operate as“blackboxes”,lacking transparency in revealing explicit relationships between material features and target properties.To address this,the development of interpretable ML models is essential to drive further advancements in AI-driven materials discovery.In this study,we present an interpretable framework that combines traditional machine learning with symbolic regression,using Janus III–VI vdW heterostructures as a case study.This approach enables fast and accurate predictions of stability and electronic structure.Our results demonstrate that the prediction accuracy using the classification model for stability,based on formation energy,reaches 0.960.On the other hand,the R2,MAE,and RMSE value using the regression model for electronic structure prediction,based on band gap,achieves 0.927,0.113,and 0.141 on the testing set,respectively.Additionally,we identify a universal interpretable descriptor comprising five simple parameters that reveals the underlying physical relationships between the candidate heterostructures and their band gaps.This descriptor not only delivers high accuracy in band gap prediction but also provides explicit physical insight into the material properties.