The authors regret that the original publication of this paper did not include Jawad Fayaz as a co-author. After further discussions and a thorough review of the research contributions, it was agreed that his significant contributions to the foundational aspects of the research warranted recognition, and he has now been added as a co-author.
Nirmal et al. presented a machine learning-based design of ternary organic solar cells, utilizing feature importance [1]. This paper highlights the alarming potential biases in the use of feature importance in machine learning, which can lead to incorrect conclusions and outcomes. Many scientists and researchers, including Nirmal et al., are unaware that feature importances in machine learning are in general model-specific and do not necessarily represent true associations between the target and features.
Although machine learning models have achieved high accuracy in predicting shield position deviations, their “black box” nature makes the prediction mechanisms and decision-making processes opaque, which weakens their explainability and practicality. This study introduces a novel explainable deep learning framework comprising the Informer model with enhanced attention mechanisms (EAMInfor) and deep learning important features (DeepLIFT), aimed at improving the prediction accuracy of shield position deviations and providing interpretability for predictive results. The EAMInfor model integrates channel attention, spatial attention, and simple attention modules to improve the Informer model's performance. The framework is tested on datasets from four different geological conditions, generated from Xiamen Metro Line 3, China. Results show that the EAMInfor model outperforms the traditional Informer and comparison models. The analysis with the DeepLIFT method indicates that the thrust of the push cylinder and the earth chamber pressure are the most significant features, while the stroke length of the push cylinder shows lower importance. Furthermore, the variation trends in the significance of data points within input sequences differ substantially between single and composite strata. This framework not only improves predictive accuracy but also strengthens the credibility and reliability of the results.
Accurate purchase prediction in e-commerce critically depends on the quality of behavioral features. This paper proposes a layered and interpretable feature engineering framework that organizes user signals into three layers: Basic, Conversion & Stability (efficiency and volatility across actions), and Advanced Interactions & Activity (cross-behavior synergies and intensity). Using real Taobao (Alibaba's primary e-commerce platform) logs (57,976 records for 10,203 users; 25 November to 03 December 2017), we conducted a hierarchical, layer-wise evaluation that holds data splits and hyperparameters fixed while varying only the feature set to quantify each layer's marginal contribution. Across logistic regression (LR), decision tree, random forest, XGBoost, and CatBoost models with stratified 5-fold cross-validation, the performance improved monotonically from Basic to Conversion & Stability to Advanced features. With LR, F1 increased from 0.613 (Basic) to 0.962 (Advanced); boosted models achieved high discrimination (AUC of 0.995) and an F1 score up to 0.983. Calibration and precision-recall analyses indicated strong ranking quality; we acknowledge potential dataset and period biases given the short (9-day) window. By making feature contributions measurable and reproducible, the framework complements model-centric advances and offers a transparent blueprint for production-grade behavioral modeling. The code and processed artifacts are publicly available, and future work will extend the validation to longer, seasonal datasets and hybrid approaches that combine automated feature learning with domain-driven design.
This study was conducted to enable prompt classification of malware, which is becoming increasingly sophisticated. To do this, we analyzed the important features of malware and the relative importance of the selected features according to a learning model, to assess how those important features were identified. Initially, the analysis features were extracted using Cuckoo Sandbox, an open-source malware analysis tool; the features were then divided into five categories using the extracted information. The 804 extracted features were reduced by 70% by selecting only those most suitable for malware classification with recursive feature elimination, a learning model-based feature selection method. Next, these important features were analyzed, and the contribution of each one was assessed with a Random Forest classifier. The results showed that system call features made up the largest share of the important features. In the end, it was possible to accurately identify the malware type using only 36 to 76 features for each of the four malware types with the most analysis samples available: Trojan, Adware, Downloader, and Backdoor.
Converting CO_(2) with green hydrogen to methanol as a carbon-neutral liquid fuel is a promising route for the long-term storage and distribution of intermittent renewable energy. Nevertheless, attaining highly efficient methanol synthesis catalysts from the vast composition space remains a significant challenge. Here we present a machine learning framework for accelerating the development of high space-time yield (STY) methanol synthesis catalysts. A database of methanol synthesis catalysts has been compiled, consisting of catalyst composition, preparation parameters, structural characteristics, reaction conditions and their corresponding catalytic performance. A methodology for constructing catalyst features based on the intrinsic physicochemical properties of the catalyst components has been developed, which significantly reduced the data dimensionality and enhanced the efficiency of machine learning operations. Two high-precision machine learning models were trained to predict the activity and product selectivity of catalysts. Using this machine learning framework, an efficient search was achieved within the catalyst composition space, leading to the successful identification of high-STY multielement oxide methanol synthesis catalysts. Notably, the CuZnAlTi catalyst achieved high STYs of 0.49 and 0.65 g_(MeOH)/(g_(catalyst)·h) for CO_(2) and CO hydrogenation to methanol at 250 °C, respectively, and the STY was further increased to 2.63 g_(MeOH)/(g_(catalyst)·h) in CO and CO_(2) co-hydrogenation.
The corrosion performance of oxide dispersion strengthened (ODS) steel is crucial for supercritical water-cooled reactor (SCWR) applications. Machine learning (ML) models were established to predict the mass gain of ODS steels under corrosion conditions (i.e., supercritical water), thereby evaluating their corrosion resistance. The grain and particle morphologies and the crystal and interface structures of nanoparticles of six ODS steels were studied by transmission electron microscopy, scanning transmission electron microscopy, and high-resolution transmission electron microscopy. Among the six ML models employed, the LightGBM (LGBM) model shows the highest accuracy in predicting the mass gain of ODS steels (root mean square error of 43.18 mg/dm^(2) and 50.21 mg/dm^(2), mean absolute error of 25.91 mg/dm^(2) and 27.82 mg/dm^(2), and coefficient of determination R^(2) of 0.97 and 0.96 for the training set and testing set, respectively). The LGBM feature importance coefficients were also used to quantify the influence of each feature on corrosion resistance. Among the microstructural features, the parameters that most strongly influence corrosion resistance are inter-particle spacing and grain diameter, with importance scores of 73 and 63, respectively. Moreover, there is a strong synergistic influence between Cr and Al on the corrosion resistance of ODS steels. This efficient and accurate LGBM model not only enhances the understanding of ODS steel corrosion mechanisms but also provides valuable insights for the targeted optimization and design of high-performance ODS alloys.
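The three accuracy measures quoted in this abstract (root mean square error, mean absolute error, and R^(2)) can be sketched as follows. This is a minimal illustration using the standard definitions; the function names and the mass-gain values are hypothetical and are not taken from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical mass-gain values in mg/dm^2 (illustration only).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
```

Reporting all three together is useful because RMSE penalizes large errors more heavily than MAE, while R^(2) is unitless and relative to the variance of the observations.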
Predicting fracture intensity is essential for optimising reservoir production and mitigating drilling risks in the Brazilian pre-salt layer. However, previous studies rely excessively on conceptual models and typically do not integrate multiple types of data to perform this task. Moreover, to date, no feasibility-like studies have assessed the reasonableness of such approaches. We propose a data-driven approach that utilises upscaled well logs (Young's modulus, Poisson's ratio, and silica content) alongside seismic attributes (curvature, distance to fault) to predict fracture intensity. The distance to fault is measured using the fault probability volume estimated by a pre-trained convolutional neural network (CNN). We evaluate the effectiveness of this data-driven approach employing two tree-ensemble models, eXtreme Gradient Boosting (XGBoost) and Random Forest, to estimate the volumetric fracture intensity (P32) in the wells. Regression and residual analyses indicate that XGBoost outperforms Random Forest. Results from feature importance methods, such as permutation importance and SHapley Additive exPlanations (SHAP), highlight curvature as the most important feature, followed by distance to fault, Young's modulus (or P-Impedance), silica content, and Poisson's ratio. The approach has been validated with rock sampling information and two blind tests. Consequently, we believe this workflow can be applied to other wells in nearby fields. The study offers a valuable tool for quantitatively estimating fracture intensity in pre-salt reservoirs. Future research may use this study as a reference for estimating fracture intensity within a seismic volume. The predicted fracture intensity estimates can enhance the reliability of reservoir porosity models and serve as a geohazard indicator to mitigate drilling risks.
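Permutation importance, one of the feature importance methods named in this abstract, can be sketched in a few lines: shuffle one feature column at a time and record how much the model error grows. The sketch below is a generic illustration under stated assumptions, not the authors' workflow; `toy_model`, the synthetic data, and all parameter names are hypothetical.

```python
import random

def mse(model, X, y):
    """Mean squared error of a callable model over rows X and targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic data: the hypothetical model depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]

def toy_model(row):
    return 5.0 * row[0] + 1.0 * row[1]

y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

Because the score is measured through a specific fitted model, it describes what that model relies on, which need not coincide with the true association between a feature and the target.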
We examine how machine learning models predict stock returns in the Korean market. By analyzing various firm characteristics and macroeconomic variables, we find that tree-based models outperform other machine learning approaches. This finding suggests that, in data-constrained contexts, moderately complex models outperform advanced methods that require extensive datasets. Using PFI, SHAP, and LIME, we consistently identify the 36-month momentum as the key predictor. PDP, ICE, and ALE analyses reveal threshold effects of 36-month momentum that diminish at higher return levels. Our findings underscore the value of ensemble-based methods in settings characterized by short data histories and heightened volatility. This study illustrates how multimethod interpretability can yield deeper economic insights, ultimately guiding more effective investment strategies and policy decisions.
PM_(2.5) constitutes a complex and diverse mixture that significantly impacts the environment, human health, and climate change. However, existing observation and numerical simulation techniques have limitations, such as a lack of data, high acquisition costs, and multiple uncertainties. These limitations hinder the acquisition of comprehensive information on PM_(2.5) chemical composition and the effective implementation of refined air pollution protection and control strategies. In this study, we developed an optimal deep learning model to acquire hourly mass concentrations of key PM_(2.5) chemical components without complex chemical analysis. The model was trained using a randomly partitioned multivariate dataset arranged in chronological order, including atmospheric state indicators, which previous studies did not consider. Our results showed that the correlation coefficients of key chemical components were no less than 0.96, and the root mean square errors ranged from 0.20 to 2.11 μg/m^(3) for the entire process (training and testing combined). The model accurately captured the temporal characteristics of key chemical components, outperforming typical machine-learning models, previous studies, and global reanalysis datasets (such as Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) and Copernicus Atmosphere Monitoring Service ReAnalysis (CAMSRA)). We also quantified the feature importance using the random forest model, which showed that PM_(2.5), PM_(1), visibility, and temperature were the most influential variables for key chemical components. In conclusion, this study presents a practical approach to accurately obtain chemical composition information that can contribute to filling in missing data and to improved air pollution monitoring and source identification. This approach has the potential to enhance air pollution control strategies and promote public health and environmental sustainability.
PM_(1.0), particulate matter with an aerodynamic diameter smaller than 1.0 μm, can adversely affect human health. However, fewer stations are capable of measuring PM_(1.0) concentrations than PM2.5 and PM10 concentrations in real time (i.e., only 9 locations for PM_(1.0) vs. 623 locations for PM2.5 or PM10) in South Korea, making it impossible to conduct a nationwide health risk analysis of PM_(1.0). Thus, this study aimed to develop a PM_(1.0) prediction model using a random forest algorithm based on PM_(1.0) data from the nine measurement stations and various environmental input factors. Cross validation, in which the model was trained on eight stations and tested on the remaining station, achieved an average R^(2) of 0.913. The high R^(2) value achieved under mutually exclusive training and test locations in the cross validation can be ascribed to the fact that all the locations had similar relationships between PM_(1.0) and the input factors, which were captured by our model. Moreover, results of feature importance analysis showed that PM2.5 and PM10 concentrations were the two most important input features in predicting PM_(1.0) concentration. Finally, the model was used to estimate the PM_(1.0) concentrations in 623 locations, where input factors such as PM2.5 and PM10 can be obtained. Based on the augmented profile, we identified Seoul and Ansan to be PM_(1.0) concentration hotspots. These regions are large cities or centers of anthropogenic and industrial activities. The proposed model and the augmented PM_(1.0) profiles can be used for large epidemiological studies to understand the health impacts of PM_(1.0).
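The cross-validation scheme described in this abstract (train on eight stations, test on the remaining one, repeat for each station) can be sketched as a simple fold generator. This is an illustrative sketch only; the station labels and function names are hypothetical, and the model fitting step is left out.

```python
def leave_one_station_out(stations):
    """Yield (training_stations, held_out_station) pairs, one per station."""
    for held_out in stations:
        training = [s for s in stations if s != held_out]
        yield training, held_out

def mean_score(scores):
    """Average a per-fold metric such as R^2 across all folds."""
    return sum(scores) / len(scores)

# Hypothetical labels for the nine PM_(1.0) measurement stations.
stations = [f"station_{i}" for i in range(1, 10)]
folds = list(leave_one_station_out(stations))

for training, held_out in folds:
    print(f"train on {len(training)} stations, test on {held_out}")
```

Keeping training and test locations mutually exclusive matters here because the trained model is later applied to the 623 locations that have no PM_(1.0) measurements at all.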
Prediction of tunneling-induced ground settlements is an essential task, particularly for tunneling in urban settings. Ground settlements should be limited within a tolerable threshold to avoid damage to aboveground structures. Machine learning (ML) methods are becoming popular in many fields, including tunneling and underground excavations, as a powerful learning and predicting technique. However, the available datasets collected from a tunneling project are usually small from the perspective of applying ML methods. Can ML algorithms effectively predict tunneling-induced ground settlements when the available datasets are small? In this study, seven ML methods are utilized to predict tunneling-induced ground settlement using 14 contributing factors measured before or during tunnel excavation. These methods include multiple linear regression (MLR), decision tree (DT), random forest (RF), gradient boosting (GB), support vector regression (SVR), back-propagation neural network (BPNN), and permutation importance-based BPNN (PI-BPNN) models. All methods except BPNN and PI-BPNN are shallow-structure ML methods. The effectiveness of these seven ML approaches on small datasets is evaluated using model accuracy and stability. The model accuracy is measured by the coefficient of determination (R^(2)) of training and testing datasets, and the stability of a learning algorithm indicates robust predictive performance. Also, the quantile error (QE) criterion is introduced to assess model predictive performance considering underpredictions and overpredictions. Our study reveals that the RF algorithm outperforms all the other models with the highest model prediction accuracy (0.9) and stability (3.02×10^(-27)). Deep-structure ML models perform poorly on small datasets, with relatively low model accuracy (0.59) and stability (5.76). The PI-BPNN architecture is proposed and designed for small datasets, showing better performance than the typical BPNN. Six important contributing factors of ground settlements are identified, including tunnel depth, the distance between tunnel face and surface monitoring points (DTM), weighted average soil compressibility modulus (ACM), grouting pressure, penetrating rate and thrust force.
Alluvial fans are an important land resource in the Qinghai-Tibet Plateau given the expansion of human activities. However, the factors of alluvial fan development are poorly understood. According to our previous investigation and research, approximately 826 alluvial fans exist in the Lhasa River Basin (LRB). The main purpose of this work is to identify the main influencing factors by using machine learning. A development index (Di) of alluvial fans was created by combining area, perimeter, height and gradient. Seventy-two percent of the data, comprising Di and 11 types of environmental parameters of the catchment matching each alluvial fan, were used with 10 commonly used machine learning algorithms to train and build models. A further 18% of the data were used to validate the models, and the remaining 10% were used to test model accuracy. The feature importance of the model was used to illustrate the significance of the 11 types of environmental parameters to Di. The primary modelling results showed that the accuracy (R^(2)) of the ensemble models, including Gradient Boosting Decision Tree, Random Forest and XGBoost, is not less than 0.5. The accuracy of the Gradient Boosting Decision Tree and XGBoost improved after grid search, and their R^(2) values are 0.782 and 0.870, respectively. XGBoost was selected as the final model due to its optimal accuracy and generalisation ability at the sites closest to the LRB. Morphology parameters are the main factors in alluvial fan development, with a cumulative relative feature importance of 74.60% in XGBoost. The final model is expected to have better accuracy and generalisation ability after adding training samples from other regions.
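The 72% / 18% / 10% partition described in this abstract can be sketched as a shuffled index split. This is a minimal sketch, assuming a simple random partition with rounded subset sizes; the abstract does not state how the split was drawn, and the function name, seed, and rounding rule are hypothetical. The sample count of 826 is the approximate number of alluvial fans quoted above.

```python
import random

def split_indices(n, val_frac=0.18, test_frac=0.10, seed=0):
    """Shuffle indices 0..n-1 and partition them into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_frac)   # assumption: sizes are rounded to integers
    n_val = round(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]  # the remainder, roughly 72% of the data
    return train, val, test

train, val, test = split_indices(826)
print(len(train), len(val), len(test))
```

The validation subset is the one used for tuning steps such as grid search, so the final 10% stays untouched until the accuracy test.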
Accurately estimating the interfacial bond capacity of the near-surface mounted (NSM) carbon fiber-reinforced polymer (CFRP) to concrete joint is a fundamental task in the strengthening and retrofit of existing reinforced concrete (RC) structures. The machine learning (ML) approach may provide an alternative to the commonly used semi-empirical or semi-analytical methods. Therefore, in this work we have developed a predictive model based on an artificial neural network (ANN) approach, i.e., using a back propagation neural network (BPNN), to map the complex data pattern obtained from an NSM CFRP to concrete joint. It involves a set of nine material and geometric input parameters and one output value. Moreover, by employing the neural interpretation diagram (NID) technique, the BPNN model becomes interpretable, as the influence of each input variable on the model can be tracked and quantified based on the connection weights of the neural network. An extensive database including 163 pull-out testing samples, collected from the authors' research group and from published results in the literature, is used to train and verify the ANN. Our results show that the prediction given by the BPNN model agrees well with the experimental data and yields a coefficient of determination of 0.957 on the whole database. After removing one non-significant feature, the BPNN becomes even more computationally efficient and accurate. In addition, compared with the existing semi-analytical model, the ANN-based approach provides a more accurate estimation. Therefore, the proposed ML method may be a promising alternative for structural engineers predicting the bond strength of NSM CFRP to concrete joints.
The complex sand-casting process, combined with the interactions between process parameters, makes it difficult to control the casting quality, resulting in a high scrap rate. A strategy based on a data-driven model was proposed to reduce casting defects and improve production efficiency; it includes a random forest (RF) classification model, feature importance analysis, and process parameter optimization with Monte Carlo simulation. The collected data, which include four types of defects and the corresponding process parameters, were used to construct the RF model. Classification results show a recall rate above 90% for all categories. The Gini index was used to assess the importance of the process parameters in the formation of various defects in the RF model. Finally, the classification model was applied to different production conditions for quality prediction. In the case of process parameter optimization for gas porosity defects, this model serves as the experimental process in the Monte Carlo method to estimate a better temperature distribution. The prediction model, when applied in the factory, greatly improved the efficiency of defect detection. Results show that the scrap rate decreased from 10.16% to 6.68%.
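The Gini index mentioned in this abstract is commonly used in random forests as an impurity measure, with a feature's importance accumulated from the impurity decrease of the splits that use it. The sketch below shows that generic calculation for a single split; it assumes the usual mean-decrease-in-impurity interpretation, which the abstract does not spell out, and the labels and function names are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Impurity reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

# Hypothetical castings: a split on some process parameter that separates
# the defect class perfectly from sound castings.
parent = ["gas_porosity"] * 4 + ["sound"] * 4
left, right = parent[:4], parent[4:]

print(gini_impurity(parent), impurity_decrease(parent, left, right))
```

Summing such decreases over every split that uses a given process parameter, across all trees, gives that parameter's Gini-based importance.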
Time delay is a common phenomenon in complex industrial processes and dynamical systems, and it is a difficult problem to deal with in industrial control systems, including those in the textile field. Accurate identification of the time delay can greatly improve the efficiency of designing industrial process control systems. Time delay identification methods based on mathematical modeling require prior knowledge of the structural information of the model, especially for nonlinear systems. Neural network-based identification methods can predict the time delay of the system but cannot accurately obtain the specific parameters of the time delay. Benefiting from the interpretability of machine learning, a novel method for delay identification based on an interpretable regression decision tree is proposed. Utilizing the self-explanatory analysis of the decision tree model, the parameters with the highest feature importance are obtained to identify the time delay of the system. Excellent results are obtained on simulation data of linear and nonlinear control systems, and the time delay of the systems can be accurately identified.
The prediction of liquefaction-induced lateral spreading/displacement (Dh) is a challenging task for civil/geotechnical engineers. In this study, a new approach is proposed to predict Dh using gene expression programming (GEP). Based on statistical reasoning, individual models were developed for two topographies: free-face and gently sloping ground. Along with a comparison with conventional approaches for predicting Dh, four additional regression-based soft computing models, i.e., Gaussian process regression (GPR), relevance vector machine (RVM), sequential minimal optimization regression (SMOR), and M5-tree, were developed and compared with the GEP model. The results indicate that the GEP models predict Dh with less bias, as evidenced by the root mean square error (RMSE) and mean absolute error (MAE) for training (i.e., 1.092 and 0.815; and 0.643 and 0.526) and for testing (i.e., 0.89 and 0.705; and 0.773 and 0.573) in free-face and gently sloping ground topographies, respectively. The overall performance for the free-face topography was ranked as follows: GEP > RVM > M5-tree > GPR > SMOR, with total scores of 40, 32, 24, 15, and 10, respectively. For the gently sloping condition, the performance was ranked as follows: GEP > RVM > GPR > M5-tree > SMOR, with total scores of 40, 32, 21, 19, and 8, respectively. Finally, the results of the sensitivity analysis showed that for both free-face and gently sloping ground, the liquefiable layer thickness (T_(15)) was the major parameter, with percentage deterioration (%D) values of 99.15 and 90.72, respectively.
As an essential property of frozen soils, the change of unfrozen water content (UWC) with temperature, namely the soil-freezing characteristic curve (SFCC), plays significant roles in numerous physical, hydraulic and mechanical processes in cold regions, including the heat and water transfer within soils and at the land-atmosphere interface, frost heave and thaw settlement, as well as the simulation of coupled thermo-hydro-mechanical interactions. Although various models have been proposed to estimate SFCC, their applicability remains limited due to their derivation from specific soil types, soil treatments, and test devices. Accordingly, this study proposes a novel data-driven model to predict the SFCC using an extreme Gradient Boosting (XGBoost) model. A systematic database for SFCC of frozen soils compiled from extensive experimental investigations via various testing methods was utilized to train the XGBoost model. The predicted soil-freezing characteristic curves (SFCC, UWC as a function of temperature) from the well-trained XGBoost model were compared with original experimental data and three conventional models. The results demonstrate the superior performance of the proposed XGBoost model over the traditional models in predicting SFCC. This study provides valuable insights for future investigations regarding the SFCC of frozen soils.
Breast cancer is one of the most common cancers among women in the world, with more than two million new cases every year. This disease is associated with numerous clinical and genetic characteristics. In recent years, machine learning technology has been increasingly applied to the medical field, including predicting the risk of malignant tumors such as breast cancer. Based on clinical and targeted sequencing data of 1980 primary breast cancer samples, this article aimed to analyze these data and predict living status after breast cancer. After data engineering, feature selection, and comparison of machine learning methods, the light gradient boosting machine model performed best after hyperparameter tuning (precision = 0.818, recall = 0.816, F1 score = 0.817, ROC-AUC = 0.867). The top five determinants were the clinical features age at diagnosis, Nottingham Prognostic Index, and cohort, and the genetic features rheb and nr3c1. The study sheds light on the rational allocation of medical resources and provides insights into the early prevention, diagnosis and treatment of breast cancer through the identified clinical and genetic risk factors.
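The F1 score quoted in this abstract is the harmonic mean of precision and recall, so it can be checked against the other two reported values. The short sketch below uses the standard formula; the function name is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f1 = f1_from_precision_recall(0.818, 0.816)
print(round(f1, 3))  # 0.817
```

The result agrees with the reported F1 score of 0.817, which indicates the three figures are internally consistent.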
Energy is an issue of strategic importance for China's overall economic and social development, one that needs systematic planning and far-sighted deliberation. At present, the revolution in energy technology is advancing rapidly. Global innovation in energy technology has entered a highly dynamic period featured by multi-point breakthroughs,
基金supported by the National Natural Science Foundation of China(Grant Nos.52378392,52408356)the Foal Eagle Program Youth Top-notch Talent Project of Fujian Province,China(Grant No.00387088).
Abstract: Although machine learning models have achieved sufficiently high accuracy in predicting shield position deviations, their “black box” nature makes the prediction mechanisms and decision-making processes opaque, leading to weaker explanations and practicability. This study introduces a novel explainable deep learning framework comprising the Informer model with enhanced attention mechanisms (EAMInfor) and deep learning important features (DeepLIFT), aimed at improving the prediction accuracy of shield position deviations and providing interpretability for predictive results. The EAMInfor model integrates channel attention, spatial attention, and simple attention modules to improve the Informer model's performance. The framework is tested on datasets from four different geological conditions generated from Xiamen Metro Line 3, China. Results show that the EAMInfor model outperforms the traditional Informer and comparison models. The analysis with the DeepLIFT method indicates that the thrust of the push cylinder and the earth chamber pressure are the most significant features, while the stroke length of the push cylinder shows lower importance. Furthermore, the variation trends in the significance of data points within input sequences differ substantially between single and composite strata. This framework not only improves predictive accuracy but also strengthens the credibility and reliability of the results.
Funding: Supported by the research fund of Hanyang University (HY-202500000001616).
Abstract: Accurate purchase prediction in e-commerce critically depends on the quality of behavioral features. This paper proposes a layered and interpretable feature engineering framework that organizes user signals into three layers: Basic, Conversion & Stability (efficiency and volatility across actions), and Advanced Interactions & Activity (cross-behavior synergies and intensity). Using real Taobao (Alibaba’s primary e-commerce platform) logs (57,976 records for 10,203 users; 25 November–03 December 2017), we conducted a hierarchical, layer-wise evaluation that holds data splits and hyperparameters fixed while varying only the feature set to quantify each layer’s marginal contribution. Across logistic regression (LR), decision tree, random forest, XGBoost, and CatBoost models with stratified 5-fold cross-validation, performance improved monotonically from Basic to Conversion & Stability to Advanced features. With LR, F1 increased from 0.613 (Basic) to 0.962 (Advanced); boosted models achieved high discrimination (AUC of 0.995) and an F1 score of up to 0.983. Calibration and precision–recall analyses indicated strong ranking quality, and we acknowledge potential dataset and period biases given the short (9-day) window. By making feature contributions measurable and reproducible, the framework complements model-centric advances and offers a transparent blueprint for production-grade behavioral modeling. The code and processed artifacts are publicly available, and future work will extend the validation to longer, seasonal datasets and hybrid approaches that combine automated feature learning with domain-driven design.
Funding: Supported by the Research Program through the National Research Foundation of Korea, NRF-2018R1D1A1B07050864.
Abstract: This study was conducted to enable prompt classification of malware, which is becoming increasingly sophisticated. To do this, we analyzed the important features of malware and the relative importance of the selected features according to a learning model, to assess how those important features were identified. Initially, the analysis features were extracted using Cuckoo Sandbox, an open-source malware analysis tool, and the features were then divided into five categories using the extracted information. The 804 extracted features were reduced by 70% by selecting only those most suitable for malware classification using a learning model-based feature selection method called recursive feature elimination. Next, these important features were analyzed. The level of contribution of each one was assessed with a Random Forest classifier. The results showed that system call features made up most of the important features. Finally, it was possible to accurately identify the malware type using only 36 to 76 features for each of the four malware types with the most analysis samples available: Trojan, Adware, Downloader, and Backdoor.
Funding: Supported by the Zhejiang Provincial Natural Science Foundation of China (LDT23E06012E06), the National Key R&D Program of China (2023YFC3710800), the National Energy-Saving and Low-Carbon Materials Production and Application Demonstration Platform Program (TC220H06N), the Pioneer R&D Program of Zhejiang Province, China (2024SSYS0066, 2023C03016), the National Natural Science Foundation of China (42341208), and the Zhejiang Energy Group Research Fund (ZNKJ-2023-100).
Abstract: Converting CO₂ with green hydrogen to methanol as a carbon-neutral liquid fuel is a promising route for the long-term storage and distribution of intermittent renewable energy. Nevertheless, finding highly efficient methanol synthesis catalysts in the vast composition space remains a significant challenge. Here we present a machine learning framework for accelerating the development of high space-time yield (STY) methanol synthesis catalysts. A database of methanol synthesis catalysts has been compiled, consisting of catalyst composition, preparation parameters, structural characteristics, reaction conditions, and the corresponding catalytic performance. A methodology for constructing catalyst features based on the intrinsic physicochemical properties of the catalyst components has been developed, which significantly reduced the data dimensionality and enhanced the efficiency of machine learning operations. Two high-precision machine learning models were trained to predict the activity and product selectivity of catalysts. Using this framework, an efficient search was carried out within the catalyst composition space, leading to the successful identification of high-STY multielement oxide methanol synthesis catalysts. Notably, the CuZnAlTi catalyst achieved high STYs of 0.49 and 0.65 g_MeOH/(g_catalyst·h) for CO₂ and CO hydrogenation to methanol at 250 °C, respectively, and the STY was further increased to 2.63 g_MeOH/(g_catalyst·h) in CO and CO₂ co-hydrogenation.
Funding: Sponsored by the National Natural Science Foundation of China (Grant Nos. 52171004, 52471066, and 51871034).
Abstract: The corrosion performance of oxide dispersion strengthened (ODS) steel is crucial for SCWR applications. Machine learning (ML) models were established to predict the mass gain of ODS steels under corrosion conditions (i.e., supercritical water), thereby evaluating their corrosion resistance. The grain and particle morphologies and the crystal and interface structures of nanoparticles in six ODS steels were studied by transmission electron microscopy, scanning transmission electron microscopy, and high-resolution transmission electron microscopy. Among the six ML models employed, the LightGBM (LGBM) model shows the highest accuracy in predicting the mass gain of ODS steels (root mean square error of 43.18 mg/dm² and 50.21 mg/dm², mean absolute error of 25.91 mg/dm² and 27.82 mg/dm², and coefficient of determination R² of 0.97 and 0.96 for the training set and testing set, respectively). The LGBM feature importance coefficients were also used to indicate the degree of influence of each feature on corrosion resistance. Among microstructural features, the parameters that most strongly influence corrosion resistance are inter-particle spacing and grain diameter, with importance scores of 73 and 63, respectively. Moreover, there is a strong synergistic influence of Cr and Al on the corrosion resistance of ODS steels. This efficient and accurate LGBM model not only enhances the understanding of ODS steel corrosion mechanisms but also provides valuable insights for the targeted optimization and design of high-performance ODS alloys.
Abstract: Predicting fracture intensity is essential for optimising reservoir production and mitigating drilling risks in the Brazilian pre-salt layer. However, previous studies rely excessively on conceptual models and typically do not integrate multiple types of data to perform this task. Moreover, to date, no feasibility-like studies have assessed the reasonableness of such approaches. We propose a data-driven approach that utilises upscaled well logs (Young's modulus, Poisson's ratio, and silica content) alongside seismic attributes (curvature, distance to fault) to predict fracture intensity. The distance to fault is measured using the fault probability volume estimated by a pre-trained convolutional neural network (CNN). We evaluate the effectiveness of this data-driven approach by employing two tree-ensemble models, eXtreme Gradient Boosting (XGBoost) and Random Forest, to estimate the volumetric fracture intensity (P32) in the wells. Regression and residual analyses indicate that XGBoost outperforms Random Forest. Results from feature importance methods, such as permutation importance and Shapley Additive explanations (SHAP), highlight curvature as the most important feature, followed by distance to fault, Young's modulus (or P-Impedance), silica content, and Poisson's ratio. The approach has been validated with rock sampling information and two blind tests. Consequently, we believe this workflow can be applied to other wells in nearby fields. The study offers a valuable tool for quantitatively estimating fracture intensity in pre-salt reservoirs. Future research may use this study as a reference for estimating fracture intensity within a seismic volume. The predicted fracture intensity estimates can enhance the reliability of reservoir porosity models and serve as a geohazard indicator to mitigate drilling risks.
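Permutation importance, one of the feature importance methods named in this abstract, can be illustrated with a minimal sketch. The function names, the toy data, and the fixed linear `toy_model` below are assumptions made for illustration only; they are not the authors' XGBoost workflow or data.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, which breaks its link to the target
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: the target depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [5.0 * a + 0.5 * b for a, b, _ in X]


def toy_model(row):
    """Stand-in for a fitted model (assumed, not a trained estimator)."""
    return 5.0 * row[0] + 0.5 * row[1]


scores = permutation_importance(toy_model, X, y)
```

Because the score is computed from the fitted model's predictions, the ranking it yields describes that model rather than the underlying physical relationship.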
Funding: Supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT, Ministry of Science and ICT) [RS-2025-005183388] and by the "Regional Innovation System & Education (RISE)" program through the Seoul RISE Center, funded by the Ministry of Education (MOE) and the Seoul Metropolitan Government (2025-RISE-01-018-01).
Abstract: We examine how machine learning models predict stock returns in the Korean market. By analyzing various firm characteristics and macroeconomic variables, we find that tree-based models outperform other machine learning approaches. This finding suggests that, in data-constrained contexts, moderately complex models outperform advanced methods that require extensive datasets. Using PFI, SHAP, and LIME, we consistently identify the 36-month momentum as the key predictor. PDP, ICE, and ALE analyses reveal threshold effects of 36-month momentum that diminish at higher return levels. Our findings underscore the value of ensemble-based methods in settings characterized by short data histories and heightened volatility. This study illustrates how multimethod interpretability can yield deeper economic insights, ultimately guiding more effective investment strategies and policy decisions.
Funding: Supported by the National Key Research and Development Program for Young Scientists of China (No. 2022YFC3704000), the National Natural Science Foundation of China (No. 42275122), and the National Key Scientific and Technological Infrastructure project “Earth System Science Numerical Simulator Facility” (EarthLab).
Abstract: PM2.5 is a complex and diverse mixture that significantly impacts the environment, human health, and climate change. However, existing observation and numerical simulation techniques have limitations, such as a lack of data, high acquisition costs, and multiple uncertainties. These limitations hinder the acquisition of comprehensive information on PM2.5 chemical composition and the effective implementation of refined air pollution protection and control strategies. In this study, we developed an optimal deep learning model to obtain hourly mass concentrations of key PM2.5 chemical components without complex chemical analysis. The model was trained using a randomly partitioned multivariate dataset arranged in chronological order, including atmospheric state indicators, which previous studies did not consider. Our results showed that the correlation coefficients of key chemical components were no less than 0.96, and the root mean square errors ranged from 0.20 to 2.11 μg/m³ for the entire process (training and testing combined). The model accurately captured the temporal characteristics of key chemical components, outperforming typical machine learning models, previous studies, and global reanalysis datasets (such as the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) and the Copernicus Atmosphere Monitoring Service ReAnalysis (CAMSRA)). We also quantified feature importance using the random forest model, which showed that PM2.5, PM1, visibility, and temperature were the most influential variables for key chemical components. In conclusion, this study presents a practical approach to accurately obtaining chemical composition information that can contribute to filling in missing data and to improving air pollution monitoring and source identification. This approach has the potential to enhance air pollution control strategies and promote public health and environmental sustainability.
Funding: Supported by the Fine Particle Research Initiative in East Asia Considering National Differences Project through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (No. NRF-2023M3G1A1090660), and by a grant from the National Institute of Environmental Research (NIER), funded by the Ministry of Environment of the Republic of Korea (No. NIER-2023-04-02-056).
Abstract: PM1.0, particulate matter with an aerodynamic diameter smaller than 1.0 μm, can adversely affect human health. However, far fewer stations in South Korea can measure PM1.0 concentrations in real time than PM2.5 and PM10 concentrations (only 9 locations for PM1.0 vs. 623 locations for PM2.5 or PM10), making it impossible to conduct a nationwide health risk analysis of PM1.0. Thus, this study aimed to develop a PM1.0 prediction model using a random forest algorithm based on PM1.0 data from the nine measurement stations and various environmental input factors. Cross-validation, in which the model was trained on eight stations and tested on the remaining station, achieved an average R² of 0.913. The high R² value achieved under mutually exclusive training and test locations can be ascribed to the fact that all the locations had similar relationships between PM1.0 and the input factors, which were captured by our model. Moreover, feature importance analysis showed that PM2.5 and PM10 concentrations were the two most important input features for predicting PM1.0 concentration. Finally, the model was used to estimate PM1.0 concentrations at 623 locations where input factors such as PM2.5 and PM10 are available. Based on the augmented profile, we identified Seoul and Ansan as PM1.0 concentration hotspots. These regions are large cities or centers of anthropogenic and industrial activities. The proposed model and the augmented PM1.0 profiles can be used in large epidemiological studies to understand the health impacts of PM1.0.
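The cross-validation scheme described here, training on eight stations and testing on the remaining one, can be sketched as a leave-one-station-out split. The station labels and the four samples per station below are hypothetical placeholders, not the study's data.

```python
def leave_one_station_out(station_ids):
    """Yield (held_out, train_idx, test_idx), holding each station out once."""
    for held_out in sorted(set(station_ids)):
        train_idx = [i for i, s in enumerate(station_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(station_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical labels: 9 stations with 4 samples each
station_ids = [f"station_{k}" for k in range(9) for _ in range(4)]
folds = list(leave_one_station_out(station_ids))

for held_out, train_idx, test_idx in folds:
    # A model would be fitted on train_idx and scored on test_idx here;
    # averaging the per-fold scores gives the cross-validated metric.
    print(held_out, len(train_idx), len(test_idx))
```

Splitting by station rather than by row keeps every test sample from a location the model never saw, which is what makes the reported score informative about unmonitored sites.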
Funding: Funded by the University Transportation Center for Underground Transportation Infrastructure (UTC-UTI) at the Colorado School of Mines under Grant No. 69A3551747118 from the US Department of Transportation (DOT).
Abstract: Prediction of tunneling-induced ground settlements is an essential task, particularly for tunneling in urban settings. Ground settlements should be kept within a tolerable threshold to avoid damage to aboveground structures. Machine learning (ML) methods are becoming popular in many fields, including tunneling and underground excavations, as a powerful learning and predicting technique. However, the datasets collected from a tunneling project are usually small from the perspective of applying ML methods. Can ML algorithms effectively predict tunneling-induced ground settlements when the available datasets are small? In this study, seven ML methods are utilized to predict tunneling-induced ground settlement using 14 contributing factors measured before or during tunnel excavation. These methods include multiple linear regression (MLR), decision tree (DT), random forest (RF), gradient boosting (GB), support vector regression (SVR), back-propagation neural network (BPNN), and permutation importance-based BPNN (PI-BPNN) models. All methods except BPNN and PI-BPNN are shallow-structure ML methods. The effectiveness of these seven ML approaches on small datasets is evaluated using model accuracy and stability. Model accuracy is measured by the coefficient of determination (R²) on the training and testing datasets, and the stability of a learning algorithm indicates robust predictive performance. In addition, the quantile error (QE) criterion is introduced to assess model predictive performance considering underpredictions and overpredictions. Our study reveals that the RF algorithm outperforms all the other models, with the highest prediction accuracy (0.9) and stability (3.02 × 10⁻²⁷). Deep-structure ML models do not perform well on small datasets, with relatively low model accuracy (0.59) and stability (5.76). The PI-BPNN architecture is proposed and designed for small datasets, showing better performance than the typical BPNN. Six important contributing factors of ground settlements are identified: tunnel depth, the distance between the tunnel face and surface monitoring points (DTM), weighted average soil compressibility modulus (ACM), grouting pressure, penetration rate, and thrust force.
Funding: The Strategic Priority Research Program of the Chinese Academy of Sciences, No. XDA20040202; the Second Tibetan Plateau Scientific Expedition and Research Program (STEP), No. 2019QZKK0603.
Abstract: Alluvial fans have become an important land resource in the Qinghai-Tibet Plateau as human activities expand. However, the factors governing alluvial fan development are poorly understood. According to our previous investigation and research, approximately 826 alluvial fans exist in the Lhasa River Basin (LRB). The main purpose of this work is to identify the main influencing factors using machine learning. A development index (Di) of an alluvial fan was created by combining its area, perimeter, height, and gradient. Using 72% of the data, which include Di and 11 types of environmental parameters of the catchment matching each alluvial fan, 10 commonly used machine learning algorithms were trained to build models. A further 18% of the data were used to validate the models, and the remaining 10% were used to test model accuracy. The feature importance of the model was used to illustrate the significance of the 11 types of environmental parameters to Di. The primary modelling results showed that the accuracy (R²) of the ensemble models, including Gradient Boost Decision Tree, Random Forest, and XGBoost, is not less than 0.5. The accuracy of the Gradient Boost Decision Tree and XGBoost improved after grid search, with R² values of 0.782 and 0.870, respectively. XGBoost was selected as the final model owing to its optimal accuracy and generalisation ability at the sites closest to the LRB. Morphology parameters are the main factors in alluvial fan development, with a cumulative relative feature importance of 74.60% in XGBoost. The final model is expected to achieve better accuracy and generalisation ability once training samples from other regions are added.
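The 72% / 18% / 10% partition above is what results from first holding out 10% for testing and then splitting the remaining 90% in an 80:20 ratio (0.9 × 0.8 = 0.72, 0.9 × 0.2 = 0.18), although the abstract does not state that this is how the split was made. A minimal sketch, assuming a simple random shuffle and a sample count of 826 taken from the approximate number of fans quoted above:

```python
import random


def three_way_split(n_samples, train=0.72, val=0.18, seed=0):
    """Shuffle sample indices and cut them into train / validation / test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]  # whatever remains, about 10%
    return train_idx, val_idx, test_idx


# 826 is an assumed sample count for illustration
train_idx, val_idx, test_idx = three_way_split(826)
```

With 826 samples this yields 595 training, 149 validation, and 82 test indices; the actual counts used in the study are not given in the abstract.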
Funding: The National Natural Science Foundation of China (No. 51808056), the Hunan Provincial Natural Science Foundation of China (No. 2020JJ5583), the Research Foundation of Education Bureau of Hunan Province (No. 19B012), and the China Scholarship Council (No. 201808430232).
Abstract: Accurately estimating the interfacial bond capacity of the near-surface mounted (NSM) carbon fiber-reinforced polymer (CFRP) to concrete joint is a fundamental task in the strengthening and retrofit of existing reinforced concrete (RC) structures. The machine learning (ML) approach may provide an alternative to the commonly used semi-empirical or semi-analytical methods. Therefore, in this work we have developed a predictive model based on an artificial neural network (ANN) approach, i.e., a back propagation neural network (BPNN), to map the complex data pattern obtained from an NSM CFRP to concrete joint. It involves a set of nine material and geometric input parameters and one output value. Moreover, by employing the neural interpretation diagram (NID) technique, the BPNN model becomes interpretable, as the influence of each input variable on the model can be tracked and quantified based on the connection weights of the neural network. An extensive database of 163 pull-out testing samples, collected from the authors’ research group and from published results in the literature, is used to train and verify the ANN. Our results show that the prediction given by the BPNN model agrees well with the experimental data and yields a coefficient of determination of 0.957 on the whole database. After removing one non-significant feature, the BPNN becomes even more computationally efficient and accurate. In addition, compared with the existing semi-analytical model, the ANN-based approach gives a more accurate estimation. Therefore, the proposed ML method may be a promising alternative for structural engineers predicting the bond strength of NSM CFRP to concrete joints.
Funding: Financially supported by the National Key Research and Development Program of China (2022YFB3706800, 2020YFB1710100) and the National Natural Science Foundation of China (51821001, 52090042, 52074183).
Abstract: The complexity of the sand-casting process, combined with the interactions between process parameters, makes it difficult to control casting quality, resulting in a high scrap rate. A strategy based on a data-driven model was proposed to reduce casting defects and improve production efficiency; it comprises a random forest (RF) classification model, feature importance analysis, and process parameter optimization with Monte Carlo simulation. The collected data, covering four types of defects and the corresponding process parameters, were used to construct the RF model. Classification results show a recall rate above 90% for all categories. The Gini index was used to assess the importance of the process parameters in the formation of various defects in the RF model. Finally, the classification model was applied to different production conditions for quality prediction. In the case of process parameter optimization for gas porosity defects, this model serves as the experimental process in the Monte Carlo method to estimate a better temperature distribution. When applied in the factory, the prediction model greatly improved the efficiency of defect detection. Results show that the scrap rate decreased from 10.16% to 6.68%.
Funding: Shanghai Philosophy and Social Science Program, China (No. 2019BGL004).
Abstract: Time delay is a common phenomenon in complex industrial processes and dynamical systems and is a difficult problem to deal with in industrial control systems, including in the textile field. Accurate identification of the time delay can greatly improve the efficiency of designing industrial process control systems. Time delay identification methods based on mathematical modeling require prior knowledge of the structural information of the model, especially for nonlinear systems. Neural network-based identification methods can predict the time delay of the system but cannot accurately obtain the specific parameters of the time delay. Benefiting from the interpretability of machine learning, a novel method for delay identification based on an interpretable regression decision tree is proposed. Using the self-explanatory analysis of the decision tree model, the parameters with the highest feature importance are obtained to identify the time delay of the system. Excellent results are obtained on simulation data from linear and nonlinear control systems, and the time delay of the systems can be accurately identified.
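The idea of reading a time delay off the feature importance of lagged inputs can be sketched with a deliberately simplified stand-in: instead of a full regression decision tree, the code below scores each candidate lag by the squared-error reduction of the best single split on that lagged input (the criterion a regression tree uses at its root) and takes the lag with the largest reduction. The simulated system, the function names, and `max_lag` are assumptions for illustration, not the paper's method or data.

```python
import random


def best_split_gain(x, y):
    """Largest drop in squared error from one threshold split on x."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_sum = sum(v for _, v in pairs)
    total_sq = sum(v * v for _, v in pairs)
    sse_parent = total_sq - total_sum ** 2 / n
    best_gain = 0.0
    left_sum = left_sq = 0.0
    for i in range(1, n):
        value = pairs[i - 1][1]
        left_sum += value
        left_sq += value * value
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical input values
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        sse_children = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        best_gain = max(best_gain, sse_parent - sse_children)
    return best_gain


def identify_delay(u, y, max_lag):
    """Score every lag d by how well u[t - d] splits y[t]; return the best lag."""
    n = len(y)
    gains = {
        lag: best_split_gain(u[max_lag - lag:n - lag], y[max_lag:])
        for lag in range(max_lag + 1)
    }
    return max(gains, key=gains.get), gains


# Hypothetical simulated system: the output follows the input after 3 steps
rng = random.Random(0)
true_delay = 3
u = [rng.uniform(-1.0, 1.0) for _ in range(300)]
y = [0.0] * true_delay + [
    2.0 * u[t - true_delay] + rng.gauss(0.0, 0.05) for t in range(true_delay, 300)
]

estimated_delay, gains = identify_delay(u, y, max_lag=8)
```

This sketch relies on the input being close to white noise; with a strongly autocorrelated input, neighbouring lags would score similarly and a deeper tree or additional care would be needed.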
Abstract: The prediction of liquefaction-induced lateral spreading/displacement (Dh) is a challenging task for civil/geotechnical engineers. In this study, a new approach is proposed to predict Dh using gene expression programming (GEP). Based on statistical reasoning, individual models were developed for two topographies: free-face and gently sloping ground. Along with a comparison with conventional approaches for predicting Dh, four additional regression-based soft computing models, i.e., Gaussian process regression (GPR), relevance vector machine (RVM), sequential minimal optimization regression (SMOR), and M5-tree, were developed and compared with the GEP model. The results indicate that the GEP models predict Dh with less bias, as evidenced by the root mean square error (RMSE) and mean absolute error (MAE) for training (i.e., 1.092 and 0.815; and 0.643 and 0.526) and for testing (i.e., 0.89 and 0.705; and 0.773 and 0.573) in free-face and gently sloping ground topographies, respectively. The overall performance for the free-face topography was ranked as follows: GEP > RVM > M5-tree > GPR > SMOR, with total scores of 40, 32, 24, 15, and 10, respectively. For the gently sloping condition, the performance was ranked as follows: GEP > RVM > GPR > M5-tree > SMOR, with total scores of 40, 32, 21, 19, and 8, respectively. Finally, the results of the sensitivity analysis showed that for both free-face and gently sloping ground, the liquefiable layer thickness (T₁₅) was the major parameter, with percentage deterioration (%D) values of 99.15 and 90.72, respectively.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 42177291) and the Innovation Capability Support Program of Shaanxi Province (2023-JC-JQ-25 and 2021KJXX-11).
Abstract: As an essential property of frozen soils, the change of unfrozen water content (UWC) with temperature, namely the soil-freezing characteristic curve (SFCC), plays a significant role in numerous physical, hydraulic, and mechanical processes in cold regions, including heat and water transfer within soils and at the land–atmosphere interface, frost heave and thaw settlement, and the simulation of coupled thermo-hydro-mechanical interactions. Although various models have been proposed to estimate the SFCC, their applicability remains limited because they were derived from specific soil types, soil treatments, and test devices. Accordingly, this study proposes a novel data-driven model to predict the SFCC using an extreme Gradient Boosting (XGBoost) model. A systematic database of SFCCs of frozen soils, compiled from extensive experimental investigations using various testing methods, was used to train the XGBoost model. The soil-freezing characteristic curves (SFCC, UWC as a function of temperature) predicted by the well-trained XGBoost model were compared with the original experimental data and three conventional models. The results demonstrate the superior performance of the proposed XGBoost model over the traditional models in predicting the SFCC. This study provides valuable insights for future investigations of the SFCC of frozen soils.
Abstract: Breast cancer is one of the most common cancers among women worldwide, with more than two million new cases every year. The disease is associated with numerous clinical and genetic characteristics. In recent years, machine learning has been increasingly applied in medicine, including for predicting the risk of malignant tumors such as breast cancer. Based on clinical and targeted sequencing data from 1980 primary breast cancer samples, this article aimed to analyze these data and predict living conditions after breast cancer. After data engineering, feature selection, and a comparison of machine learning methods, the light gradient boosting machine model was found to perform best after hyperparameter tuning (precision = 0.818, recall = 0.816, F1 score = 0.817, ROC-AUC = 0.867). The top five determinants were the clinical features age at diagnosis, Nottingham Prognostic Index, and cohort, and the genetic features rheb and nr3c1. The study sheds light on the rational allocation of medical resources and provides insights into the early prevention, diagnosis, and treatment of breast cancer based on the identified clinical and genetic risk factors.
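The reported metrics can be checked against each other with the standard definition of F1 as the harmonic mean of precision and recall. This is only a consistency sketch: it assumes the quoted precision and recall are single aggregate values, and if they are averages over classes the relationship holds only approximately.

```python
def f1_from_precision_recall(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract
f1 = f1_from_precision_recall(0.818, 0.816)
print(round(f1, 3))  # 0.817, matching the reported F1 score
```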
Abstract: The energy issue is of strategic importance, influencing China’s overall economic and social development, and needs systematic planning and far-sighted deliberation. At present, the revolution in energy technology is advancing rapidly. Global innovation in energy technology has entered a highly dynamic period featuring multi-point breakthroughs,