Funding: Supported by the National Natural Science Foundation of China (Nos. 12475340 and 12375350), the Special Branch project of South Taihu Lake, and the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y202456326).
Abstract: Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning (ML) models. In this study, a regression-based missing-data imputation method using the light gradient boosting machine (LGBM) algorithm was employed to impute more than 60% of the missing data, establishing a radionuclide diffusion dataset containing 16 input features and 813 instances. The effective diffusion coefficient (D_e) was predicted using ten ML models. The predictive accuracy of the ensemble meta-models, namely LGBM-extreme gradient boosting (LGBM-XGB) and LGBM-categorical boosting (LGBM-CatB), surpassed that of the other ML models, with R² values of 0.94. The models were then applied to predict the D_e values, measured using a through-diffusion method, of EuEDTA⁻ and HCrO₄⁻ in saturated compacted bentonites at compaction densities ranging from 1200 to 1800 kg/m³. The generalization ability of the LGBM-XGB model surpassed that of the LGBM-CatB model in predicting the D_e of HCrO₄⁻. Shapley additive explanations identified total porosity as the most significant influencing factor. Additionally, partial dependence plot analysis yielded clearer results in the univariate correlation analysis. This study provides a regression imputation technique to refine radionuclide diffusion datasets, offering deeper insight into the diffusion mechanism of radionuclides and supporting the safety assessment of the geological disposal of high-level radioactive waste.
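As a rough illustration of the regression-based imputation idea described in the abstract, the sketch below fits a regressor on the rows where a feature is observed and uses it to fill the rows where it is missing. It is not the authors' implementation: ordinary least squares on a single predictor stands in for the LGBM regressor so that the example runs without third-party packages, and the function names (`fit_line`, `impute_by_regression`) and the toy values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_by_regression(
    predictor: Sequence[float], target: Sequence[Optional[float]]
) -> List[float]:
    """Fill None entries of `target` with predictions made from `predictor`."""
    # Train only on rows where the target value was actually observed.
    observed = [(x, y) for x, y in zip(predictor, target) if y is not None]
    slope, intercept = fit_line(
        [x for x, _ in observed], [y for _, y in observed]
    )
    # Keep observed values untouched; predict only where the value is missing.
    return [
        y if y is not None else slope * x + intercept
        for x, y in zip(predictor, target)
    ]


# Hypothetical toy data, not taken from the study.
porosity = [0.30, 0.35, 0.40, 0.45, 0.50]
log_de = [-11.0, None, -10.6, None, -10.2]
filled = impute_by_regression(porosity, log_de)
```

In a setting closer to the study, the single-predictor line would be replaced by a gradient boosting regressor trained on all complete features, with each incomplete column imputed in turn.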