Abstract
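Objective To build an optimal machine-learning-based diagnostic assessment model for cirrhosis in patients with chronic viral hepatitis B, providing a reference for clinicians to identify high-risk individuals and for the early prevention of cirrhosis in these patients. Methods This study retrospectively collected 420 patients treated in the outpatient and inpatient departments of the Department of Gastroenterology, Affiliated Hospital of North China University of Science and Technology, between January 2020 and January 2024: 200 with a confirmed diagnosis of chronic viral hepatitis B complicated by cirrhosis and 220 with chronic viral hepatitis B without cirrhosis. All patients were randomly divided into a training set and a test set at a 7:3 ratio. A further 150 patients with chronic viral hepatitis B, with or without cirrhosis, who attended the outpatient or inpatient departments of the same hospital between January and December 2019 were retrospectively collected as an external validation set. General data (such as age and sex) and laboratory data (such as routine blood tests and blood biochemistry) were compared between the cirrhosis and non-cirrhosis groups, and multivariate logistic regression was used to identify independent influencing factors for cirrhosis in chronic viral hepatitis B. XGBoost, logistic regression, LightGBM, KNN, and SVM models were built on the training set and validated on the validation set. Model performance was evaluated with receiver operating characteristic (ROC) curves, area under the curve (AUC), accuracy, F1 score, and other metrics, and the optimal model was selected. The Shapley Additive exPlanations (SHAP) method was used to present the optimal model. Results Total bilirubin [OR=1.046, 95%CI: 1.006-1.085], platelet-to-red blood cell distribution width ratio (RPR) [OR=1.417, 95%CI: 1.250-1.666], and HBV DNA load [OR=15.855, 95%CI: 4.032-25.485] were independent risk factors for cirrhosis in patients with chronic hepatitis B (P<0.05); hemoglobin [OR=0.954, 95%CI: 0.927-0.978] and antiviral treatment [OR=0.014, 95%CI: 0.002-0.056] were protective factors against cirrhosis (P<0.05). XGBoost, logistic regression, LightGBM, KNN, and SVM models were built on these independent influencing factors. All models had high AUC values, and the logistic regression model predicted best. Its sensitivity, specificity, accuracy, F1 score, and AUC were 0.929, 0.970, 0.947, 95.1% (95%CI: 93.3%-96.9%), and 0.988, respectively. Internal validation gave an AUC of 0.988 (95%CI: 0.983-0.993) and an accuracy of 96.3% (95%CI: 94.8%-97.8%); the external validation set gave an AUC of 0.979 (95%CI: 0.967-0.991) and an accuracy of 95.5% (95%CI: 93.8%-97.2%). These results indicate that the optimal model has good predictive performance. Conclusion The machine-learning-based XGBoost, logistic regression, LightGBM, KNN, and SVM models all showed good discriminative performance; the logistic regression model had the best predictive performance and also generalized well.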
Objective To construct prediction models for liver cirrhosis in patients with chronic viral hepatitis B using machine learning methods, compare their performance, and select the best-performing model, so as to provide a reference for clinicians to identify high-risk individuals and take targeted preventive measures. Methods This study retrospectively collected data from 200 patients diagnosed with chronic hepatitis B (CHB) complicated by cirrhosis (cirrhosis group) and 220 patients diagnosed with CHB without cirrhosis (non-cirrhosis group) at the Department of Gastroenterology, Affiliated Hospital of North China University of Science and Technology, between January 2020 and January 2024. Both groups were randomly split into training and testing sets at a 7:3 ratio. The clinical and laboratory data of both groups were analyzed statistically. Univariate analysis was first performed to identify statistically significant factors, followed by multivariate logistic regression analysis to determine independent risk factors for cirrhosis in patients with CHB. Five machine learning models (XGBoost, logistic regression, LightGBM, KNN, and SVM) were constructed using the training set. Their performance was evaluated with receiver operating characteristic (ROC) curves, area under the curve (AUC), accuracy, and F1 score, and the best-performing model was selected for further analysis. Calibration curves and decision curves were used to assess the generalization ability of the model, and the bootstrap method was used for internal validation to improve robustness. In addition, a separate external validation set of 150 patients diagnosed with CHB, with or without cirrhosis, at the same hospital between January and December 2019 was retrospectively collected. The optimal model's performance was evaluated in the external validation set using AUC and accuracy. Finally, the contribution of each predictor to the optimal model was assessed using Shapley Additive exPlanations (SHAP) values. Results Total bilirubin [OR=1.046, 95%CI: 1.006-1.085], platelet-to-red blood cell distribution width ratio (RPR) [OR=1.417, 95%CI: 1.250-1.666], and HBV DNA load [OR=15.855, 95%CI: 4.032-25.485] were identified as independent risk factors for the development of cirrhosis in patients with CHB (P<0.05). In contrast, hemoglobin [OR=0.954, 95%CI: 0.927-0.978] and antiviral treatment [OR=0.014, 95%CI: 0.002-0.056] were protective factors against cirrhosis (P<0.05). XGBoost, logistic regression, LightGBM, KNN, and SVM models were built on these independent factors. All models demonstrated high AUC values, with the logistic regression model performing best. The logistic regression model achieved a sensitivity of 0.929, a specificity of 0.970, an accuracy of 0.947, an F1 score of 95.1% (95%CI: 93.3%-96.9%), and an AUC of 0.988. Internal validation yielded an AUC of 0.988 (95%CI: 0.983-0.993) with an accuracy of 96.3% (95%CI: 94.8%-97.8%), while external validation yielded an AUC of 0.979 (95%CI: 0.967-0.991) and an accuracy of 95.5% (95%CI: 93.8%-97.2%). These results indicate that the optimal model has excellent predictive performance. Conclusion The machine-learning-based XGBoost, logistic regression, LightGBM, KNN, and SVM models all show good discriminative performance, with the logistic regression model having the best predictive performance and good generalization ability.
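The evaluation metrics named in the abstract (sensitivity, specificity, accuracy, F1 score, AUC) and a bootstrap confidence interval can be illustrated with a minimal standard-library sketch. This is not the authors' code: the labels and scores below are invented toy values, the function names are hypothetical, and the percentile bootstrap is an assumption, since the abstract does not say how its intervals were computed.

```python
import random

def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and F1 from binary labels (1 = cirrhosis)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = [y_true[i] for i in idx]
        if len(set(resampled)) < 2:
            continue  # AUC is undefined when a resample contains one class only
        stats.append(auc(resampled, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
ci_low, ci_high = bootstrap_auc_ci(y_true, scores)
print(metrics, auc_value, (ci_low, ci_high))
```

On real data the scores would come from the fitted model's predicted probabilities on the test or external validation set; the 0.5 threshold used here is only a placeholder.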
Authors
张国顺
洪国议
吴童童
梅冬雪
辛英瑛
韩超
蒋美钰
王素颖
ZHANG Guoshun; HONG Guoyi; WU Tongtong; MEI Dongxue; XIN Yingying; HAN Chao; JIANG Meiyu; WANG Suying (Department of Gastroenterology, Affiliated Hospital of North China University of Science and Technology, Tangshan 063000, China)
Source
《中国煤炭工业医学杂志》
2024, Issue 6, pp. 617-625 (9 pages)
Chinese Journal of Coal Industry Medicine
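Funding
Hebei Province Medical Science Research Project (No. 20240401)
Wang Bao-En Liver Fibrosis Research Fund of the Chinese Foundation for Hepatitis Prevention and Control (No. 2025030)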