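Abstract
Objective To investigate the risk factors for colorectal adenomatous polyps, to construct risk prediction models based on decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), and support vector machine (SVM) algorithms, and to evaluate the performance of each model, so as to provide a basis for the early diagnosis and early intervention of colorectal adenomatous polyps. Methods Clinical data (31 indicators in total) were retrospectively collected from patients who underwent colonoscopy at the First Hospital of Lanzhou University between December 2019 and May 2024. Univariate and multivariate logistic regression analyses were used to screen for risk factors for colorectal adenomatous polyps. The dataset was randomly split 8:2 into a training set and a test set, and the selected important variables were entered into the DT, RF, XGBoost, and SVM algorithms to build the models. Sensitivity, specificity, accuracy, AUC, and related metrics were calculated for each model to identify the optimal model, and the importance of the included variables was evaluated. Results Univariate and multivariate logistic regression analyses indicated that age, smoking history, drinking history, history of constipation, fatty liver, polyp diameter, and number of polyps were independent risk factors for colorectal adenomatous polyps. Prediction models were built with four machine learning algorithms (DT, RF, SVM, and XGBoost); the training-set AUC values were 0.830, 0.828, 0.765, and 0.820, respectively, and the test-set AUC values were 0.724, 0.717, 0.705, and 0.725, respectively. On the test set, XGBoost had the highest AUC (0.725), followed by DT (0.724). The DeLong test showed no statistically significant differences in AUC among the models (P>0.05). DT and XGBoost showed consistent classification performance on the training and test sets in terms of sensitivity, specificity, and accuracy. Feature importance evaluation showed that age was the most important variable, followed by the number of polyps. Conclusion Based on machine learning, this study established four risk prediction models for colorectal adenomatous polyps; DT and XGBoost were both the optimal models for predicting the occurrence of colorectal adenomatous polyps.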
Objective To investigate the risk factors for colorectal adenomatous polyps (CAP) and develop four machine learning-based risk prediction models, namely decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), and support vector machine (SVM), for early diagnosis and intervention. Methods A retrospective analysis was conducted on clinical data collected from patients undergoing colonoscopy at the First Hospital of Lanzhou University from Dec. 2019 to May 2023. A total of 31 variables were included. Univariate and multivariate logistic regression analyses were performed to identify independent risk factors. The dataset was randomly divided into training and testing sets (8:2). Key variables were selected to construct DT, RF, XGBoost, and SVM models. Model performance was assessed using sensitivity, specificity, accuracy, and AUC. Variable importance was evaluated. Results Univariate and multivariate logistic regression analyses indicated that age, smoking history, alcohol consumption, history of constipation, fatty liver, polyp diameter, and number of polyps were independent risk factors for colorectal adenomatous polyps. Four machine learning algorithms (DT, RF, SVM, and XGBoost) were employed to construct prediction models. The AUC values for the training set were 0.830, 0.828, 0.765, and 0.820, respectively, and the AUC values for the test set were 0.724, 0.717, 0.705, and 0.725, respectively. In the test set, XGBoost achieved the highest AUC (0.725), followed by DT (0.724). DeLong's test showed no statistically significant differences in AUC values among the models (P>0.05). Both the DT and XGBoost models demonstrated consistent classification performance in terms of sensitivity, specificity, and accuracy across the training and test sets. Feature importance analysis revealed age as the most influential factor, followed by the number of polyps. Conclusion Four machine learning models for CAP risk prediction were successfully established. DT and XGBoost models demonstrated superior performance and were the most effective predictors.
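The evaluation steps named in the abstract (a random 8:2 split, then sensitivity, specificity, accuracy, and AUC) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the labels and scores are invented toy values, not study data; the function names `split_8_2`, `confusion_metrics`, and `auc` are hypothetical; and the 0.5 classification threshold is an assumption, since the abstract does not state which cutoff the authors used.

```python
import random

def split_8_2(rows, seed=0):
    """Randomly split rows into a training part (80%) and a test part (20%)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

def confusion_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, and accuracy at an assumed score threshold."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

def auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels (1 = adenomatous polyp) and model scores, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

train_rows, test_rows = split_8_2(list(range(10)))
metrics = confusion_metrics(y_true, y_score)
print(len(train_rows), len(test_rows), metrics, auc(y_true, y_score))
```

The pairwise AUC used here is equivalent to the area under the ROC curve but runs in quadratic time, so it suits small illustrations rather than large cohorts.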
Authors
李倩倩
徐丽婷
许雪玲
赵悟晴
李世晓
路红
李强
LI Qianqian; XU Liting; XU Xueling; ZHAO Wuqing; LI Shixiao; LU Hong; LI Qiang (The First Clinical Medical College of Lanzhou University, Lanzhou 730000; Department of Gastroenterology, the First Hospital of Lanzhou University, China)
Source
Chinese Journal of Gastroenterology and Hepatology (《胃肠病学和肝病学杂志》)
2025, Issue 12, pp. 1738-1745 (8 pages)
Keywords
Colorectal adenomatous polyps
Risk factors
Prediction models
Machine learning