Abstract
To improve the prediction accuracy of protein-HEME binding residues, this study optimizes the built-in parameters of the Gradient Boosting Machine (GBM) algorithm: the minimum number of training samples per node (minobsinnode), the tree depth (depth), the number of iterations (n.trees), and the learning rate (shrinkage). Drawing on experience from data science competitions, the perturbation weights and parameter configuration were tuned to avoid overfitting and achieve higher prediction precision. On this basis, the sampling method was improved to address class imbalance, and the model was further optimized and evaluated using five-fold cross-validation and an independent test.
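The evaluation loop outlined in the abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the abstract does not say which sampling method was used, so random undersampling of the majority class stands in as an assumption, and the grid values and the function names (`param_combinations`, `stratified_folds`, `undersample`, `cv_splits`) are hypothetical. The grid keys loosely mirror the four parameters named above.

```python
import itertools
import random

# Hypothetical search grid; the values are illustrative, not the paper's.
PARAM_GRID = {
    "n_trees": [500, 1000],
    "depth": [3, 5],
    "shrinkage": [0.01, 0.1],
    "minobsinnode": [10, 20],
}


def param_combinations(grid):
    """Yield one dict per combination of grid values."""
    names = list(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        yield dict(zip(names, values))


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, keeping the class ratio per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def undersample(indices, labels, seed=0):
    """Randomly drop samples until every class matches the smallest one.

    Assumption: simple random undersampling; the paper's method may differ.
    """
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(v) for v in by_class.values())
    kept = []
    for cls in sorted(by_class):
        kept.extend(rng.sample(by_class[cls], n_min))
    return sorted(kept)


def cv_splits(labels, k=5, seed=0):
    """Yield (balanced_train_indices, test_indices) for each fold."""
    folds = stratified_folds(labels, k, seed)
    for f in range(k):
        test = sorted(folds[f])
        train = [i for g in range(k) if g != f for i in folds[g]]
        # Only the training part is rebalanced; the test fold keeps the
        # original class ratio so the scores reflect realistic imbalance.
        yield undersample(train, labels, seed + f), test
```

In use, each dict from `param_combinations(PARAM_GRID)` would be scored on every split from `cv_splits(labels)`, and the best-scoring configuration would then be checked once on the independent test set.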
Authors
GUO Guodong (郭国栋); PAN Xingyu (潘星宇); BAI Yunxia (白云霞); BAO Mengxue (包梦雪); LI Caiyan (李彩艳)
School of Computer Science and Technology, Baotou Medical College, Baotou, Inner Mongolia 014000, China; School of Humanities, Baotou Medical College, Baotou, Inner Mongolia 014000, China
Source
Highlights of Sciencepaper Online (《中国科技论文在线精品论文》)
2025, No. 2, pp. 198-200 (3 pages)
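Funding
Baotou Medical College Scientific Research Fund (Qin Wenbin Fund Project) (BYJJ-QWB-202306)
Inner Mongolia College Students' Innovation and Entrepreneurship Training Program (No. S202510130021)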