Abstract
Identifying druggable proteins, which are capable of binding therapeutic compounds, remains a critical and resource-intensive challenge in drug discovery. To address this, we propose CEL-IDP (Comparison of Ensemble Learning Methods for Identification of Druggable Proteins), a computational framework that combines three feature extraction methods, Dipeptide Deviation from Expected Mean (DDE), Enhanced Amino Acid Composition (EAAC), and Enhanced Grouped Amino Acid Composition (EGAAC), with ensemble learning strategies (Bagging, Boosting, Stacking) to classify druggable proteins from sequence data. DDE captures dipeptide frequency deviations, EAAC encodes positional amino acid information, and EGAAC groups residues by physicochemical properties to generate discriminative feature vectors. These features were analyzed using ensemble models to overcome the limitations of single classifiers. EGAAC outperformed DDE and EAAC, with Random Forest (Bagging) and XGBoost (Boosting) achieving the highest accuracy of 71.66%, demonstrating superior performance in capturing critical biochemical patterns. Stacking showed intermediate results (68.33%), while EAAC- and DDE-based models yielded lower accuracies (56.66%–66.87%). CEL-IDP streamlines large-scale druggability prediction, reduces reliance on costly experimental screening, and aligns with global initiatives like Target 2035 to expand actionable drug targets. This work advances machine learning-driven drug discovery by systematizing feature engineering and ensemble model optimization, providing a scalable workflow to accelerate target identification and validation.
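As an illustration of the EGAAC idea described above, the following is a minimal sketch, not the authors' implementation. It assumes the five-class residue grouping commonly used for EGAAC (aliphatic, aromatic, positively charged, negatively charged, uncharged) and a sliding window of 5 residues; the abstract states neither, so both are assumptions, as are the names `GROUPS` and `egaac`. Because the vector length depends on sequence length, the sketch also assumes sequences are brought to a common length before a classifier is trained.

```python
# Assumed residue grouping (a common five-class scheme for EGAAC);
# the exact grouping used by CEL-IDP is not given in the abstract.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def egaac(sequence, window=5):
    """Return group frequencies for every sliding window, flattened.

    For each window of `window` residues, the fraction of residues
    belonging to each group is appended in the order of GROUPS.
    """
    sequence = sequence.upper()
    if len(sequence) < window:
        raise ValueError("sequence is shorter than the window")
    features = []
    for start in range(len(sequence) - window + 1):
        fragment = sequence[start:start + window]
        for members in GROUPS.values():
            count = sum(1 for residue in fragment if residue in members)
            features.append(count / window)
    return features


# Toy sequence of 11 residues: 7 windows x 5 groups = 35 features.
vector = egaac("GAVLMIKRHDE", window=5)
print(len(vector))
```

Vectors of this kind could then be passed to a bagging, boosting, or stacking classifier, which is the comparison the abstract reports.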
Funding
This work was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Centre) support program (IITP-2024-RS-2024-00437191), supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).