Funding: Funded by the Beijing Natural Science Foundation-Haidian Original Innovation Joint Fund (Grant No. L222103) and the National Natural Science Foundation of China (Grant No. 72174012).
Abstract: Objectives: This study aimed to develop and validate a stroke risk prediction model based on machine learning (ML) and regional healthcare big data, and to determine whether it improves prediction performance compared with the conventional Logistic Regression (LR) model. Methods: This retrospective cohort study analyzed data from the CHinese Electronic health Records Research in Yinzhou (CHERRY) (2015–2021). We included adults aged 18–75 from the platform who had established records before 2015. Individuals with pre-existing stroke, missing key data, or excessive missingness (>30%) were excluded. Data on demographics, clinical measures, lifestyle factors, comorbidities, and family history of stroke were collected. Variable selection was performed in two stages: an initial screening via univariate analysis, followed by prioritization of variables based on clinical relevance and actionability, with a focus on modifiable factors. Stroke prediction models were developed using LR and four ML algorithms: Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Back Propagation Neural Network (BPNN). The dataset was split 7:3 into training and validation sets. Performance was assessed using receiver operating characteristic (ROC) curves, calibration, and confusion matrices, and the cutoff value for classifying risk groups was determined by Youden's index. Results: The study cohort comprised 92,172 participants with 436 incident stroke cases (incidence rate: 474/100,000 person-years). Ultimately, 13 predictor variables were included. RF achieved the highest accuracy (0.935), precision (0.923), sensitivity (recall: 0.947), and F1 score (0.935). The ML algorithms showed better predictive performance than conventional LR, with training/validation areas under the curve (AUCs) of 0.777/0.779 (LR), 0.921/0.918 (BPNN), 0.988/0.980 (RF), 0.980/0.955 (DT), and 0.962/0.958 (XGBoost). Calibration analysis revealed a better fit for DT, LR, and BPNN than for the RF and XGBoost models. Based on the RF model, which performed best, the ranking of factors in descending order of importance was: hypertension, age, diabetes, systolic blood pressure, waist circumference, high-density lipoprotein cholesterol, fasting blood glucose, physical activity, BMI, low-density lipoprotein cholesterol, total cholesterol, dietary habits, and family history of stroke. Using Youden's index as the optimal cutoff, the RF model stratified individuals into high-risk (>0.789) and low-risk (≤0.789) groups with robust discrimination. Conclusions: The ML-based prediction models outperformed conventional LR on the reported performance metrics, and RF was the best-performing model, providing an effective tool for risk stratification in primary stroke prevention in community settings.
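The cutoff step described in the abstract can be illustrated with a minimal sketch. It assumes Youden's index in its standard form, J = sensitivity + specificity - 1, and treats a score strictly above the cutoff as high-risk, matching the ">0.789" convention above. The function names and the toy labels and scores are hypothetical; they are not the study's data or code.

```python
def youden_cutoff(y_true, y_score):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is called high-risk when its score is strictly greater
    than the threshold. Candidate thresholds are the observed scores.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    best_t, best_j = None, -1.0
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s > t)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s <= t)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j


def confusion_metrics(y_true, y_score, threshold):
    """Accuracy, precision, recall and F1 from the confusion matrix."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy data: 1 = incident stroke, 0 = no stroke.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.70, 0.90]

cutoff, j = youden_cutoff(y_true, y_score)
metrics = confusion_metrics(y_true, y_score, cutoff)
print(cutoff, round(j, 3), metrics)
```

In practice the scores would be predicted probabilities from a fitted model on the validation set, and the resulting cutoff would then be used to assign high-risk and low-risk groups.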
Funding: Supported in part by National Institute on Drug Abuse (NIDA) grants K01 DA029643 and R01DA016750; National Institute on Alcohol Abuse and Alcoholism (NIAAA) grants R21 AA021380 and R21 AA020319; the National Alliance for Research on Schizophrenia and Depression (NARSAD) Award 17616 (L.Z.); and ABMRF/The Foundation for Alcohol Research (L.Z.). Funding and other support for phenotype and genotype data were provided through the National Institutes of Health (NIH) Genes, Environment and Health Initiative (GEI) (U01HG004422, U01HG004436 and U01HG004438); the GENEVA Coordinating Center (U01HG004446); the NIAAA (U10AA008401, R01AA013320, P60AA011998); the NIDA (R01DA013423); the National Cancer Institute (P01 CA089392); the NIH contract 'High throughput genotyping for studying the genetic contributions to human disease' (HHSN268200782096C); the Center for Inherited Disease Research (CIDR); and the National Center for Biotechnology Information. Genotyping was performed at the Johns Hopkins University Center for Inherited Disease Research.
Funding: Supported in part by the National Institutes of Health of the United States (Grant Nos. UL1 RR024139 to Yale Clinical and Translational Science Award, 1S10OD018034-01 to 6500 QTrap Mass Spectrometer for Yale University, 1S10RR026707-01 to 5500 QTrap Mass Spectrometer for Yale University, P30DA018343 to Yale/NIDA Neuroproteomics Center, and NIDDK-K01DK089006 awarded to JR).
Abstract: We report a significantly enhanced bioinformatics suite and database for proteomics research called Yale Protein Expression Database (YPED) that is used by investigators at more than 300 institutions worldwide. YPED meets the data management, archival, and analysis needs of high-throughput mass spectrometry-based proteomics research, ranging from a single laboratory, to a group of laboratories within and beyond an institution, to the entire proteomics community. The current version is a significant improvement over the first version in that it contains new modules for liquid chromatography–tandem mass spectrometry (LC–MS/MS) database search results, label-based and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization. In addition, we have added both peptide and protein comparative analysis tools to enable pairwise analysis of distinct peptides/proteins in each sample and of overlapping peptides/proteins between all samples in multiple datasets. We have also implemented a targeted proteomics module for automated multiple reaction monitoring (MRM)/selective reaction monitoring (SRM) assay development. We have linked YPED's database search results and both label-based and label-free fold-change analysis to the Skyline Panorama repository for online spectra visualization. In addition, we have built enhanced functionality to curate peptide identifications into an MS/MS peptide spectral library for all of our protein database search identification results.
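The comparative analysis described above (distinct peptides per sample, overlapping peptides between samples) can be thought of as set operations on identification lists. The sketch below is only an illustration of that idea under simple assumptions; it is not YPED's implementation, and the function name, sample names, and peptide strings are hypothetical.

```python
from itertools import combinations


def compare_samples(samples):
    """Compare peptide identifications across samples.

    samples: dict mapping a sample name to an iterable of peptide
    sequences. Returns a dict with:
      - "pairwise": peptides shared by each pair of samples
      - "distinct": peptides seen in one sample and no other
      - "shared_by_all": peptides present in every sample
    """
    sets = {name: set(peps) for name, peps in samples.items()}
    if not sets:
        return {"pairwise": {}, "distinct": {}, "shared_by_all": set()}

    pairwise = {
        (a, b): sets[a] & sets[b]
        for a, b in combinations(sorted(sets), 2)
    }
    distinct = {}
    for name, peps in sets.items():
        others = set().union(
            *(s for other, s in sets.items() if other != name)
        )
        distinct[name] = peps - others
    shared_by_all = set.intersection(*sets.values())
    return {
        "pairwise": pairwise,
        "distinct": distinct,
        "shared_by_all": shared_by_all,
    }


# Hypothetical identifications for three samples.
result = compare_samples({
    "A": ["PEPTIDEK", "LSEQR", "AAGK"],
    "B": ["PEPTIDEK", "LSEQR", "TTVR"],
    "C": ["PEPTIDEK", "GGHK"],
})
print(result["shared_by_all"])
print(result["distinct"])
```

The same structure would apply at the protein level by substituting protein identifiers for peptide sequences.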