Funding: Supported by the Biological Breeding-Major Projects (2023ZD04067), the Hubei Provincial Natural Science Foundation of China (2023AFB832), the Guizhou Provincial Basic Research Program (Natural Science) (MS[2025]096), and the Major Project of Hubei Hongshan Laboratory (2022HSZD031).
Abstract: Efficient and accurate identification of candidate causal genes within quantitative trait loci (QTL) is a significant challenge in genetic research. Conventional linkage analysis methods often require substantial time and resources to identify causal genes. This paper proposes QTG-LGBM, a method for prioritizing causal genes in maize based on the LightGBM algorithm. QTG-LGBM dynamically adjusts gene weights and sample proportions during training to mitigate the effects of class imbalance, and it introduces a regularization term to prevent overfitting on small-sample datasets. Experimental results on the maize traits plant height (PH), flowering time (FT), and tassel branch number (TBN) demonstrated that QTG-LGBM outperforms the commonly used methods QTG-Finder, GBDT, XGBoost, Bernoulli NB, SVM, CNN, and ensemble learning. We validated the generalization of QTG-LGBM using Arabidopsis, rice, Setaria, and sorghum. We also applied QTG-LGBM to reported QTL affecting PH, FT, and TBN in maize and FT in Arabidopsis, rice, and sorghum, together with the known causal genes within those QTL. Among the top 20% of ranked genes, QTG-LGBM achieved a significantly higher recall of causal genes than random selection. Feature importance analysis identified key gene features affecting phenotypes. QTG-LGBM is available at http://www.deepcba.com/QTG-LGBM.
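The abstract does not spell out how recall among the top 20% of ranked genes is computed. A minimal sketch of one plausible reading follows: rank the genes in a QTL by model score, keep the top fraction, and count how many known causal genes fall inside it. The function name `recall_at_top_fraction`, the toy scores, and the rounding rule for the cutoff are assumptions made for illustration, not taken from QTG-LGBM.

```python
def recall_at_top_fraction(scores, is_causal, fraction=0.2):
    """Share of known causal genes ranked within the top `fraction` of genes.

    Hypothetical helper for illustration; not part of QTG-LGBM.
    """
    if len(scores) != len(is_causal):
        raise ValueError("scores and is_causal must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_causal = sum(is_causal)
    if n_causal == 0:
        raise ValueError("at least one known causal gene is required")

    # Assumed cutoff rule: round to the nearest gene, keep at least one.
    k = max(1, round(fraction * len(scores)))

    # Gene indices sorted from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(is_causal[i] for i in ranked[:k])
    return hits / n_causal


# Toy QTL with ten genes, two of them known causal (made-up numbers).
scores = [0.91, 0.15, 0.40, 0.08, 0.77, 0.22, 0.05, 0.31, 0.12, 0.60]
is_causal = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

recall = recall_at_top_fraction(scores, is_causal, fraction=0.2)
print(recall)  # 0.5: one of the two causal genes is in the top two
```

Under this definition, a ranking that selects genes uniformly at random has an expected recall equal to the fraction itself (0.2 here), which is the natural baseline for the comparison with random selection.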