
XGBoost for Imbalanced Data Classification Based on a Cost-sensitive Activation Function
Cited by: 15
Abstract: To address the reduced ability of the XGBoost framework to recognize minority-class samples in binary classification under data imbalance, an XGBoost algorithm based on a cost-sensitive activation function (Cost-sensitive Activation Function XGBoost, CSAF-XGBoost) is proposed. When XGBoost constructs decision trees, data imbalance affects split-point selection, causing minority-class samples to be misclassified. Introducing a cost-sensitive activation function changes how the gradient of the loss function varies under different prediction outcomes, addressing the problem that misclassified minority-class samples cannot be effectively classified during XGBoost iterations because their gradient variation is too small. Experiments analyze the relationship between the activation function's parameters and the degree of data imbalance, and compare the classification performance of CSAF-XGBoost with the SMOTE-XGBoost, ADASYN-XGBoost, Focal loss-XGBoost, and Weight-XGBoost optimization algorithms on UCI public datasets. The results show that, with F1 and AUC values equal to or better than those of the competing methods, CSAF-XGBoost raises the recall of minority-class samples by 6.75% on average, and by up to 15%, over the best competing algorithm, demonstrating a higher recognition ability for minority-class samples and broad applicability.
Authors: LI Jing-tai; WANG Xiao-dan (Air and Missile Defense College, Air Force Engineering University, Xi'an 710051, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2022, No. 5, pp. 135-143 (9 pages)
Keywords: Cost-sensitive; Logistic regression; Imbalanced data classification; XGBoost; Activation function
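The abstract describes shaping the loss gradient so that misclassified minority-class samples retain a large gradient across boosting rounds. The paper's exact CSAF formulation is not reproduced here; as a minimal sketch of the general mechanism, the functions below compute the gradient and Hessian of a class-weighted logistic loss in the per-sample form that XGBoost's custom-objective interface expects. The weight names `w_pos` and `w_neg` and the specific weighting scheme are illustrative assumptions, not the paper's method.

```python
import math

def sigmoid(z):
    """Logistic activation mapping a raw boosting score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def weighted_logloss_grad_hess(y, raw_score, w_pos=5.0, w_neg=1.0):
    """Gradient and Hessian of a class-weighted logistic loss for one sample.

    y          -- true label, 0 (majority) or 1 (minority)
    raw_score  -- current additive score of the boosted ensemble
    w_pos      -- hypothetical cost weight for the minority (positive) class
    w_neg      -- cost weight for the majority (negative) class

    Weighting the minority class more heavily keeps the gradient of a
    misclassified minority sample large, so later trees keep correcting it.
    """
    p = sigmoid(raw_score)
    w = w_pos if y == 1 else w_neg
    grad = w * (p - y)          # first derivative of the loss w.r.t. raw_score
    hess = w * p * (1.0 - p)    # second derivative w.r.t. raw_score
    return grad, hess
```

In practice, a vectorized version of this pair would be passed as a custom objective to the XGBoost training API, so that split-point selection is driven by the cost-weighted gradients rather than the plain logistic ones.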
