Abstract
Most traditional classifiers are obtained by optimizing the accuracy criterion, which makes them unsuitable for imbalanced data learning (IDL). To address this problem, the paper performs "meta-learning" on a support vector machine (SVM) model and investigates how the following evaluation metrics affect IDL: accuracy, balanced accuracy, geometric mean, F1 score, information gain, AUC (area under the ROC curve), and two new metrics proposed in the paper, GAF and GBF. Simulation experiments are conducted on 16 imbalanced datasets from UCI. Statistical analysis of the experimental results indicates that (1) the metrics differ significantly in their effect on classifier performance; (2) even for an advanced learning method such as the SVM, a classifier selected by maximizing accuracy tends to be biased toward predicting the majority class; (3) optimizing on the other metrics yields bias-corrected SVM classifiers with better overall performance, especially in predicting the minority class; and (4) SVM classifiers optimized on the GAF and GBF metrics show stable and good performance.
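The contrast between accuracy and the imbalance-aware metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions of accuracy, balanced accuracy, geometric mean and F1 score, which may differ in detail from those used in the paper. The function name `imbalance_metrics` and the confusion counts are hypothetical. GAF and GBF are omitted because the abstract does not define them.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Standard binary metrics from confusion counts.

    The minority class is treated as the positive class (an assumption
    made for this illustration only).
    """
    total = tp + fn + fp + tn
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall on the minority class
    tnr = tn / (tn + fp) if (tn + fp) else 0.0        # recall on the majority class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (tpr + tnr) / 2,
        "g_mean": math.sqrt(tpr * tnr),
        "f1": f1,
    }


# Hypothetical data: 95 majority samples, 5 minority samples.
# A classifier that always predicts the majority class:
biased = imbalance_metrics(tp=0, fn=5, fp=0, tn=95)

# A classifier that trades some majority errors for minority recall:
corrected = imbalance_metrics(tp=4, fn=1, fp=10, tn=85)

for name, scores in (("biased", biased), ("corrected", corrected)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this toy case the majority-only classifier has the higher accuracy, yet its geometric mean and F1 score are zero, which matches the kind of bias the abstract attributes to accuracy-based selection.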
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2010, No. 4, pp. 147-155 (9 pages)
Journal of South China University of Technology (Natural Science Edition)
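Funding
Guangdong Province and Ministry of Education Industry-University-Research Cooperation Project (2007B090400031)
Guangdong Higher Education Outstanding Young Innovative Talents Cultivation Project (LYM08074)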