HSMOTE-AdaBoost: An ensemble classification algorithm based on improved hybrid boundary resampling
Abstract When handling class imbalance, existing sampling methods are sensitive to noise and neglect boundary samples; in particular, they ignore intra-class differences among majority-class samples. Instances near the class boundary are easily misclassified, yet they play an important role in determining the decision boundary. This paper combines and improves SMOTE oversampling and random undersampling (RUS) to propose a hybrid boundary resampling algorithm, HSMOTE-AdaBoost. The algorithm first applies SMOTE oversampling to the minority class to improve the balance of the data. It then uses the K-nearest-neighbor algorithm to remove noise and the overlapping instances produced by sampling. At the same time, it identifies and retains boundary majority-class samples based on their average Euclidean distance to the minority-class samples, and randomly undersamples the remaining data. Finally, exploiting the strengths of AdaBoost, it trains on the balanced dataset over multiple iterations to obtain the final classification model. Simulation experiments show that, compared with the traditional SMOTE-Boost, RUS-Boost, and PC-Boost algorithms and the improved KSMOTE-AdaBoost algorithm, the proposed model improves F-measure, G-mean, and AUC on imbalanced datasets by up to 22.97%, 13.88%, and 10.03%, respectively, indicating better classification performance.
Authors LI Jing (李静); LIU Jiang (刘姜); NI Feng (倪枫); LI Xiaoyu (李笑语) (Business School, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source Intelligent Computer and Applications (《智能计算机与应用》), 2023, No. 7, pp. 7-14 (8 pages)
Funding National Natural Science Foundation of China (11701370); Shanghai "Systems Science" Peak Discipline Construction Project.
Keywords class imbalance; SMOTE oversampling; AdaBoost algorithm; noise sample; boundary sample
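
The resampling stage described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract gives no parameter values, so the function name `hybrid_resample`, the neighbor count `k`, the fraction `boundary_frac` of majority samples treated as boundary samples, the halfway oversampling target, and the "at least half of the neighbors agree" cleaning rule are all assumptions made for this example. Label 1 marks the minority class and label 0 the majority class.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating toward minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base, excluding base itself
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: dist(base, p))[:k]
        target = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


def knn_clean(points, labels, k):
    """Keep a point only if at least half of its k nearest neighbors share its label
    (an assumed cleaning rule; the abstract does not specify one)."""
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: dist(p, points[j]))[:k]
        agree = sum(1 for j in nearest if labels[j] == y)
        if agree * 2 >= len(nearest):
            kept_points.append(p)
            kept_labels.append(y)
    return kept_points, kept_labels


def hybrid_resample(X, y, k=3, boundary_frac=0.3, seed=0):
    """Hypothetical sketch of the hybrid boundary resampling steps."""
    rng = random.Random(seed)
    minority = [p for p, t in zip(X, y) if t == 1]
    majority = [p for p, t in zip(X, y) if t == 0]
    if len(minority) < 2:
        raise ValueError("need at least two minority samples for SMOTE")
    # Step 1: SMOTE up to halfway between the class sizes (assumed target).
    target = (len(minority) + len(majority)) // 2
    minority = minority + smote(minority, max(0, target - len(minority)), k, rng)
    # Step 2: KNN-based removal of noisy and overlapping instances.
    points, labels = knn_clean(minority + majority,
                               [1] * len(minority) + [0] * len(majority), k)
    minority = [p for p, t in zip(points, labels) if t == 1]
    majority = [p for p, t in zip(points, labels) if t == 0]
    if not minority:
        raise ValueError("cleaning removed every minority sample")
    # Step 3: keep the majority samples with the smallest average Euclidean
    # distance to the minority class as boundary samples.
    majority.sort(key=lambda p: sum(dist(p, m) for m in minority) / len(minority))
    n_boundary = int(boundary_frac * len(majority))
    boundary, rest = majority[:n_boundary], majority[n_boundary:]
    # Step 4: randomly undersample the remaining majority samples.
    n_rest = max(0, min(len(rest), len(minority) - len(boundary)))
    majority = boundary + rng.sample(rest, n_rest)
    return minority + majority, [1] * len(minority) + [0] * len(majority)


# Toy data: 40 majority points near the origin, 8 minority points near (5, 5).
X = [(i * 0.1, j * 0.1) for i in range(8) for j in range(5)]
X += [(5 + i * 0.1, 5 + j * 0.1) for i in range(4) for j in range(2)]
y = [0] * 40 + [1] * 8
X_bal, y_bal = hybrid_resample(X, y)
print(y_bal.count(1), y_bal.count(0))
```

The balanced output would then be passed to a boosting learner for the final step; scikit-learn's `AdaBoostClassifier` is one readily available option, although the abstract does not say which implementation or base classifier the authors used.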