Abstract
When handling class-imbalanced data, existing sampling methods are susceptible to noise and tend to ignore boundary samples; in particular, they overlook the intra-class differences among majority class samples. Instances located near the boundary are easily misclassified, yet they play an important role in determining the decision boundary. This paper combines and improves SMOTE oversampling and random undersampling (RUS) to propose a hybrid boundary resampling algorithm, HSMOTE-AdaBoost. The algorithm first applies SMOTE oversampling to the minority class to improve the balance of the data. It then uses the K-nearest-neighbor algorithm to remove noise and the overlapping instances produced by sampling. At the same time, it identifies and retains boundary majority class samples based on their average Euclidean distance to the minority class samples, and randomly undersamples the remaining data. Finally, it exploits the strengths of AdaBoost by training on the balanced dataset over multiple iterations to obtain the final classification model. Simulation experiments show that, compared with the traditional SMOTE-Boost, RUS-Boost, and PC-Boost algorithms and the improved KSMOTE-AdaBoost algorithm, the model improves F-measure, G-mean, and AUC on imbalanced datasets by up to 22.97%, 13.88%, and 10.03%, respectively, indicating better classification performance.
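The resampling stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the oversampling amount, the boundary criterion, or the target class ratio, so the sketch assumes that SMOTE doubles the minority class, that "boundary" majority samples are those with the smallest mean Euclidean distance to the minority class, and that the majority class is reduced to the minority size. The names `smote`, `knn_clean`, `hybrid_resample`, and `boundary_frac` are hypothetical.

```python
import numpy as np

def smote(X_min, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    d = np.linalg.norm(X_min[:, None] - X_min[None], axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), n_new)
    neigh = nn[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

def knn_clean(X, y, k):
    """Drop samples whose label disagrees with the majority vote of
    their k nearest neighbours (labels: 1 = minority, 0 = majority)."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    pred = (y[nn].mean(axis=1) > 0.5).astype(int)
    keep = pred == y
    return X[keep], y[keep]

def hybrid_resample(X, y, k=5, boundary_frac=0.3, seed=0):
    rng = np.random.default_rng(seed)
    X_min = X[y == 1]
    # Step 1: SMOTE on the minority class (assumption: double its size).
    X_syn = smote(X_min, len(X_min), min(k, len(X_min) - 1), rng)
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([y, np.ones(len(X_syn), dtype=int)])
    # Step 2: KNN-based removal of noisy and overlapping instances.
    X_all, y_all = knn_clean(X_all, y_all, k)
    X_min, X_maj = X_all[y_all == 1], X_all[y_all == 0]
    # Step 3: rank majority samples by mean Euclidean distance to the
    # minority class; the closest ones are kept as boundary samples
    # (assumption: closest = boundary).
    mean_dist = np.linalg.norm(X_maj[:, None] - X_min[None], axis=2).mean(axis=1)
    order = np.argsort(mean_dist)
    n_target = min(len(X_maj), len(X_min))
    n_boundary = int(boundary_frac * n_target)
    boundary, rest = order[:n_boundary], order[n_boundary:]
    # Step 4: random undersampling of the remaining majority samples.
    sampled = rng.choice(rest, size=n_target - n_boundary, replace=False)
    keep = np.concatenate([boundary, sampled])
    X_bal = np.vstack([X_min, X_maj[keep]])
    y_bal = np.concatenate([np.ones(len(X_min), dtype=int),
                            np.zeros(len(keep), dtype=int)])
    return X_bal, y_bal

# Toy data: 200 majority points and 20 minority points in 2-D.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(3.0, 0.5, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
X_bal, y_bal = hybrid_resample(X, y)
print(np.bincount(y_bal))  # class counts after resampling
```

The final stage described in the abstract, training AdaBoost on the balanced data, is left out of the sketch; `X_bal` and `y_bal` could be passed to any AdaBoost implementation, for example scikit-learn's `AdaBoostClassifier`.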
Authors
李静
刘姜
倪枫
李笑语
LI Jing; LIU Jiang; NI Feng; LI Xiaoyu (Business School, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source
《智能计算机与应用》
2023, No. 7, pp. 7-14 (8 pages)
Intelligent Computer and Applications
Funding
National Natural Science Foundation of China (11701370)
Shanghai "Systems Science" Peak Discipline Construction Project.