Abstract
To address class imbalance, low prediction accuracy, and high feature dimensionality in software defect prediction data, this paper proposes RUS-RSMOTE-PCA-Vote (random under-sampling, random synthetic minority oversampling technique, principal component analysis, vote), a classification method for imbalanced software defect data. First, random under-sampling reduces the number of non-defective samples. SMOTE oversampling is then applied; during oversampling, an influence factor posFac (position factor), which takes the overall sample distribution into account, guides the synthesis of new samples. The data set produced by RUS-RSMOTE hybrid sampling is then reduced in dimensionality with PCA. Finally, Vote combines K-nearest neighbor, decision tree, and support vector machine classifiers into an ensemble classifier. Experiments on NASA (National Aeronautics and Space Administration) data sets show that the proposed method outperforms existing imbalanced-data classification methods in F-value, G-mean, and AUC, effectively improving classification performance on software defect prediction data sets.
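The hybrid sampling stage described in the abstract can be illustrated with a minimal sketch, assuming plain SMOTE-style interpolation. The abstract does not define how posFac is computed, so the sketch draws the interpolation gap uniformly at random; the function names, the balancing target, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
import random


def random_under_sample(majority, keep, rng):
    """RUS: randomly keep `keep` rows of the majority (non-defective) class."""
    return rng.sample(majority, keep)


def smote_like_oversample(minority, n_new, k, rng):
    """Create n_new synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours.
    Needs at least two minority rows."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        mate = rng.choice(neighbours)
        # Assumption: uniform gap in [0, 1). The paper's posFac would
        # steer this choice, but the abstract gives no formula for it.
        gap = rng.random()
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def hybrid_sample(majority, minority, k=3, seed=0):
    """Under-sample the majority class, then oversample the minority class,
    so that both classes end up at the same (hypothetical) target size."""
    rng = random.Random(seed)
    target = (len(majority) + len(minority)) // 2
    kept = random_under_sample(majority, min(target, len(majority)), rng)
    new = smote_like_oversample(minority, max(target - len(minority), 0), k, rng)
    return kept, minority + new


# Toy data: 20 non-defective rows and 4 defective rows, two features each.
majority = [[float(i), float(i % 5)] for i in range(20)]
minority = [[100.0, 100.0], [101.0, 100.0], [100.0, 101.0], [101.0, 101.0]]

kept, balanced_minority = hybrid_sample(majority, minority)
print(len(kept), len(balanced_minority))  # prints "12 12"
```

In this sketch both classes are brought to the midpoint of the two original class sizes. The remaining stages (PCA followed by a voting ensemble of K-nearest neighbor, decision tree, and support vector machine classifiers) would then operate on the balanced data.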
Authors
LIU Wenying (刘文英); LIN Yalin (林亚林); LI Kewen (李克文); LEI Yongxiu (雷永秀)
College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, Shandong 266580, China
Source
Journal of Shandong University of Science and Technology (Natural Science) (《山东科技大学学报(自然科学版)》)
Indexed in: CAS; PKU Core Journals (北大核心)
2021, No. 2, pp. 84-94 (11 pages)
Funding
National Natural Science Foundation of China (61673396)
Natural Science Foundation of Shandong Province (ZR2017MF032)
Keywords
software defect prediction
unbalanced data
hybrid sampling
feature dimensionality reduction
ensemble classifier