Abstract
Software defect prediction mines historical software repositories to build models that identify potentially defective program modules in the project under test. However, the collected defect-prediction datasets often contain redundant and irrelevant features, which can degrade the performance of the prediction models. This paper proposes FECAR, a feature-selection method based on cluster analysis. FECAR first clusters the original features according to the feature-feature correlation (FFC). It then ranks the features within each cluster from high to low by their feature-class relevance (FCR) and selects a specified number of features from each cluster. In the empirical study, symmetric uncertainty is used to compute FFC, and information gain, chi-square, or ReliefF is used to compute FCR. Using real-world projects such as the Eclipse and NASA datasets, the study analyzes the performance of defect-prediction models after applying FECAR, as well as the redundancy rate and selection proportion of the feature subsets that FECAR selects. The results confirm the effectiveness of FECAR.
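The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. It assumes discrete feature values, uses symmetric uncertainty for FFC and information gain for FCR, and substitutes a one-pass medoid assignment with farthest-first seeding for the clustering step, because the abstract does not name the clustering algorithm. The function name `fecar_sketch`, the parameters `k` (number of clusters) and `m` (target subset size), the proportional per-cluster quota, and the toy feature names are all hypothetical choices made for illustration, not details taken from the paper.

```python
from collections import Counter
from math import ceil, log2


def entropy(xs):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())


def info_gain(xs, ys):
    """Information gain I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def symmetric_uncertainty(xs, ys):
    """SU(X,Y) = 2 * I(X;Y) / (H(X) + H(Y)), defined as 0 if both are constant."""
    denom = entropy(xs) + entropy(ys)
    return 2.0 * info_gain(xs, ys) / denom if denom > 0 else 0.0


def fecar_sketch(features, labels, k, m):
    """Cluster features by FFC, then keep the top-FCR features per cluster.

    features: dict mapping feature name -> list of discrete values.
    k: number of clusters (assumed parameter).
    m: approximate size of the selected subset (assumed parameter).
    """
    names = list(features)
    fcr = {f: info_gain(features[f], labels) for f in names}

    def ffc(a, b):
        return symmetric_uncertainty(features[a], features[b])

    # Assumption: seed with the most relevant feature, then repeatedly add
    # the feature least correlated with the medoids chosen so far.
    medoids = [max(names, key=fcr.get)]
    while len(medoids) < k:
        rest = [f for f in names if f not in medoids]
        medoids.append(min(rest, key=lambda f: max(ffc(f, c) for c in medoids)))

    # Assign every remaining feature to its most correlated medoid.
    clusters = {c: [c] for c in medoids}
    for f in names:
        if f not in clusters:
            clusters[max(medoids, key=lambda c: ffc(f, c))].append(f)

    # Assumption: each cluster contributes features in proportion to its size.
    selected = []
    for members in clusters.values():
        quota = ceil(len(members) * m / len(names))
        selected.extend(sorted(members, key=lambda f: -fcr[f])[:quota])
    return selected


# Toy data: "loc_dup" is largely redundant with "loc"; "noise" is irrelevant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "loc":     [0, 0, 0, 0, 1, 1, 1, 1],
    "loc_dup": [0, 0, 0, 1, 1, 1, 1, 1],
    "churn":   [0, 1, 0, 1, 0, 1, 1, 1],
    "noise":   [0, 1, 0, 1, 0, 1, 0, 1],
}
selected = fecar_sketch(features, labels, k=2, m=2)
print(selected)  # ['loc', 'churn']
```

On this toy data the redundant feature `loc_dup` falls into the same cluster as `loc` and is ranked below it, while the irrelevant feature `noise` is outranked by `churn` within its own cluster, so neither survives the selection. Because the quota is rounded up per cluster, the subset can be slightly larger than `m` in general.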
Source
《中国科学:信息科学》
CSCD
PKU Core Journal (北大核心)
2016, No. 9, pp. 1298-1320 (23 pages)
Scientia Sinica (Informationis)
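Funding
National Natural Science Foundation of China (Grant Nos. 61373012, 61321491, 91218302, 61202006)
National Basic Research Program of China (973 Program) (Grant No. 2009CB320705)
Natural Science Research Project of Jiangsu Higher Education Institutions (Grant No. 12KJB520014)
Open Project of the State Key Laboratory for Novel Software Technology, Nanjing University (Grant No. KFKT2016B18)
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University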
Keywords
software quality assurance
defect prediction
data mining
feature selection
cluster analysis