Abstract
A barycentre graphical feature of the star plot (radar chart) is proposed, based on the star-plot representation of multivariate data. For the same multivariate data, different feature orderings yield different star plots and therefore different barycentre features, which in turn affect classifier performance. This raises a new problem: feature ordering in the extraction of graphical features from star plots. To address it, a feature ordering method based on an improved genetic algorithm is proposed. The traditional ranking-based feature selection method is also studied and improved. Classification experiments on nine real machine learning data sets show two results. First, barycentre features under the original feature ordering are not always better than traditional feature extraction methods, whereas barycentre features under the genetic-algorithm ordering do outperform them. Second, barycentre features under the genetic-algorithm ordering outperform barycentre features under the ordering given by traditional ranking-based feature selection. In particular, for the high-dimensional, small-sample lung cancer data set, the leave-one-out cross-validation error rate is 12.5%. The classification results on data sets including breast-cancer-Wisconsin and Pima Indians diabetes are better than those previously reported in the international literature.
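The dependence of barycentre features on feature ordering can be illustrated with a minimal sketch. The abstract does not give the exact feature definition, so the code below assumes one plausible reading: each normalized feature value is placed on an evenly spaced axis, adjacent axes form a triangle with the origin, and the feature is the distance from the origin to that triangle's centroid. The function name `barycentre_features` and the sample values are hypothetical.

```python
import cmath
import math


def barycentre_features(values, order):
    """Centroid distances of the triangles between adjacent star-plot axes.

    Assumed definition (not taken from the paper): feature order[k] is drawn
    on the k-th axis at angle 2*pi*k/n, and each pair of adjacent axis points
    forms a triangle with the origin.
    """
    n = len(order)
    # Represent each axis point as a complex number r * exp(i * theta).
    points = [values[j] * cmath.exp(2j * math.pi * k / n)
              for k, j in enumerate(order)]
    features = []
    for k in range(n):
        p, q = points[k], points[(k + 1) % n]
        centroid = (p + q) / 3  # triangle vertices: origin, p, q
        features.append(abs(centroid))
    return features


# Hypothetical normalized sample with four features.
sample = [0.2, 0.9, 0.4, 0.7]
original = barycentre_features(sample, [0, 1, 2, 3])
reordered = barycentre_features(sample, [0, 2, 1, 3])
```

Here `original` and `reordered` are computed from the same sample but differ as sets of values, because swapping two axes changes which features are adjacent. Under this assumption, a search procedure such as a genetic algorithm over permutations would score each candidate `order` by the classification error obtained from the resulting features.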
Source
Journal of Yanshan University (《燕山大学学报》)
CAS
2008, No. 5, pp. 421-428 (8 pages)
Funding
National Natural Science Foundation of China (grants 60474065, 60504035, 60605006)
Keywords
feature extraction
feature ordering
feature selection
genetic algorithm
star plot
pattern recognition