Abstract
This paper examines how data preprocessing methods, alone and in combination, affect model fit. Using typical datasets from the UCI Machine Learning Repository, 20 variable processing methods and 4 variable selection methods were applied to preprocess the data, and the resulting model fits were compared. The effects of these preprocessing methods and their combinations were examined separately for common classification models and regression models. Based on an analysis of the experimental results, a heuristic algorithm is proposed that recommends data preprocessing methods according to data characteristics, model characteristics, and the type of research problem. Experiments on a wider range of datasets show that the preprocessing methods recommended by the algorithm improve model fit to some extent and reduce the cost of selecting preprocessing methods by hand.
Authors
LI Yan-ping (李颜平)
WU Gang (吴刚)
School of Statistics and Data Science, Nankai University, Tianjin 300071, China; School of Computer Science and Engineering, Northeastern University, Shenyang 110004, China
Source
Journal of Shenyang University of Technology (《沈阳工业大学学报》)
CAS
Peking University Core Journals (北大核心)
2022, No. 2, pp. 185-192 (8 pages)
Funding
National Key Research and Development Program of China (2019YFB1405300).
Keywords
data preprocessing
normalization
uniformization
dummy variable
analysis of variance
chi-square test
mutual information
Copula entropy