
Comparative analysis of data preprocessing methods based on typical data sets (cited by: 32)
Abstract: To examine how individual data preprocessing methods and their combinations affect model fit, this study applies 20 variable processing methods and 4 variable selection methods to typical data sets from the UCI Machine Learning Repository and compares the resulting model fits. The effects are examined separately for common classification models and common regression models. Based on an analysis of the experimental results, a heuristic algorithm is proposed that recommends data preprocessing methods according to the characteristics of the data, the characteristics of the model, and the type of research problem. Experiments on a broader collection of data sets show that the preprocessing methods recommended by the algorithm improve model fit to some extent and reduce the cost of selecting preprocessing methods by hand.
Authors: LI Yan-ping (李颜平); WU Gang (吴刚). Affiliations: School of Statistics and Data Science, Nankai University, Tianjin 300071, China; School of Computer Science and Engineering, Northeastern University, Shenyang 110004, China
Source: Journal of Shenyang University of Technology (《沈阳工业大学学报》), indexed in CAS and the Peking University Core Journals list (北大核心), 2022, No. 2, pp. 185-192 (8 pages)
Funding: National Key Research and Development Program of China (2019YFB1405300).
Keywords: data preprocessing; normalization; uniformization; dummy variable; analysis of variance; chi-square test; mutual information; Copula entropy
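
The abstract and keywords name two families of preprocessing steps: variable processing (for example rescaling and dummy variables) and variable selection (for example mutual information). The sketch below illustrates one plausible instance of each in plain Python. It is not the paper's implementation: the abstract does not enumerate the 20 processing methods or the 4 selection methods, so the specific transforms chosen here and all function names (`min_max_scale`, `standardize`, `dummy_encode`, `mutual_information`) are assumptions made for illustration.

```python
import math
from collections import Counter


def min_max_scale(values):
    """Rescale numeric values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def dummy_encode(categories):
    """Turn a categorical column into 0/1 dummy columns, one per level."""
    levels = sorted(set(categories))
    return {lvl: [1 if c == lvl else 0 for c in categories] for lvl in levels}


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Toy columns, invented for this example only.
ages = [20, 30, 40, 50]
colors = ["a", "b", "a", "c"]
scaled = min_max_scale(ages)
zscores = standardize(ages)
dummies = dummy_encode(colors)
```

A selection step in this style would score each candidate variable against the target with `mutual_information` and keep the highest-scoring ones. Note that `dummy_encode` keeps one column per level; dropping one reference level is a common variant for linear models.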
