Abstract
Data mining has been widely applied in healthcare, yet most healthcare datasets contain missing values. This paper introduces several missing-value estimation methods and builds five models to improve the effectiveness of prediction: a basic model that retains the missing values, a direct-deletion model, a Bayesian estimation model, an iterative Bayesian estimation model, and an iterative Bayesian estimation model based on information gain. The models are applied to and analyzed on the Clinics dataset. The performance of each model is evaluated with the C4.5 decision tree and 10-fold cross-validation. The results show that using a naive Bayesian classifier to predict missing values iteratively, in decreasing order of information gain, is effective.
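The abstract does not give the algorithm in detail, so the following is only a minimal sketch of one plausible reading of "predict missing values iteratively in decreasing order of information gain". It assumes categorical attributes, Laplace smoothing, and that the class label may serve as an extra predictor when filling a value; none of these choices are confirmed by the paper. All names (`entropy`, `information_gain`, `nb_predict`, `impute_by_information_gain`) are hypothetical, and the C4.5 evaluation with 10-fold cross-validation is not reproduced here.

```python
import math
from collections import Counter, defaultdict

MISSING = None  # assumption: missing values are encoded as None


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, col):
    """Information gain of attribute `col` about the class,
    computed only on rows where `col` is observed."""
    known = [(r[col], y) for r, y in zip(rows, labels) if r[col] is not MISSING]
    if not known:
        return 0.0
    groups = defaultdict(list)
    for value, y in known:
        groups[value].append(y)
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([y for _, y in known]) - remainder


def nb_predict(train, target, query):
    """Naive Bayes with Laplace smoothing: return the most probable value
    of column `target` for `query`, ignoring the query's missing columns."""
    best, best_score = None, -math.inf
    for value, count in Counter(r[target] for r in train).items():
        subset = [r for r in train if r[target] == value]
        score = math.log(count / len(train))  # log prior
        for col, x in enumerate(query):
            if col == target or x is MISSING:
                continue
            domain = {r[col] for r in train if r[col] is not MISSING} | {x}
            matches = sum(1 for r in subset if r[col] == x)
            observed = sum(1 for r in subset if r[col] is not MISSING)
            score += math.log((matches + 1) / (observed + len(domain)))
        if score > best_score:
            best, best_score = value, score
    return best


def impute_by_information_gain(rows, labels):
    """Fill attributes one at a time, most informative attribute first.
    Values filled in earlier passes are reused as evidence in later ones."""
    data = [list(r) + [y] for r, y in zip(rows, labels)]  # label as last column
    order = sorted(range(len(rows[0])),
                   key=lambda c: information_gain(rows, labels, c),
                   reverse=True)
    for col in order:
        train = [r for r in data if r[col] is not MISSING]
        if not train:
            continue  # nothing to learn from; leave the column as-is
        for r in data:
            if r[col] is MISSING:
                r[col] = nb_predict(train, col, r)
    return [r[:-1] for r in data], order


# Toy usage with made-up data (not taken from the Clinics dataset).
rows = [["high", "yes"], ["high", "yes"], ["low", "no"],
        ["low", "no"], ["high", None], [None, "no"]]
labels = ["sick", "sick", "well", "well", "sick", "well"]
filled, order = impute_by_information_gain(rows, labels)
print(filled)
```

Ordering by information gain means the attributes most closely tied to the class are completed first, so their filled values can support the estimates for less informative attributes. Whether the paper also repeats the whole pass several times is not stated in the abstract.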
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2004, No. 10, pp. 155-156, 174 (3 pages)
Funding
Supported by the "211 Project" Key Discipline Construction Program of Shanghai University of Finance and Economics (2004[9])