Abstract
To address the problem of similar and duplicate records in data warehouses, a data cleaning algorithm based on unsupervised learning is proposed. The algorithm uses an adaptive learning method based on the Hebbian postulate, in which the degree of similarity determines the reward and penalty rates. To overcome the dead-neuron (dead cluster) problem, a new cluster is created during learning whenever no existing cluster is similar to a pattern. After learning, the clusters are analyzed to detect wrong ones; any wrong cluster found is deleted and combined with the cluster most similar to it, which makes the clustering result more accurate. In the experiments, the algorithm was applied to a clustering task, and the results show that it performs entity identification accurately.
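The procedure outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rule, the similarity measure, or the criterion for a wrong cluster, so everything concrete here is an assumption made for illustration: cosine similarity, a winner/rival competitive update with rates scaled by similarity, and a hypothetical `min_members` rule for spotting wrong clusters. The function names and parameter values are likewise hypothetical and are not taken from the paper.

```python
import math

def similarity(a, b):
    """Cosine similarity (an assumed measure; the abstract names none)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def train(patterns, threshold=0.8, reward=0.3, penalty=0.05, epochs=3):
    """Adaptive learning: similarity sets the reward and penalty rates."""
    centers = []
    for _ in range(epochs):
        for p in patterns:
            sims = [similarity(p, c) for c in centers]
            # No existing cluster is similar enough: create a new one.
            if not sims or max(sims) < threshold:
                centers.append(list(p))
                continue
            order = sorted(range(len(centers)), key=lambda i: sims[i], reverse=True)
            win = order[0]
            # Reward: move the winner toward the pattern.
            rate = reward * sims[win]
            centers[win] = [w + rate * (x - w) for w, x in zip(centers[win], p)]
            if len(order) > 1:
                rival = order[1]
                # Penalty: move the runner-up slightly away (assumed rule).
                rate = penalty * max(0.0, sims[rival])
                centers[rival] = [w - rate * (x - w) for w, x in zip(centers[rival], p)]
    return centers

def assign(patterns, centers):
    """Label each pattern with the index of its most similar cluster."""
    return [max(range(len(centers)), key=lambda i: similarity(p, centers[i]))
            for p in patterns]

def prune(patterns, centers, min_members=2):
    """Drop clusters with too few members (a hypothetical 'wrong cluster' test).

    Members of a dropped cluster fall to the most similar remaining
    cluster the next time assign() is called.
    """
    labels = assign(patterns, centers)
    kept = [c for i, c in enumerate(centers) if labels.count(i) >= min_members]
    return kept or centers

# Toy records encoded as 2-D feature vectors; the last one is an outlier.
patterns = [(1.0, 0.05), (0.95, 0.1), (1.0, 0.0),
            (0.0, 1.0), (0.1, 0.9), (0.05, 1.0), (0.7, 0.7)]
centers = prune(patterns, train(patterns))
labels = assign(patterns, centers)
```

Creating clusters on demand means no cluster is initialized far from the data and then never wins, which is one way to read the abstract's remark about avoiding dead neurons. The pruning pass then removes clusters that the on-demand rule created spuriously.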
Source
《吉林大学学报(信息科学版)》
CAS
2008, No. 6, pp. 599-604 (6 pages)
Journal of Jilin University (Information Science Edition)
Funding
Project funded by the Science and Technology Department of Jilin Province (20071103)