Data Cleaning Algorithm Based on Unsupervised Learning (cited by: 3)
Abstract: To address the problem of approximately duplicate records in data warehouses, a data cleaning algorithm based on unsupervised learning is proposed. The algorithm uses an adaptive learning method based on the Hebbian postulate, in which the similarity level determines the reward and penalty rates. During learning, a new cluster is created whenever no existing cluster is similar to the current pattern, which avoids the dead-neuron (dead-cluster) problem. After learning, the clusters are analyzed to detect erroneous ones; any erroneous cluster found is deleted and combined with the cluster most similar to it, which makes the clustering result more accurate. Experiments show that the algorithm performs entity identification accurately.
Source: Journal of Jilin University (Information Science Edition), CAS, 2008, No. 6, pp. 599-604 (6 pages)
Funding: Supported by the Jilin Provincial Department of Science and Technology Fund (20071103)
Keywords: data warehouse; data extraction; data transformation; data cleaning; data loading
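
The abstract names the steps of the method (similarity-scaled reward and penalty, creating clusters on demand, merging erroneous clusters afterwards) but gives no update rules or parameters. The following is only a minimal sketch of that general idea, not the authors' algorithm. Everything specific in it is an assumption made for illustration: the character-bigram featurization, the cosine similarity, the runner-up penalty, the criterion that a cluster is "erroneous" when its prototype nearly coincides with another one, and all names and thresholds (`featurize`, `learn`, `merge_wrong_clusters`, `new_cluster_below`, `merge_above`, `lr`, `penalty`).

```python
import math
from collections import Counter

def featurize(record):
    """Hypothetical featurization: character-bigram counts of the normalized record."""
    s = " ".join(record.lower().split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def move(w, x, rate):
    """Return w + rate * (x - w); a negative rate pushes w away from x."""
    return {k: w.get(k, 0.0) + rate * (x.get(k, 0.0) - w.get(k, 0.0))
            for k in set(w) | set(x)}

def learn(xs, new_cluster_below=0.5, lr=0.2, penalty=0.1, epochs=3):
    """Competitive learning whose reward/penalty rates scale with similarity."""
    protos = []
    for _ in range(epochs):
        for x in xs:
            sims = [cosine(x, w) for w in protos]
            order = sorted(range(len(protos)), key=sims.__getitem__, reverse=True)
            if not order or sims[order[0]] < new_cluster_below:
                # No existing cluster is similar enough: add a new one
                # rather than relying on a fixed set of units.
                protos.append(dict(x))
                continue
            win = order[0]
            protos[win] = move(protos[win], x, lr * sims[win])  # reward the winner
            if len(order) > 1:
                rival = order[1]                                # assumed: penalize runner-up
                rate = penalty * lr * max(sims[rival], 0.0)
                protos[rival] = move(protos[rival], x, -rate)
    return protos

def merge_wrong_clusters(protos, merge_above=0.8):
    """Delete each cluster judged erroneous and fold it into its most similar cluster."""
    protos = [dict(w) for w in protos]
    i = 0
    while i < len(protos):
        sims = [(cosine(protos[i], protos[j]), j)
                for j in range(len(protos)) if j != i]
        if sims and max(sims)[0] > merge_above:
            j = max(sims)[1]
            protos[j] = move(protos[j], protos[i], 0.5)  # average the two prototypes
            del protos[i]
            i = 0                                        # rescan after every merge
        else:
            i += 1
    return protos

def assign(xs, protos):
    """Label each record with the index of its most similar prototype."""
    return [max(range(len(protos)), key=lambda c: cosine(x, protos[c])) for x in xs]

# Made-up records: two spellings each of two real-world entities.
records = [
    "John Smith, 12 Main Street",
    "Jon Smith, 12 Main St.",
    "Mary Jones, 5 Oak Avenue",
    "Mary Jones, 5 Oak Ave",
]
xs = [featurize(r) for r in records]
protos = merge_wrong_clusters(learn(xs))
labels = assign(xs, protos)
print(labels)
```

Records that receive the same label are candidate duplicates of one entity. In this sketch the two thresholds trade precision against recall: raising `new_cluster_below` splits records into more clusters, while lowering `merge_above` merges more of them afterwards.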