
A Study on Improving an Algorithm for Detecting Approximately Duplicate Records (cited by: 4)

Improved Method for Detecting Incremental Approximately Duplicate Records
Abstract: Detecting approximately duplicate records is an important task in data cleaning. This paper studies how to identify approximately duplicate records when the data set grows dynamically while the data schema and the matching rules remain unchanged. To address the low precision and low efficiency of the existing clustering-tree-based algorithm, an improved algorithm is proposed. It uses a rank-based method to assign a weight to each attribute and to reduce the attribute set, clusters similar records by building a clustering tree, and introduces an additional threshold that cuts unnecessary similarity comparisons, which improves both the efficiency and the accuracy of the algorithm. Experiments demonstrate the effectiveness of the algorithm, and directions for further research are proposed.
Source: Computer Technology and Development (《计算机技术与发展》), 2010, No. 7, pp. 13-16 (4 pages)
Funding: National Natural Science Foundation of China (70871033)
Keywords: approximately duplicate record; incremental; clustering tree; rank-based method
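
The abstract names three ingredients: rank-based attribute weights with attribute reduction, incremental clustering of similar records, and an extra threshold that avoids unnecessary similarity comparisons. The Python sketch below is only an illustration of how such pieces could fit together, not the paper's algorithm. Everything in it is an assumption: the rank-sum weight formula, the `min_weight` cutoff, the use of `difflib.SequenceMatcher` as the field similarity, the early-exit rule, the flat list of clusters standing in for the clustering tree, and all function and field names.

```python
from difflib import SequenceMatcher


def rank_sum_weights(ranks):
    """Turn importance ranks (1 = most important) into weights summing to 1.
    Assumption: rank-sum weighting, one common rank-based scheme."""
    n = len(ranks)
    total = n * (n + 1) / 2
    return {field: (n - r + 1) / total for field, r in ranks.items()}


def reduce_attributes(weights, min_weight):
    """Drop fields whose weight is below min_weight, then renormalise.
    Assumption: a simple cutoff stands in for the attribute reduction."""
    kept = {f: w for f, w in weights.items() if w >= min_weight}
    norm = sum(kept.values())
    return {f: w / norm for f, w in kept.items()}


def field_similarity(a, b):
    """Stand-in string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights, threshold):
    """Weighted similarity over the kept fields, heaviest field first.
    Returns None as soon as the threshold can no longer be reached,
    so the remaining field comparisons are skipped."""
    score, remaining = 0.0, 1.0
    for field, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        score += w * field_similarity(r1[field], r2[field])
        remaining -= w
        if score + remaining < threshold:
            return None
    return score


def add_record(clusters, record, weights, threshold):
    """Incrementally place one new record. Each cluster is a list whose
    first element acts as its representative (a flat simplification of
    a clustering tree)."""
    for cluster in clusters:
        sim = record_similarity(cluster[0], record, weights, threshold)
        if sim is not None and sim >= threshold:
            cluster.append(record)
            return
    clusters.append([record])


# Hypothetical data: field names, ranks, and records are made up.
ranks = {"name": 1, "address": 2, "phone": 3, "note": 4}
weights = reduce_attributes(rank_sum_weights(ranks), min_weight=0.15)

r1 = {"name": "John Smith", "address": "12 Main Street",
      "phone": "555-1234", "note": "x"}
r2 = {"name": "Jon Smith", "address": "12 Main St.",
      "phone": "555-1234", "note": "y"}
r3 = {"name": "Alice Wong", "address": "9 Lake Road",
      "phone": "444-9876", "note": "z"}

clusters = []
for rec in (r1, r2, r3):
    add_record(clusters, rec, weights, threshold=0.8)
```

Comparing the heaviest fields first makes the early exit fire sooner for clearly different records, which is the sense in which an added threshold can reduce comparison work. Because new records are only compared against cluster representatives, the same `add_record` call also serves the incremental case in which records arrive after the initial clustering.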

