
A Synthetical Approach for Detecting Approximately Duplicate Database Records of Multi-Language Data

Original title: 一种检测多语言文本相似重复记录的综合方法. Cited by: 26
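Abstract 1. Introduction. With the widespread application of information technology, making effective use of ever-growing data has become a pressing problem for enterprises. Data warehouse and data mining technologies give enterprises an effective means of extracting useful knowledge from a vast ocean of data. Real-world data, however, often has many quality problems, ranging from simple data-entry errors to more complex semantic inconsistencies between data items. If data quality falls short of what is required, the results produced by techniques such as data mining will be unsatisfactory and may even be wrong, misleading decision-making. Improving data quality is therefore important.

Detecting approximately duplicate records in a database is a key problem related to data quality. In this paper, we present a synthetical approach for recognizing clusters of approximately duplicate records in multi-language data. The key ideas are: (1) an efficient algorithm for sorting multi-language data; (2) an efficient edit-distance-based pairwise comparison method for multi-language data; (3) a priority queue of duplicate clusters, combined with a representative-records strategy, that responds adaptively to the data scale.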
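Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2002, No. 1, pp. 118-121 (4 pages)
Keywords: data warehouse; data mining; database; information duplication; method for approximately duplicate records in multi-language text; detection; approximately duplicate records; clustering; pairwise comparison; priority queue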
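
Key ideas (2) and (3) of the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm, whose details the abstract does not give. The function names (`edit_distance`, `similarity`, `cluster_duplicates`), the similarity threshold, the queue size and the representative limit are all hypothetical. Plain code-point sorting stands in for the paper's multi-language sort, and NFKC normalization with case folding is an assumed preprocessing step.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over Unicode code points (two-row DP)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; normalization is an assumption."""
    a = unicodedata.normalize("NFKC", a).casefold()
    b = unicodedata.normalize("NFKC", b).casefold()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_duplicates(records, threshold=0.8, queue_size=4, max_reps=3):
    """One pass over sorted records with a bounded queue of recent clusters.

    Each cluster keeps a few representative records; a new record is
    compared only with representatives of clusters still in the queue.
    """
    queue = []     # index 0 = highest priority (most recently matched)
    clusters = []  # every cluster ever created, in creation order
    for rec in sorted(records):  # code-point order, a stand-in sort
        hit_index = None
        for i, cluster in enumerate(queue):
            if any(similarity(rec, rep) >= threshold for rep in cluster["reps"]):
                hit_index = i
                break
        if hit_index is None:
            hit = {"reps": [rec], "members": [rec]}
            clusters.append(hit)
            if len(queue) == queue_size:
                queue.pop()  # evict the lowest-priority cluster
        else:
            hit = queue.pop(hit_index)
            hit["members"].append(rec)
            if len(hit["reps"]) < max_reps and rec not in hit["reps"]:
                hit["reps"].append(rec)
        queue.insert(0, hit)  # matched or new cluster moves to the front
    return [c["members"] for c in clusters]


if __name__ == "__main__":
    sample = ["数据仓库技术", "数据仓库技木", "data mining",
              "Data Mining", "priority queue"]
    for group in cluster_duplicates(sample):
        print(group)
```

Bounding the queue keeps the number of comparisons per record small and independent of the total number of clusters. The cost is that a duplicate may be missed if its cluster has already been evicted, which is why the input is sorted first so that likely duplicates arrive close together.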

References (8)

  • 1 Bitton D, DeWitt D J. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 1983, 8(2): 255~265
  • 2 Monge A E, Elkan C P. An efficient domain-independent algorithm for detecting approximately duplicate database records. 1997
  • 3 Hernandez M, Stolfo S. The merge/purge problem for large databases. In: Proc. of the ACM SIGMOD International Conference on Management of Data, May 1995. 127~138
  • 4 Qiu Yuefeng, Tian Zengping, Ji Wenyun, Zhou Aoying. An efficient method for detecting approximately duplicate records (一种高效的检测相似重复记录的方法) [J]. Chinese Journal of Computers (计算机学报), 2001, 24(1): 69~77. Cited by: 73
  • 5 Monge A E, Elkan C P. The field matching problem: Algorithms and applications. In: Proc. of the 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996. 267~270
  • 6 Smith T F, Waterman M S. Identification of common molecular subsequences. Journal of Molecular Biology, 1981, 147: 195~197
  • 7 Lowrance R, Wagner R A. An extension of the string-to-string correction problem. J. ACM, 1975, 22(2): 177~183
  • 8 Tarjan R E. Efficiency of a good but not linear set union algorithm. Journal of the ACM, 1975, 22(2): 215~225
