
Word Segmentation Processing for Chinese Address Information (基于中文地址类信息的分词处理) Cited by: 3

A Segmentation Method for Chinese Address Information
Abstract Identifying and eliminating approximately duplicated records is a central problem in cleansing dirty data in a data warehouse. To handle duplicated Chinese address information, a segmentation strategy based on feature characters is proposed. A meta-database containing the segmentation rules is established, and on that basis a feature-character-based segmentation algorithm is described. Experimental results show that segmentation time changes little as the data set grows. Applying the segmentation method to the detection of duplicated Chinese address records therefore does not increase the detection time.
Source Journal of Shenyang Institute of Aeronautical Engineering (《沈阳航空工业学院学报》), 2008, No. 4, pp. 63-66 (4 pages)
Keywords approximately duplicated records (相似重复记录); Chinese address information (中文地址); feature character (特征字符); word segmentation (分词)
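The abstract describes segmentation driven by feature characters but does not reproduce the rule meta-database or the algorithm itself. The following is only a minimal Python sketch of the general idea, under stated assumptions: the `FEATURE_CHARS` list, the function name `segment_address`, and the rule "close a segment after a feature character" are all hypothetical illustrations, not the authors' published rules.

```python
# Hypothetical feature characters that typically end an address component.
# The paper's actual rule meta-database is not shown in this record.
FEATURE_CHARS = ("省", "市", "区", "县", "镇", "乡", "村", "路", "街", "巷", "号")


def segment_address(address, feature_chars=FEATURE_CHARS):
    """Split an address string into components, closing a component
    right after each feature character (assumed rule)."""
    segments = []
    current = ""
    for ch in address:
        current += ch
        # A feature character that opens a component (e.g. the 市 in 市府路)
        # is treated as ordinary text, so require at least one preceding char.
        if ch in feature_chars and len(current) > 1:
            segments.append(current)
            current = ""
    if current:
        # Keep any trailing text that has no closing feature character.
        segments.append(current)
    return segments


print(segment_address("辽宁省沈阳市道义南大街37号"))
# → ['辽宁省', '沈阳市', '道义南大街', '37号']
```

In this sketch each address is scanned once, so the cost per record depends only on the length of the address string. Two records could then be compared component by component rather than as whole strings, which is one plausible way such segments might feed into duplicate detection.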
