
Bayes Discriminator for BBS Documents Based on Latent Semantic Analysis (cited by 17)
Abstract: With the rapid development of the Internet, abuse of electronic bulletin board systems (BBS) has become a social problem characterized by information pollution, and discriminating BBS documents has become an important part of information security. Drawing on techniques from data mining, mathematical statistics, and natural language understanding, this paper proposes a discrimination method for BBS documents called Bayes Discrimination based on Latent Semantic Analysis (BDLSA). The method has four steps: (1) in a preprocessing stage, build a typical phrase set by extracting it from the training documents with natural language processing techniques; (2) apply synonymy reduction to the typical phrases through latent semantic analysis; (3) mine association rules among the typical phrases to increase their independence, so that the traditional Bayes discriminator works efficiently; (4) discriminate BBS documents with a Bayes classifier. Algorithms for constructing the typical phrase set and for synonymy reduction are proposed and implemented, and the key parameters that affect the system are discussed and tested extensively. Experiments on real documents from the Web, with 583 training documents and 308 test documents, reach a correctness of up to 75%, which indicates that the method is feasible and effective for discriminating BBS documents.
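Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journal list, 2004, No. 4, pp. 566-572 (7 pages)
Funding: Supported by the National Natural Science Foundation of China (60073046) and the Specialized Research Fund for the Doctoral Program of Higher Education (20020610007)
Keywords: data mining; association rule; Bayes classifier; latent semantic analysis; BBS; electronic bulletin board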
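
The final classification step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Bernoulli naive Bayes model with Laplace smoothing over phrase presence, and it assumes the synonymy reduction has already been computed and is available as a plain dictionary (`synonym_map`, a hypothetical stand-in for the output of latent semantic analysis). The phrases, the labels `flagged` and `acceptable`, and all function names are invented for illustration.

```python
import math
from collections import Counter


def canonicalize(phrases, synonym_map):
    """Replace each phrase by its canonical form (assumed output of synonymy reduction)."""
    return {synonym_map.get(p, p) for p in phrases}


def train(docs, synonym_map):
    """docs is a list of (set_of_phrases, label) pairs; returns a simple model dict."""
    class_counts = Counter()   # number of training documents per label
    phrase_counts = {}         # label -> number of documents containing each phrase
    vocab = set()
    for phrases, label in docs:
        feats = canonicalize(phrases, synonym_map)
        class_counts[label] += 1
        phrase_counts.setdefault(label, Counter()).update(feats)
        vocab |= feats
    return {"classes": class_counts, "phrases": phrase_counts,
            "vocab": vocab, "syn": synonym_map}


def classify(model, phrases):
    """Return the label with the highest log posterior under Bernoulli naive Bayes."""
    feats = canonicalize(phrases, model["syn"])
    total = sum(model["classes"].values())
    best_label, best_score = None, -math.inf
    for label, n_docs in model["classes"].items():
        score = math.log(n_docs / total)  # log prior
        for phrase in model["vocab"]:
            # Laplace-smoothed estimate of P(phrase present | label)
            p = (model["phrases"][label][phrase] + 1) / (n_docs + 2)
            score += math.log(p if phrase in feats else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data; the paper's real phrase set and labels are not given here.
training_docs = [
    ({"cheap pills", "buy now"}, "flagged"),
    ({"purchase now", "free money"}, "flagged"),
    ({"course schedule", "exam date"}, "acceptable"),
    ({"exam date", "library hours"}, "acceptable"),
]
synonym_map = {"purchase now": "buy now"}  # assumed result of synonymy reduction
model = train(training_docs, synonym_map)
print(classify(model, {"purchase now"}))
print(classify(model, {"exam date"}))
```

Naive Bayes treats features as conditionally independent given the class, which is why the abstract's third step tries to increase the independence of the typical phrases before classification; this sketch leaves that step out.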
