
Research on the WNBTE Method for Web Page Main-Text Extraction (WNBTE网页正文抽取方法研究) Cited by: 5

An Approach Based on Words Numbers for Extracting Text from Web Pages
Abstract: WNBTE is a method for extracting the main text of a web page based on character-count statistics. It analyzes the kinds of text that appear on a web page and their characteristics, locates the node that contains the most characters, and removes the layout text and explanatory text inside that node to obtain the main text. The method requires neither manual intervention nor training samples, which avoids a problem of traditional web content extraction methods: the need to build a separate extractor for each data source.
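The idea in the abstract (pick the node holding the most characters, then discard non-content text inside it) can be illustrated with a minimal Python sketch. This is not the authors' implementation; the abstract does not give their node definition or filtering rules. The sketch assumes that candidate nodes are a small set of container tags (`CONTAINERS`), that text is credited to the nearest enclosing container, that link text stands in for "layout text", and that the input HTML is reasonably well formed. It does not attempt to detect explanatory text. All names (`CharCounter`, `extract_main_text`, `SAMPLE_PAGE`) are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags are the candidate nodes; the paper may define them differently.
CONTAINERS = {"body", "div", "td", "article", "section"}
# Assumption: text inside these tags is treated as non-content and ignored.
SKIPPED = {"script", "style", "a"}


class CharCounter(HTMLParser):
    """Collects text fragments per container node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []      # ids of currently open containers
        self.skip_depth = 0  # > 0 while inside a skipped tag
        self.texts = {}      # container id -> list of text fragments
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.stack.append(self.next_id)
            self.texts[self.next_id] = []
            self.next_id += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in CONTAINERS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Credit visible text to the nearest enclosing container.
        if text and not self.skip_depth and self.stack:
            self.texts[self.stack[-1]].append(text)


def extract_main_text(html):
    """Return the text of the container with the largest character count."""
    parser = CharCounter()
    parser.feed(html)
    parser.close()
    if not parser.texts:
        return ""
    # len() counts Unicode code points, so Chinese characters count as one each.
    best = max(parser.texts.values(), key=lambda parts: sum(len(p) for p in parts))
    return "\n".join(best)


SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> | Site map</div>
<div id="content"><p>First paragraph of the article body, long enough to dominate.</p>
<p>Second paragraph with more article text.</p></div>
<div id="footer">Copyright 2008</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_PAGE))
```

Crediting text to the nearest container, rather than summing over all descendants, keeps the root node from trivially winning the count. A real page would likely need a finer rule for nested containers and for caption-like text.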
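Authors: 李纲 (Li Gang), 戴强斌 (Dai Qiangbin)
Source: 《情报科学》 (Information Science), CSSCI, Peking University Core Journal, 2008, No. 3, pp. 333-336 (4 pages)
Funding: National Natural Science Foundation of China (70673070)
Keywords: information processing; web page main-text extraction; automatic recognition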
