
Research on Main Web Information Extraction Based on Structure and Content
Abstract: The main information on a web page is usually surrounded by advertisements, hyperlinks, and other irrelevant content, which makes isolating it a hard problem for automatic web information processing. Traditional extraction methods start from either the content or the structure of a page and rarely combine the two. This paper therefore proposes a method for extracting the main information of a web page that uses both. Drawing on the structure and the content of the page, it first partitions the page accurately into blocks, then analyzes the blocks, and finally extracts the main information of the page.
Authors: Zhang Wendong (张文东), Li Wei (李伟)
Source: Computer Engineering and Design (《计算机工程与设计》), indexed in CSCD and the Peking University Core Journals list, 2008, No. 24, pp. 6210-6212 (3 pages)
Keywords: web pages; content; structure; blocking; information extraction
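The abstract names three steps (block the page, analyze the blocks, extract the main information) but does not give the algorithm itself. The following is only a minimal sketch of that general idea, not the authors' method. It assumes that block-level HTML tags mark block boundaries (the structural cue) and that a block's amount of text outside hyperlinks is a usable score (the content cue). The names `BlockSplitter`, `main_block`, and the `BLOCK_TAGS` set are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit blocks; the paper's actual rules are not given.
BLOCK_TAGS = {"body", "div", "p", "table", "tr", "td", "ul", "ol", "li"}
SKIP_TAGS = {"script", "style"}


def visible_len(data):
    """Count the non-whitespace characters in a piece of text."""
    return len("".join(data.split()))


class BlockSplitter(HTMLParser):
    """Split a page into flat text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks as (text, score) pairs
        self._parts = []      # raw text pieces of the current block
        self._chars = 0       # visible characters in the current block
        self._link_chars = 0  # visible characters inside <a> elements
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block and score it by its non-link text.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._chars - self._link_chars))
        self._parts, self._chars, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1
        elif tag in SKIP_TAGS:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in SKIP_TAGS and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style content
        self._parts.append(data)
        n = visible_len(data)
        self._chars += n
        if self._in_link:
            self._link_chars += n

    def close(self):
        super().close()
        self._flush()


def main_block(html):
    """Return the text of the block with the most non-link text, or ''."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    return max(parser.blocks, key=lambda block: block[1])[0]
```

Calling `main_block(page_html)` on a page made of a navigation bar, an article, and an advertisement block would be expected to return the article text, because link-only blocks score zero under this heuristic. Real pages need more careful blocking and scoring than this sketch provides.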


