
Information extraction using tree automata inference technique (cited by: 2)
Abstract: This paper proposes an information extraction technique based on an improved k-contextual tree automata inference algorithm. The core idea is to convert structured or semi-structured documents into trees, then use an improved k-contextual tree language, called the KLH tree language, to construct an unranked tree automaton that accepts the sample trees. Whether a piece of web page information is extracted is decided by whether the automaton ends in an accepting or a rejecting state. The method makes full use of the tree structure of web documents and, through tree automata, combines traditional structure-only extraction methods with grammatical inference principles to obtain extraction rules. Experiments show that, compared with similar extraction methods, the approach shortens both the sample learning time and the extraction time.
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University core journal list, 2010, No. 16, pp. 153-156 (4 pages)
Keywords: tree automata inference algorithm; (semi-)structured documents; unranked tree automata; information extraction; KLH tree language
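
The pipeline outlined in the abstract (documents become trees, an acceptor is inferred from sample trees, and acceptance decides extraction) can be illustrated with a deliberately simplified sketch. This is not the paper's KLH algorithm, whose details are not given in this record. It is a plain k-testable-style check in which a tree is accepted only if every depth-k pattern it contains was seen in the positive samples. The tuple encoding of trees and the names `node`, `truncate`, `k_forks`, `learn`, and `accepts` are assumptions made for illustration.

```python
# Simplified k-testable-style tree acceptor (illustrative assumption,
# not the KLH tree language from the paper).
# A tree is encoded as (label, (child, child, ...)) so that it is hashable.

def node(label, *children):
    """Build a tree node; children are stored as a tuple."""
    return (label, tuple(children))

def truncate(tree, k):
    """Return the top k levels of `tree` (k >= 1)."""
    label, children = tree
    if k == 1:
        return (label, ())
    return (label, tuple(truncate(c, k - 1) for c in children))

def k_forks(tree, k):
    """Collect the depth-k truncation rooted at every node of `tree`."""
    _, children = tree
    forks = {truncate(tree, k)}
    for child in children:
        forks |= k_forks(child, k)
    return forks

def learn(samples, k):
    """Record the root patterns and all local patterns of the positive samples."""
    roots = {truncate(t, k) for t in samples}
    forks = set()
    for t in samples:
        forks |= k_forks(t, k)
    return roots, forks

def accepts(model, tree, k):
    """Accept iff the root pattern and every local pattern were seen in training."""
    roots, forks = model
    return truncate(tree, k) in roots and k_forks(tree, k) <= forks

# Two hypothetical sample document trees.
s1 = node("div", node("p", node("b", node("#text"))))
s2 = node("div", node("span", node("b", node("i", node("#text")))))
model = learn([s1, s2], 2)

# A tree not among the samples, built only from patterns that were seen.
unseen_ok = node("div", node("p", node("b", node("i", node("#text")))))
# A tree containing a pattern (p directly over i) that never occurred.
unseen_bad = node("div", node("p", node("i", node("#text"))))

print(accepts(model, unseen_ok, 2))   # True
print(accepts(model, unseen_bad, 2))  # False
```

With k = 2 the acceptor generalizes beyond the two samples by recombining parent-child patterns, which is the basic effect that grammatical inference brings to structure-based extraction. Note that this naive version treats the exact sequence of children as part of each pattern, so it does not generalize over the number of siblings, a limitation that dedicated unranked tree automata techniques are designed to address.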

References (7)

  • 1. Ahonen H. Generating grammars for structured documents using grammatical inference methods[D]. University of Helsinki, Department of Computer Science, 1996.
  • 2. Freitag D. Using grammatical inference to improve precision in information extraction[C] // Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, ICML-97, 1997.
  • 3. Rico-Juan J, Calera-Rubio J, Carrasco R. Probabilistic k-testable tree-languages[C] // Lecture Notes in Computer Science 1891: ICGI 2000. [S.l.]: Springer, 2000: 221-228.
  • 4. Kosala R, van den Bussche J, Bruynooghe M, et al. Information extraction in structured documents using tree automata induction[C] // Lecture Notes in Computer Science 2431: PKDD. [S.l.]: Springer, 2002: 299-310.
  • 5. Wang Ru, Song Hantao, Lu Yuchang. Web data extraction based on tree automata[J]. 北京理工大学学报 (Transactions of Beijing Institute of Technology), 2004, 24(9): 790-793. (cited by: 6)
  • 6. Kosala R. Information extraction from Web documents based on local unranked tree automaton inference[C] // Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence. [S.l.]: Morgan Kaufmann, 2003: 403-408.
  • 7. Muggleton S. Inductive acquisition of expert knowledge[M]. Wokingham: Addison-Wesley, 1990: 80-85.

