
Web Information Extraction Based on Hybrid Conditional Model (cited by: 2)
Abstract: The hidden Markov model (HMM) traditionally used for Web information extraction is very sensitive to its initial parameters, and in practice training easily converges to locally optimal model parameters. To address this, a hybrid conditional model that combines maximum entropy with the maximum entropy Markov model is put forward. The method parses the input Web page and builds an HTML tree, then locates data regions by computing the entropy of HTML subtree nodes. It allows observations to be represented as arbitrary overlapping features (such as words, capitalization, HTML tags, and semantics) and defines the conditional probability of a state sequence given an observation sequence, which is used to carry out the extraction. Experimental results show that the new approach achieves better precision and recall than both the traditional hidden Markov model and the maximum entropy Markov model.
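The tree-building and entropy step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the entropy definition or the decision rule, so everything below is an assumption made for illustration: entropy is taken as the Shannon entropy of the tag names of a node's direct children, and a node is flagged as a candidate data region when it has several children with low entropy (repeated, similar records). The names `Node`, `TreeBuilder`, `build_tree`, `child_tag_entropy`, `candidate_regions` and the thresholds `min_children` and `max_entropy` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare-bones HTML tree node (hypothetical helper, tag name only)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from the parser's start/end tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tree(html):
    parser = TreeBuilder()
    parser.feed(html)
    parser.close()
    return parser.root


def child_tag_entropy(node):
    """Shannon entropy (in bits) of the tag names of the node's direct children."""
    counts = Counter(child.tag for child in node.children)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def candidate_regions(root, min_children=3, max_entropy=0.5):
    """Assumed rule: many children with near-uniform tags suggest a data region."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        entropy = child_tag_entropy(node)
        if len(node.children) >= min_children and entropy <= max_entropy:
            found.append((node.tag, len(node.children), entropy))
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(node.children))
    return found


if __name__ == "__main__":
    page = ("<html><body><div><p>a</p><span>b</span></div>"
            "<ul><li>x</li><li>y</li><li>z</li></ul></body></html>")
    print(candidate_regions(build_tree(page)))
```

In this toy page the `ul` element has three `li` children and zero entropy, so it is the only node flagged, while the `div` with one `p` and one `span` has an entropy of 1 bit and is skipped. The paper's own entropy measure and thresholds may differ from this sketch.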
Source: Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》), CAS, 2008, No. 3, pp. 52-55 (4 pages)
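Funding: Natural Science Foundation of Hunan Province, grant No. 04JJ40051; Scientific Research Project of the Hunan Provincial Department of Education, grant No. 06c724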
Keywords: Web information extraction; maximum entropy Markov model; conditional model; maximum entropy; hidden Markov model

References (6)

  • 1 Seymore K, McCallum A, Rosenfeld R. Learning hidden Markov model structure for information extraction[C]//Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. Orlando, Florida, 1999: 37-42.
  • 2 Liu Yunzhong, Lin Yaping, Chen Zhiping. Text information extraction based on hidden Markov model[J]. Journal of System Simulation (系统仿真学报), 2004, 16(3): 507-510. Cited by: 52
  • 3 Berger A, Pietra S, Pietra V. A maximum entropy approach to natural language processing[J]. Computational Linguistics, 1996, 22(1): 39-71.
  • 4 Lin Yaping, Liu Yunzhong, Zhou Shunxian, Chen Zhiping, Cai Lijun. Text information extraction based on maximum entropy hidden Markov model[J]. Acta Electronica Sinica (电子学报), 2005, 33(2): 236-240. Cited by: 49
  • 5 McCallum A, Freitag D, Pereira F. Maximum entropy Markov models for information extraction and segmentation[C]//Proceedings of the Seventeenth International Conference on Machine Learning. San Francisco, 2000: 591-598.
  • 6 Phan X, Horiguchi S, Ho T. Automated data extraction from the Web with conditional models[J]. Int J Business Intelligence and Data Mining, 2005, 1(2): 194-209.
