Abstract
The traditional hidden Markov model (HMM) for Web information extraction is sensitive to its initial parameters, and in practice training easily converges to locally optimal model parameters. To address this, a hybrid conditional model that combines maximum entropy with the maximum entropy Markov model (MEMM) is proposed. The method parses the input Web page into an HTML tree and locates data regions by computing the entropy of HTML subtree nodes. It allows observations to be represented as arbitrary overlapping features (such as words, capitalization, HTML tags, and semantics) and defines the conditional probability of a state sequence given an observation sequence, which is used to perform the extraction. Experimental results show that the new method achieves better precision and recall than the traditional hidden Markov model and the maximum entropy Markov model.
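The data-region step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how the entropy of a subtree node is defined, so the version below is an assumption: it takes the Shannon entropy of the tag-name distribution over a node's descendants, on the intuition that a repetitive subtree (for example, a list of records) has low entropy. The names `Node`, `TreeBuilder`, `parse`, and `tag_entropy` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of a simplified HTML tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from start/end tags; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    """Parse an HTML string and return the synthetic root Node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_entropy(node):
    """Shannon entropy (bits) of tag names among all descendants of node.

    Assumed definition for illustration only; the paper may define it differently.
    """
    counts = Counter()
    stack = list(node.children)
    while stack:
        current = stack.pop()
        counts[current.tag] += 1
        stack.extend(current.children)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this assumed definition, a `<ul>` whose descendants are all `<li>` elements scores zero, while a `<div>` holding one `<p>` and one `<span>` scores one bit. A real system would need a threshold or ranking rule to turn these scores into data regions, and the abstract does not describe one.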
Source
《郑州大学学报(理学版)》
CAS
2008, Issue 3, pp. 52-55 (4 pages)
Journal of Zhengzhou University:Natural Science Edition
Funding
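Hunan Provincial Natural Science Foundation
Grant No. 04JJ40051
Scientific Research Project of the Hunan Provincial Education Department
Grant No. 06c724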