Abstract
To further improve the accuracy and efficiency of Web information extraction, an improved method that embeds a genetic algorithm in a second-order hidden Markov model is proposed. It addresses the shortcomings, in initial-value selection and parameter optimization, of the existing hybrid of a genetic algorithm and a first-order hidden Markov model. In the hierarchical preprocessing phase, format information and text features are used to segment the text into the appropriate level: lines, blocks, or individual words. The embedded genetic algorithm and second-order hidden Markov hybrid model are then used to train the parameters: the optimal and sub-optimal chromosomes are both retained to correct the initial parameters of the Baum-Welch algorithm, and the genetic algorithm is applied repeatedly to fine-tune the second-order hidden Markov model. Finally, an improved Viterbi algorithm performs the Web information extraction. Experimental results show that the improved method outperforms the hybrid of a genetic algorithm and a first-order hidden Markov model in precision, recall, and running time.
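As a rough illustration of the decoding step, the following Python sketch implements a plain second-order Viterbi search, in which each transition depends on the two preceding states. It is not the improved Viterbi algorithm of the paper, whose details the abstract does not give. The function name `viterbi2`, the parameter layout (`pi`, `a1`, `a2`, `b`), and the toy labels in the usage example are all assumptions made for illustration.

```python
def viterbi2(obs, states, pi, a1, a2, b):
    """Most likely state sequence under a second-order HMM (illustrative sketch).

    Assumed parameter layout (not taken from the paper):
      pi[i]        = P(q_1 = i)
      a1[i][j]     = P(q_2 = j | q_1 = i)
      a2[i][j][k]  = P(q_t = k | q_{t-2} = i, q_{t-1} = j)
      b[j][o]      = P(o | state j)
    Plain probabilities are used for readability; long sequences would
    need log-probabilities to avoid underflow.
    """
    n = len(obs)
    if n == 0:
        return []
    if n == 1:
        return [max(states, key=lambda i: pi[i] * b[i][obs[0]])]

    # delta[(i, j)]: best probability of any path whose last two states are i, j.
    delta = {
        (i, j): pi[i] * b[i][obs[0]] * a1[i][j] * b[j][obs[1]]
        for i in states
        for j in states
    }
    back = []  # back[t - 2][(j, k)] = best state at time t - 2 given (j, k)

    for t in range(2, n):
        new_delta, ptr = {}, {}
        for j in states:
            for k in states:
                best_i = max(states, key=lambda i: delta[(i, j)] * a2[i][j][k])
                new_delta[(j, k)] = (
                    delta[(best_i, j)] * a2[best_i][j][k] * b[k][obs[t]]
                )
                ptr[(j, k)] = best_i
        delta = new_delta
        back.append(ptr)

    # Backtrack from the best final state pair.
    path = list(max(delta, key=delta.get))
    for ptr in reversed(back):
        path.insert(0, ptr[(path[0], path[1])])
    return path


if __name__ == "__main__":
    # Hypothetical toy model: two labels and two token features.
    states = ["TITLE", "OTHER"]
    pi = {"TITLE": 0.6, "OTHER": 0.4}
    a1 = {"TITLE": {"TITLE": 0.7, "OTHER": 0.3},
          "OTHER": {"TITLE": 0.4, "OTHER": 0.6}}
    a2 = {"TITLE": {"TITLE": {"TITLE": 0.1, "OTHER": 0.9},
                    "OTHER": {"TITLE": 0.5, "OTHER": 0.5}},
          "OTHER": {"TITLE": {"TITLE": 0.8, "OTHER": 0.2},
                    "OTHER": {"TITLE": 0.3, "OTHER": 0.7}}}
    b = {"TITLE": {"cap": 0.9, "low": 0.1},
         "OTHER": {"cap": 0.2, "low": 0.8}}
    print(viterbi2(["cap", "low", "cap", "low"], states, pi, a1, a2, b))
```

Tracking pairs of states keeps the search exact at a cost of roughly N^3 operations per observation for N states, compared with N^2 for a first-order model.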
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2012, Issue 3, pp. 196-199, 215 (5 pages)
Funding
National Natural Science Foundation of China (60775041)
Shanxi Province University Science and Technology Development Project (20101120)
Keywords
Web information extraction
Genetic algorithm
Second-order hidden Markov model
Hierarchy