Abstract
This paper proposes an information extraction technique based on an improved k-contextual tree automata inference algorithm. The key idea is to convert structured or semi-structured documents into trees, then use an improved k-contextual tree language, called the KLH tree language, to construct an unranked tree automaton that accepts the sample trees. Whether a piece of web page information is extracted is determined by the automaton's accepting and rejecting states. The method makes full use of the tree structure of web documents and, by way of tree automata, combines traditional purely structure-based extraction with grammatical inference to obtain extraction rules. Experiments show that, compared with similar extraction methods, the approach shortens both the time needed to learn from samples and the time needed for extraction.
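The general idea of learning a locally testable tree language from samples and then accepting or rejecting new trees can be sketched as follows. This is only an illustrative simplification, not the paper's KLH algorithm, whose details the abstract does not give. It assumes trees are nested `(label, children)` tuples, and it uses "the top k levels below each node" as the local pattern. The names `truncate`, `patterns`, `learn` and `accepts` are hypothetical.

```python
# Simplified sketch: learn the set of depth-k local patterns seen in sample
# trees, then accept a new tree only if all of its patterns were seen.
# Assumption: a tree is (label, children) with children given as a tuple.

def truncate(tree, k):
    """Return the top k levels of `tree` as a hashable nested tuple."""
    label, children = tree
    if k <= 1:
        return (label, ())
    return (label, tuple(truncate(child, k - 1) for child in children))

def patterns(tree, k):
    """Collect the depth-k truncation rooted at every node of `tree`."""
    _, children = tree
    found = {truncate(tree, k)}
    for child in children:
        found |= patterns(child, k)
    return found

def learn(samples, k):
    """Build a model: allowed local patterns plus allowed root patterns."""
    allowed = set()
    for sample in samples:
        allowed |= patterns(sample, k)
    roots = {truncate(sample, k) for sample in samples}
    return allowed, roots

def accepts(model, tree, k):
    """Accept `tree` if its root pattern and all local patterns were learned."""
    allowed, roots = model
    return truncate(tree, k) in roots and patterns(tree, k) <= allowed

# One sample: a table row whose second cell holds a bold element.
row1 = ("tr", (("td", ()), ("td", (("b", ()),))))
# Not in the samples, but built only from patterns seen in row1.
row2 = ("tr", (("td", (("b", ()),)), ("td", (("b", ()),))))
# Contains an unseen pattern (td with an i child).
bad = ("tr", (("td", ()), ("td", (("i", ()),))))

model = learn([row1], 2)
print(accepts(model, row1, 2), accepts(model, row2, 2), accepts(model, bad, 2))
```

In this sketch a larger k makes the learned language stricter, because patterns carry more context, while a smaller k generalizes more from the same samples. An extraction system in the spirit of the abstract would mark the target node in each sample and extract a node from a new page only when the tree with that node marked is accepted.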
Source
《计算机工程与应用》
CSCD
Peking University Core Journals (北大核心)
2010, Issue 16, pp. 153-156 (4 pages)
Computer Engineering and Applications
Keywords
tree automata inference algorithm
(semi-)structured documents
unranked tree automata
information extraction
KLH tree language