Abstract
As one of the key technologies of vertical search, structured information extraction from web pages has attracted growing attention in recent years. It breaks a web page apart, extracts fine-grained, itemized information from it, and stores that information in a database; queries against the database then give vertical search its precision. Most existing methods are based on rules or on hidden Markov models. These methods either depend on a particular page structure and therefore generalize poorly, or depend on a large number of training samples and therefore train inefficiently. Exploiting the fact that a vertical search domain has a limited number of feature words, and combining this with statistical methods, this paper proposes a structured information extraction technique based on feature-word statistics. The technique removes the restriction of extracting only nodes with specific HTML tags or only a single information block, and achieves an average accuracy of 97% in extracting key information blocks.
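The abstract does not give the algorithm itself, so the following is only a minimal sketch of what feature-word statistics over page blocks might look like, not the authors' method. It assumes that a page is split into blocks at a handful of block-level tags, that a block's score is its raw count of feature-word occurrences, and that every block reaching a threshold is kept, which allows several blocks under different tags to be returned. The names `BLOCK_TAGS`, `BlockCollector`, `extract_key_blocks`, and `min_hits` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate information blocks.
BLOCK_TAGS = {"div", "p", "td", "li"}


class BlockCollector(HTMLParser):
    """Collect the text that sits directly inside each block-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, text parts) for currently open blocks
        self.blocks = []  # (tag, text) for blocks that have been closed

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append((tag, []))

    def handle_data(self, data):
        # Text is credited to the innermost open block only.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            closed_tag, parts = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            if text:
                self.blocks.append((closed_tag, text))


def count_hits(text, feature_words):
    """Count occurrences of all feature words in a block's text."""
    return sum(text.count(word) for word in feature_words)


def extract_key_blocks(html, feature_words, min_hits=2):
    """Return every (tag, text) block with at least min_hits feature-word hits."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return [
        (tag, text)
        for tag, text in collector.blocks
        if count_hits(text, feature_words) >= min_hits
    ]


if __name__ == "__main__":
    page = (
        "<div>Home | Login | About</div>"
        "<div>Position: engineer. Salary: 8000. Location: Meizhou</div>"
        "<p>Company: ACME. Salary negotiable</p>"
    )
    words = ["Position", "Salary", "Location", "Company"]
    for tag, text in extract_key_blocks(page, words):
        print(tag, "->", text)
```

Because the score depends only on the words in a block, the sketch is not tied to one HTML tag or to a single block per page; a real system would need a domain-specific feature-word list and a tuned threshold.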
Source
《嘉应学院学报》
2011, No. 2, pp. 18-21 (4 pages)
Journal of Jiaying University
Funding
Natural Science Foundation of Guangdong Province project (9251401501000002)
Joint Natural Science Research Project of the Meizhou Science and Technology Bureau and Jiaying University (08KJ08)
Keywords
vertical search
information extraction
structured
feature words
statistics