Abstract
To address the problem of extracting data from data-intensive Web pages, this paper proposes a tree-structured extraction rule that fits the XML structure yet remains fairly general. The rule extracts data from data-intensive Web pages and integrates it into XML documents that follow a specified schema. Using a semi-structured Web information extraction method based on learning from examples, the authors developed a prototype XML-based system for querying new books on the Web. The system extracts Web pages with good results. It can be applied directly to extracting information from specialized Web sites, and it can also serve the data-preparation stage of other related applications.
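The abstract does not give the rule format, so the following is only a minimal sketch of what a tree-structured extraction rule might look like, not the authors' method. Every name in it (`RULE`, `apply_rule`, `extract`, the sample page and its class names) is a hypothetical chosen for illustration. The rule is written by hand rather than learned from examples, and the page is assumed to be well-formed XHTML so that the standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule tree: each node names an output XML element, gives an
# ElementTree path into the page, and may carry child rules.
RULE = {
    "name": "book",
    "path": ".//div[@class='book']",
    "children": [
        {"name": "title", "path": "./h2"},
        {"name": "author", "path": "./span[@class='author']"},
        {"name": "price", "path": "./span[@class='price']"},
    ],
}


def apply_rule(rule, source, target):
    """Apply one rule node to `source`, appending the results under `target`."""
    for match in source.findall(rule["path"]):
        out = ET.SubElement(target, rule["name"])
        children = rule.get("children")
        if children:
            # Inner rule node: recurse relative to the matched page element.
            for child in children:
                apply_rule(child, match, out)
        else:
            # Leaf rule node: copy the matched element's text.
            out.text = "".join(match.itertext()).strip()


def extract(page_xhtml, rule, root_name="books"):
    """Build an XML tree (rooted at `root_name`) from a well-formed page."""
    page = ET.fromstring(page_xhtml)
    root = ET.Element(root_name)
    apply_rule(rule, page, root)
    return root


# Made-up sample page, assumed well-formed for this sketch.
PAGE = """<html><body>
<div class="book"><h2>XML Basics</h2><span class="author">A. Author</span><span class="price">30.00</span></div>
<div class="book"><h2>Web Mining</h2><span class="author">B. Writer</span><span class="price">45.50</span></div>
</body></html>"""

result = extract(PAGE, RULE)
print(ET.tostring(result, encoding="unicode"))
```

Because the rule tree mirrors the shape of the target XML document, each rule node maps to exactly one output element. Real Web pages are rarely well-formed, so a practical system would need a tolerant HTML parser in front of this step.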
Source
《郑州轻工业学院学报(自然科学版)》
CAS
2008, No. 3, pp. 31-35 (5 pages)
Journal of Zhengzhou University of Light Industry: Natural Science
Funding
Supported by the Natural Science Foundation of Henan Province (0411010500)