期刊文献+

针对模板生成网页的一种数据自动抽取方法(英文) 被引量:45

Automatic Data Extraction from Template-Generated Web Pages
在线阅读 下载PDF
导出
摘要 当前,Web上的很多网页是动态生成的,网站根据请求从后台数据库中选取数据并嵌入到通用的模板中,例如电子商务网站的商品描述网页.研究如何从这类由模板生成的网页中检测出其背后的模板,并将嵌入的数据(例如商品名称、价格等等)自动地抽取出来.给出了模板检测问题的形式化描述,并深入分析模板产生网页的结构特征.提出了一种新颖的模板检测方法,并利用检测出的模板自动地从实例网页中抽取数据.与其他已有方法相比,该方法能够适用于"列表页面"和"详细页面"两种类型的网页.在两个第三方的测试集上进行了实验,结果表明,该方法具有很高的抽取准确率. A substantial fraction of the Web consists of pages that are dynamically generated using a common template populated with data from databases, such as product description pages on e-commerce sites. The objective of the proposed research is to automatically detect the template behind these pages and extract embedded data (e.g., product name, price...). The template detection problem is formalized and an analysis of the underlying structure of template-generated pages is made. A template detection approach is presented and the detected templates are used to extract data from instance pages. Comparing with many other existing work, the approach is applicable for both "list pages" and "detail pages". Experimental results on two large third-party test beds show that the approach can achieve high extraction accuracy.
出处 《软件学报》 EI CSCD 北大核心 2008年第2期209-223,共15页 Journal of Software
基金 Supported by the National Basic Research Program of China under Grant No.2007CB310804 (国家重点基础研究发展计划(973)) the National Natural Science Foundation of China under Grant No.60573117 (国家自然科学基金重大研究计划) the National High-Tech Research and Development Plan of China under Grant No.2006AA01A106 (国家高技术研究发展计划(863))
关键词 WEB 自动数据抽取 信息抽取 模板发现 Wrapper生成 Web automatic data extraction information extraction template detection wrapper generation
  • 相关文献

参考文献12

  • 1Chang CH, Kayed M, Girgis MR, Shaalan K. A survey of Web information extraction systems. IEEE Trans. on Knowledge and Data Engineering, 2006,18(10): 1411-1428.
  • 2Gold ME. Language identification in the limit. Information and Control, 1967,10(5):447-474.
  • 3Laender AHF, Ribeiro-Neto BA, da Silva AD, Teixeira JS. A brief survey of Web data extraction tools. SIGMOD Record, 2002,31 (2):84-93.
  • 4Arasu A, Hector GM. Extracting structured data from Web pages. In: Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. San Diego: ACM Press, 2003. 337-348.
  • 5EXALG datasets, http://infolab.stanford.edu/-arvind/extract/
  • 6TBDW v1.02, http://daisen.cc.kyushu-u.ac.jp/TBDW/testbed/
  • 7Zhao HK, Meng WY, Wu ZH, Raghavan V, Yu C. Fully automatic wrapper generation for search engines. In: Proc. of the 14th Int'l Conf. on World Wide Web (WWW 2005). Chiba: ACM Press, 2005.66-75.
  • 8Simon K, Lausen G. VIPER: Augmenting automatic information extraction with visual perceptions. In: Proc. of the ACM CIKM Int'l Conf. on Information and Knowledge Management. Bremen: ACM Press, 2005. 381-388.
  • 9Crescenzi V, Mecca G, Meraldo P. RoadRunner: Towards automatic data extraction from large Web sites. In: Proc. of the 27th Int'l Conf. on Very Large Data Bases (VLDB 2001). Roma: Morgan Kaufmann Publishers, 2001. 109-118.
  • 10Wang JY, Lochovsky FH. Data extraction and label assignment for Web databases. In: Proc. of the 12th Int'l World Wide Web Conf. (WWW 2003). Budapest: ACM Press, 2003. 187-196.

同被引文献328

引证文献45

二级引证文献196

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部