
Solution for Automatic Web Review Extraction (一种Web评论自动抽取方法)

Cited by: 12
Abstract: Web user reviews are an important information source for many applications, such as monitoring and analyzing public opinion, so they must be extracted accurately from Web pages. Reviews are user-generated content, whose presentation is not restricted by the page template, and this raises new challenges for Web data extraction. First, the inconsistency among the contents of different reviews severely impairs the similarity between review records, both in the DOM tree and in visual appearance. Second, the review content within a record corresponds to a complicated subtree rather than a single node in the DOM tree, and these subtrees differ greatly in structure from one record to another. To address these two problems, a complete solution is proposed that combines several techniques to extract user review content automatically. Extraction proceeds in two steps: review records are first extracted from the Web page using a level-weighted (depth-weighted) tree similarity algorithm, and the pure review content is then extracted from each record by comparing the consistency of nodes in the DOM tree. Experimental results on multiple news and forum Web sites indicate that the method achieves high extraction accuracy and efficiency.
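Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and PKU Core; 2010, No. 12, pp. 3220-3236 (17 pages)
Funding: National High Technology Research and Development Program of China (863 Program), No. 2008AA01Z421; China Postdoctoral Science Foundation, Nos. 20080440256, 200902014
Keywords: Web user review; structured data record; Web data extraction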
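
The first extraction step in the abstract relies on a level-weighted tree similarity, but this record does not give the formula. The sketch below is therefore only an illustration of the general idea, not the authors' algorithm. It assumes that similarity is computed level by level from the overlap of tag names and that each level's weight shrinks geometrically with depth. The `Node` class, the `decay` parameter, and both function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal stand-in for a DOM element: a tag name plus child nodes."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def tags_by_level(root: Node) -> list[Counter]:
    """Return one Counter of tag names per depth, root level first."""
    levels = []
    frontier = [root]
    while frontier:
        levels.append(Counter(n.tag for n in frontier))
        frontier = [c for n in frontier for c in n.children]
    return levels


def level_weighted_similarity(a: Node, b: Node, decay: float = 0.5) -> float:
    """Weighted mean of per-level tag overlap; level d gets weight decay**d.

    Assumption: overlap at a level is the multiset intersection size
    divided by the multiset union size of the tag names at that level.
    """
    la, lb = tags_by_level(a), tags_by_level(b)
    score = total = 0.0
    for d in range(max(len(la), len(lb))):
        ca = la[d] if d < len(la) else Counter()
        cb = lb[d] if d < len(lb) else Counter()
        shared = sum((ca & cb).values())
        union = sum((ca | cb).values())  # non-zero: one tree has nodes here
        weight = decay ** d
        score += weight * shared / union
        total += weight
    return score / total


# Two hypothetical review records: same outer layout, different user content.
rec1 = Node("div", [Node("span"), Node("div", [Node("p")])])
rec2 = Node("div", [Node("span"),
                    Node("div", [Node("p"), Node("p"), Node("img")])])

weighted = level_weighted_similarity(rec1, rec2, decay=0.5)
unweighted = level_weighted_similarity(rec1, rec2, decay=1.0)
print(weighted, unweighted)
```

In this toy case the two records differ only in the deepest level, where the user-written content sits. With `decay=0.5` that level contributes less, so the weighted score is higher than the unweighted one, which is the effect the abstract's first challenge calls for.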
