
Research and implementation for focused crawler based on probabilistic model (基于概率模型的主题爬虫的研究和实现)

Cited by: 7
Abstract: Building on a study of existing focused crawlers, this paper proposes a focused crawler based on a probabilistic model. The crawler analyzes several kinds of feature information gathered during the crawl and uses a probabilistic model to compute a priority value for each URL, which is then used to filter and rank URLs. This addresses a shortcoming of most existing crawlers, which rely on a single crawling strategy. Unlike earlier focused crawlers, the proposed crawler uses not only a topic relevance metric but also a history evaluation metric and a page quality metric. This handles the "topic drift" and "tunneling" problems fairly well while maintaining the quality of the collected resources. Several groups of experiments show that, compared with other focused crawlers, it gathers more topic-relevant pages while retrieving fewer pages overall, and achieves a higher average topic relevance.
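Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 1, pp. 160-165 (6 pages)
Funding: National Natural Science Foundation of China (Grant No. 61170121)
Keywords: focused crawler; probabilistic model; URL filtering; URL ordering; priority value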
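
The abstract describes combining topic relevance, history evaluation, and page quality into a single priority value per URL, then filtering and ranking URLs by that value. The record does not give the paper's actual formula, so the following is only a minimal sketch under stated assumptions: the three signals are taken as probabilities in [0, 1] and merged with a naive independence rule under a uniform prior. The names `UrlFeatures`, `priority`, `build_frontier`, and `pop_best`, the threshold of 0.5, and the combination rule itself are all hypothetical and are not taken from the paper.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlFeatures:
    """Hypothetical per-URL evidence; each score is a probability in [0, 1]."""
    url: str
    topic_relevance: float  # e.g. similarity of the link context to the topic
    history: float          # e.g. share of on-topic pages already found via this parent
    quality: float          # e.g. a normalized page quality score


def priority(f: UrlFeatures, eps: float = 1e-6) -> float:
    """Merge the three scores into one probability-like priority value.

    Assumption (not from the paper): the signals are conditionally
    independent under a uniform prior, which gives
    p = prod(p_i) / (prod(p_i) + prod(1 - p_i)).
    """
    yes, no = 1.0, 1.0
    for p in (f.topic_relevance, f.history, f.quality):
        p = min(max(p, eps), 1.0 - eps)  # clamp so a single 0 or 1 cannot decide alone
        yes *= p
        no *= 1.0 - p
    return yes / (yes + no)


def build_frontier(candidates, threshold=0.5):
    """Filter out URLs below the threshold and keep the rest in a heap."""
    heap = []
    for f in candidates:
        score = priority(f)
        if score >= threshold:
            # heapq is a min-heap, so store the negated score to pop the best first.
            heapq.heappush(heap, (-score, f.url))
    return heap


def pop_best(heap):
    """Remove and return (url, priority) for the highest-priority URL."""
    neg_score, url = heapq.heappop(heap)
    return url, -neg_score


if __name__ == "__main__":
    frontier = build_frontier([
        UrlFeatures("http://example.com/a", 0.9, 0.8, 0.7),
        UrlFeatures("http://example.com/b", 0.2, 0.3, 0.4),
        UrlFeatures("http://example.com/c", 0.3, 0.9, 0.8),
    ])
    while frontier:
        print(pop_best(frontier))
```

In this sketch, a link with low topic relevance but strong history and quality scores (the third example URL) still passes the filter. That is one plausible way a multi-signal priority could keep following links through weakly relevant pages, which is the situation the "tunneling" problem refers to; whether the paper's model behaves this way cannot be confirmed from the abstract alone.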
