Abstract
Building on existing focused crawlers, this paper proposes a focused crawler based on a probabilistic model. The crawler combines several kinds of feature information gathered during crawling and uses the probabilistic model to compute a priority value for each URL, which is then used to filter and order URLs. This addresses a weakness of most existing crawlers, which rely on a single fetching strategy. Unlike earlier focused crawlers, the proposed crawler uses a history evaluation indicator and a web page quality indicator in addition to the topic relevance indicator, so it handles the "topic drift" and "tunneling" problems reasonably well while maintaining the quality of the collected resources. Several groups of experiments show that, compared with other focused crawlers, the proposed crawler gathers more topic-relevant pages while retrieving fewer pages overall, giving a higher recall of topic pages, and achieves a higher average topic relevance.
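The abstract does not give the model's formula, so the following is only a minimal sketch of the general idea: three per-URL indicators (topic relevance, history evaluation, page quality) are combined into one priority value, which is then used to filter and order the crawl frontier. The weighted-mixture formula, the weights, the threshold, and every name here (`UrlFeatures`, `priority`, `Frontier`) are assumptions made for illustration and are not taken from the paper.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlFeatures:
    """Hypothetical per-URL indicators, each assumed to lie in [0, 1]."""
    url: str
    topic: float    # estimated topic relevance of the link
    history: float  # relevance of the pages that led to this link
    quality: float  # estimated quality of the target page


# Assumed weights for illustration only; the paper's model may differ.
WEIGHTS = (0.5, 0.3, 0.2)


def priority(f, weights=WEIGHTS):
    """Combine the three indicators as a weighted mixture (an assumption)."""
    w_t, w_h, w_q = weights
    return w_t * f.topic + w_h * f.history + w_q * f.quality


class Frontier:
    """Crawl frontier that filters by a threshold and orders by priority."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self._heap = []      # max-heap emulated with negated priorities
        self._seen = set()   # URLs already accepted into the frontier

    def push(self, f):
        """Return True if the URL was accepted, False if it was dropped."""
        if f.url in self._seen:
            return False
        p = priority(f)
        if p < self.threshold:
            return False  # URL filtering: priority too low
        self._seen.add(f.url)
        heapq.heappush(self._heap, (-p, f.url))
        return True

    def pop(self):
        """Return (url, priority) for the highest-priority URL."""
        neg_p, url = heapq.heappop(self._heap)
        return url, -neg_p
```

With these assumed weights, a link whose own topic score is low but whose parent pages were relevant can still clear the threshold, which is one simple way to picture how a history indicator helps with tunneling, while a link that scores low on all three indicators is filtered out.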
Source
《计算机工程与科学》
CSCD
Peking University Core Journal List (北大核心)
2013, No. 1, pp. 160-165 (6 pages)
Computer Engineering & Science
Funding
Supported by the National Natural Science Foundation of China (Grant No. 61170121)
Keywords
focused crawler
probabilistic model
URL filtering
URL ordering
priority value