
Centroid-Based Focused Crawler with Incremental Ability (基于质心向量的增量式主题爬行)
Cited by: 4
Abstract This paper studies how to crawl selectively within a Web page. Document feature weights and centroid feature weights are computed with the proposed TFIDF-2 model and three heuristic rules, Max, Ave, and Sum. From these two weights, a centroid vector corresponding to the root-set documents is constructed and used as a front-end classifier to guide a focused crawler. The front-end and back-end classifiers each score the anchor text of every URL in the frontier; the two scores for each URL are summed, and the URL with the highest combined score is selected and its Web page downloaded. Four series of experiments are conducted. The results show that, guided by the centroid vector, the focused crawler can efficiently and accurately predict the relevance of the page a link points to from the link's anchor text alone. In addition, the two-classifier framework gives the crawling strategy an incremental crawling ability, an important capability that focused crawling needs to address.
Source Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2009, No. 2, pp. 217-224 (8 pages)
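Funding Tianjin University of Science and Technology Research Start-up Fund for Introduced Talents (20080418); Tianjin Higher Education Science and Technology Development Plan Fund (20071303); Jilin Province Science and Technology Development Plan Fund (20070533)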
Keywords document feature weight; centroid feature weight; focused crawling; anchor text; centroid vector
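
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define TFIDF-2, so plain smoothed TF-IDF is used here as a stand-in; reading Max, Ave, and Sum as element-wise maximum, mean, and sum over the root-set document weights is an assumption; cosine similarity as the front-end score is also an assumption; and `back_end_score` is a hypothetical callable standing in for the back-end classifier. All function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real crawler would need proper tokenization.
    return text.lower().split()


def tfidf_vectors(docs):
    # Plain smoothed TF-IDF per document (a stand-in, NOT the paper's TFIDF-2).
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for tf in counts for term in tf)
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        for tf in counts
    ]


def centroid(vectors, rule="ave"):
    # Combine document feature weights into one centroid weight per term.
    # Assumed reading of the heuristics: element-wise max, mean, or sum.
    terms = set().union(*vectors)
    out = {}
    for t in terms:
        weights = [v.get(t, 0.0) for v in vectors]
        if rule == "max":
            out[t] = max(weights)
        elif rule == "sum":
            out[t] = sum(weights)
        elif rule == "ave":
            out[t] = sum(weights) / len(weights)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return out


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def front_end_score(anchor_text, centroid_vec):
    # Front-end classifier: similarity between anchor text and the centroid.
    return cosine(Counter(tokenize(anchor_text)), centroid_vec)


def pick_next(frontier, centroid_vec, back_end_score):
    # frontier: list of (url, anchor_text). Sum both scores, take the best link.
    return max(
        frontier,
        key=lambda item: front_end_score(item[1], centroid_vec)
        + back_end_score(item[1]),
    )


# Usage with made-up data; the back-end classifier is stubbed to return 0.
root_set = ["focused crawler anchor text", "focused crawling of web pages"]
frontier = [
    ("http://example.org/a", "focused web crawler tutorial"),
    ("http://example.org/b", "cheap flights hotel"),
]
best = pick_next(frontier, centroid(tfidf_vectors(root_set), "ave"), lambda text: 0.0)
print(best)
```

Because the two scores are simply added, the back-end classifier can in principle be retrained or replaced as pages are downloaded without rebuilding the centroid. The abstract attributes the incremental crawling ability to this two-classifier framework but gives no further detail, so that reading is only a plausible interpretation.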
