Centroid-Based Focused Crawler with Incremental Ability (基于质心向量的增量式主题爬行)

Cited by: 4

Abstract: This paper studies how to crawl selectively within a Web page. Document feature weights and centroid feature weights are calculated using the proposed TFIDF-2 model and three heuristic rules: Max, Ave, and Sum. From these two weights, a centroid vector corresponding to the root set of documents is constructed and used as a front-end classifier to guide a focused crawler. The front-end and back-end classifiers each score the anchor text of every URL in the frontier, the two scores for the same URL are summed, and the crawler selects the URL with the highest combined score and downloads the corresponding Web page. Four series of experiments are conducted. The results show that, guided by the centroid vector, the focused crawler can efficiently and accurately predict the relevance of a linked Web page from the URL's anchor text alone. In addition, the two-classifier framework gives the focused crawler an incremental crawling ability, one of the most important and interesting features that focused crawling needs to address.
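The scoring loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the TFIDF-2 model, so plain TF-IDF is used as a stand-in, and the way Max, Ave, and Sum are applied here (aggregating each term's document weights across the root set) is an assumed reading. The back-end classifier is a placeholder that always returns 0, and all function names, URLs, and sample texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real crawler would also strip markup and stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    # Plain TF-IDF per document (assumption: stand-in for the paper's TFIDF-2 model).
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def centroid(vectors, rule="ave"):
    # Combine per-document term weights into one centroid weight per term.
    # The Max / Ave / Sum semantics are assumed, not taken from the paper.
    result = {}
    for term in set().union(*vectors):
        weights = [v.get(term, 0.0) for v in vectors]
        if rule == "max":
            result[term] = max(weights)
        elif rule == "sum":
            result[term] = sum(weights)
        elif rule == "ave":
            result[term] = sum(weights) / len(weights)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return result


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_next_link(frontier, front_score, back_score):
    # frontier maps URL -> anchor text; return the URL whose summed score is highest.
    return max(
        frontier,
        key=lambda url: front_score(frontier[url]) + back_score(frontier[url]),
    )


# Hypothetical root set and frontier, for illustration only.
root_set = [
    "focused crawler topic web",
    "web crawler anchor text",
    "topic classification centroid",
]
topic_centroid = centroid(tfidf_vectors(root_set), rule="ave")


def front_end(anchor_text):
    # Front-end classifier: similarity of the anchor text to the centroid vector.
    return cosine(dict(Counter(tokenize(anchor_text))), topic_centroid)


def back_end(anchor_text):
    # Placeholder for the back-end classifier, which the abstract does not specify.
    return 0.0


frontier = {
    "http://example.com/a": "focused web crawler",
    "http://example.com/b": "cheap flights",
}
best_url = pick_next_link(frontier, front_end, back_end)
```

In this toy frontier the second anchor text shares no terms with the root set, so its front-end score is zero and the first URL is selected. Swapping `rule` between `"max"`, `"ave"`, and `"sum"` changes only how the centroid weights are aggregated, which is one way to compare the three heuristics under these assumptions.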

Authors: Wang Hui (王辉), Zuo Wanli (左万利), Wang Huiyu (王晖昱), Ning Aijun (宁爱军), Sun Zhiwei (孙志伟), Man Chunlei (满春雷)

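Affiliations: College of Computer Science and Information Engineering, Tianjin University of Science and Technology; College of Computer Science and Technology, Jilin University; Faculty of Informatics, University of Wollongong, Australia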

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2009, No. 2, pp. 217-224 (8 pages)

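Funding: Tianjin University of Science and Technology Research Start-up Fund for Introduced Talents (20080418); Tianjin Higher Education Science and Technology Development Plan Fund (20071303); Jilin Province Science and Technology Development Plan Fund (20070533)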

Keywords: document feature weight; centroid feature weight; focused crawling; anchor text; centroid vector

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]

