Reviews of Relevance Algorithm in Focused Crawler (主题爬虫相关度算法研究综述)

Cited by: 6

Abstract: This paper first describes the goal of relevance algorithms in focused crawlers and what relevance calculation entails. Then, taking an evolutionary view of information processing and following the handling of information feature items, it systematically analyzes current relevance calculation methods for focused crawlers at three levels (the character layer, the language layer, and the semantic layer) and compares the advantages and disadvantages of the algorithms across these levels. Finally, it summarizes existing research results and suggests directions for further work.

Authors: WANG Shuai (王帅), ZHOU Guomin (周国民), WANG Jian (王健)

Affiliation: Agricultural Information Institute, Chinese Academy of Agricultural Sciences (中国农业科学院农业信息研究所)

Source: Computer and Modernization (《计算机与现代化》), 2013, No. 4, pp. 27-30, 39 (5 pages in total)

Funding: Special Fund for Basic Scientific Research Operating Expenses of Public-Welfare Research Institutes (2012-J-06)

Keywords: relevance algorithm; focused crawler; concept

Classification code: TP391.3 [Automation and Computer Technology: Computer Application Technology]
