Analysis and Design of a General Extraction Algorithm for BBS

Abstract: BBS-type websites are an important part of the Internet ecosystem. They hold massive amounts of data and are an important source of information. The main question this paper addresses is how to design a general algorithm that extracts valuable information, such as topic posts and replies, from the many different types of forum pages. Based on an in-depth analysis of the structure of different types of web pages, and taking full account of the inconsistency of forum page types, the ease of crawling a single website, and the unreliability of general-purpose crawlers, the paper designs an extraction scheme based on longitudinal analysis of web pages and describes the topic crawler algorithm in detail.
Source: Technology Innovation and Application (《科技创新与应用》), 2018, No. 9, pp. 132-133 (2 pages)
Keywords: BBS; noise processing; cluster analysis; symbol matching
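
The abstract does not spell out the extraction algorithm, so the following is only an illustrative sketch of one common way to pull posts out of a forum page without site-specific rules. It is not the paper's method. It assumes that posts appear as repeated elements sharing the same tag and `class` attribute, groups element text by that signature (a crude stand-in for cluster analysis), and treats single-occurrence blocks such as navigation bars and footers as noise. The names `BlockCollector`, `extract_posts`, `min_repeats`, and the sample page are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped entirely.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BlockCollector(HTMLParser):
    """Collects the text of every element that carries a class attribute."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open elements: (tag, signature, text_parts)
        self.blocks = defaultdict(list)  # signature -> texts of elements with that signature

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        signature = (tag, cls) if cls else None
        self.stack.append((tag, signature, []))

    def handle_data(self, data):
        # Script and style bodies are treated as noise.
        if self.stack and self.stack[-1][0] in ("script", "style"):
            return
        text = data.strip()
        if text:
            # Text belongs to every element that is currently open.
            for _, _, parts in self.stack:
                parts.append(text)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not any(t == tag for t, _, _ in self.stack):
            return
        # Pop until the matching open tag, tolerating unclosed children.
        while self.stack:
            open_tag, signature, parts = self.stack.pop()
            if signature is not None and parts:
                self.blocks[signature].append(" ".join(parts))
            if open_tag == tag:
                break


def extract_posts(html, min_repeats=2):
    """Return the texts of the repeated block group that holds the most text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    candidates = {sig: texts for sig, texts in collector.blocks.items()
                  if len(texts) >= min_repeats}
    if not candidates:
        return []
    best = max(candidates, key=lambda sig: sum(len(t) for t in candidates[sig]))
    return candidates[best]


SAMPLE = """
<html><body>
<div class="nav">Home</div>
<div class="post"><span class="author">alice</span><p>First post body text here</p></div>
<div class="post"><span class="author">bob</span><p>A reply with some more text</p></div>
<div class="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE):
        print(post)
```

On the sample page this selects the two `div.post` blocks and ignores the navigation and footer blocks, because they occur only once. A real forum would need more robust signatures than tag plus class, which is the kind of variation across sites that the abstract identifies as the core difficulty.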