Abstract
BBS-type websites are an important part of the Internet ecosystem. They hold a huge amount of data and are an important source of information. The main concern of this paper is how to design a general algorithm that extracts valuable information, such as topic posts and replies, from these different types of forum pages. Based on an in-depth analysis of the structure of different types of web pages, and taking full account of the inconsistency of forum page types, the ease of crawling a single website, and the unreliability of general-purpose crawlers, the paper designs an extraction scheme based on longitudinal analysis of web pages and describes the algorithm of the topic crawler in detail.
Source
Technology Innovation and Application (《科技创新与应用》)
2018, No. 9, pp. 132-133 (2 pages)
Keywords
BBS
noise processing
cluster analysis
symbol matching