Abstract
Building on an in-depth study of web information gathering techniques, this paper proposes a news gathering model based on Web structure. After loading the gathering entry address, the model locates the news list page using information gathering and filtering algorithms, then automatically identifies and refines the link addresses of news content pages according to preset gathering rules and regular expressions. It then visits each target news content page and automatically extracts the news data with a gathering algorithm. It also filters out extraneous information embedded in the page, such as advertisements. Practical results show that the proposed model works well and gathers news information automatically and efficiently.
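The abstract does not give the model's rules or algorithms, so the following is only a minimal sketch of the general idea: match content-page links on a list page with a regular expression, then strip advertising blocks before extracting text. The URL pattern, the `ad` class marker, and the function names `extract_content_links` and `extract_text` are all assumptions made for illustration, not the paper's method.

```python
import re
from urllib.parse import urljoin

# Assumed rule: content pages look like .../2012/0315/1001.html (hypothetical pattern).
CONTENT_LINK = re.compile(
    r'href=["\']([^"\']*/\d{4}/\d{4}/\d+\.html)["\']', re.IGNORECASE
)
# Assumed rule: ads sit in a <div> whose class list contains the word "ad".
# This simple pattern does not handle nested <div> elements.
AD_BLOCK = re.compile(
    r'<div[^>]*class=["\'][^"\']*\bad\b[^"\']*["\'][^>]*>.*?</div>',
    re.IGNORECASE | re.DOTALL,
)
TAG = re.compile(r'<[^>]+>')


def extract_content_links(list_html, base_url):
    """Return absolute content-page URLs found on a list page, in order, without duplicates."""
    links = []
    for href in CONTENT_LINK.findall(list_html):
        url = urljoin(base_url, href)  # resolve relative links against the list page
        if url not in links:
            links.append(url)
    return links


def extract_text(content_html):
    """Drop assumed ad blocks, remove remaining tags, and collapse whitespace."""
    without_ads = AD_BLOCK.sub('', content_html)
    text = TAG.sub(' ', without_ads)
    return ' '.join(text.split())
```

In this sketch the list page and content page are passed in as HTML strings; fetching them from the entry address is left out. A real collector would also need per-site rules, since the link pattern and ad markers differ between news sites.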
Author
陈建国
CHEN Jian-guo (Software School of Hunan University, Changsha, Hunan 410082, China; Xiamen University of Technology, Xiamen, Fujian 361021, China)
Source
《井冈山大学学报(自然科学版)》
2012, No. 2, pp. 54-57 (4 pages)
Journal of Jinggangshan University (Natural Science)
Keywords
information gathering
Web structure
regular expressions
data mining
news gathering