Webpage Content Extraction Based on DBSCAN (cited by: 6)
Abstract: To address the problem of webpage content extraction, this paper proposes a method that uses a section factor to filter the webpage source into plain-text segments. Each segment is treated as a point in two-dimensional space, and the DBSCAN clustering algorithm clusters these points to obtain the main content. The method has low complexity, does not depend on the layout style of the website, and is highly adaptable. Experiments on major domestic and international news websites show that the method extracts content effectively from both Chinese and English news sites and achieves a high average accuracy.
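Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2011, No. 3, pp. 64-66, 69 (4 pages)
Funding: National Natural Science Foundation of China (Grant No. 60573043)
Keywords: topic-focused crawler; content extraction; DBSCAN algorithm; density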
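
The pipeline outlined in the abstract (filter the source into text segments, map each segment to a 2-D point, cluster with DBSCAN) can be sketched roughly as follows. This is only an illustration, not the authors' implementation. The abstract does not define the section factor or the two coordinates, so the block-tag splitting in `split_segments`, the `(segment index, log of text length)` features, the `eps` and `min_pts` values, and the "largest cluster by character count" rule are all assumptions made for this sketch.

```python
import math
import re

NOISE = -1


def split_segments(html):
    """Assumed stand-in for the section factor: split on block-level tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"(?i)</?(p|div|br|li|h[1-6]|tr|td|ul)\b[^>]*>", "\n", html)
    text = re.sub(r"<[^>]+>", "", text)  # drop remaining inline tags
    return [s.strip() for s in text.split("\n") if s.strip()]


def to_points(segments):
    """Hypothetical 2-D features: (position in page, log of text length)."""
    return [(float(i), math.log(len(s))) for i, s in enumerate(segments)]


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def extract_main_text(html, eps=1.5, min_pts=3):
    """Return the text of the cluster holding the most characters (assumed rule)."""
    segments = split_segments(html)
    labels = dbscan(to_points(segments), eps, min_pts)
    totals = {}
    for seg, label in zip(segments, labels):
        if label != NOISE:
            totals[label] = totals.get(label, 0) + len(seg)
    if not totals:
        return ""
    best = max(totals, key=totals.get)
    return "\n".join(s for s, l in zip(segments, labels) if l == best)
```

With these assumed features, short navigation items and long body paragraphs fall into separate dense regions, so no knowledge of the site's layout is needed. The `eps` and `min_pts` values would have to be tuned on real pages.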