
Detecting Near-replicas of Web Pages Based on Keywords

Cited by: 3
Abstract: Replicas and near-replicas of documents are very common on the Web. To detect near-replicas among the large-scale collections of web pages crawled by a search engine, the paper proposes a similarity computation model built on an inverted index, using feature keywords extracted from the main topical content of each page. When a new web document enters the collection, the keywords it contains are used to quickly narrow the set of pages that must be compared, which improves computational efficiency. Experimental results indicate that the algorithm is effective, and a small-scale evaluation gave good results.
Source: 《微计算机应用》 (Microcomputer Applications), 2008, No. 2, pp. 41-45 (5 pages)
Keywords: near-replica web pages; near-replica detection; web page deduplication; vector space model; search engine
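
The abstract only outlines the approach, so the following is a minimal sketch of the general idea rather than the paper's actual model: an inverted index maps each keyword to the pages containing it, a new page is compared only against pages that share at least one keyword, and similarity is computed as a cosine over keyword weights (a common vector space model choice, assumed here). The class name `NearReplicaIndex`, the weights, and the 0.8 threshold are all hypothetical.

```python
import math
from collections import defaultdict


class NearReplicaIndex:
    """Toy keyword-based near-replica detector (illustrative, not the paper's model)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold        # assumed similarity cutoff
        self.postings = defaultdict(set)  # inverted index: keyword -> doc ids
        self.vectors = {}                 # doc id -> {keyword: weight}

    @staticmethod
    def _cosine(a, b):
        """Cosine similarity between two sparse keyword-weight vectors."""
        dot = sum(w * b[k] for k, w in a.items() if k in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def find_near_replicas(self, keywords):
        """Return (doc_id, similarity) pairs at or above the threshold."""
        # Narrow the search: only documents sharing a keyword are candidates.
        candidates = set()
        for k in keywords:
            candidates |= self.postings.get(k, set())
        hits = []
        for doc_id in candidates:
            sim = self._cosine(keywords, self.vectors[doc_id])
            if sim >= self.threshold:
                hits.append((doc_id, sim))
        return sorted(hits, key=lambda pair: pair[1], reverse=True)

    def add(self, doc_id, keywords):
        """Store a document's keyword weights and update the inverted index."""
        self.vectors[doc_id] = dict(keywords)
        for k in keywords:
            self.postings[k].add(doc_id)


# Usage: keyword weights would come from a keyword extractor (not shown).
index = NearReplicaIndex(threshold=0.8)
index.add("page_a", {"search": 2.0, "engine": 1.0, "index": 1.0})
index.add("page_b", {"football": 3.0, "league": 1.0})
new_page = {"search": 2.0, "engine": 1.0, "index": 0.5}
print(index.find_near_replicas(new_page))
```

Because candidates are drawn from the posting lists of the new page's keywords, pages with no keyword in common are never scored, which is the efficiency gain the abstract describes.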