
Web Page Deduplication Strategy (cited by: 13)

The Strategy on Processing Replicated Web Collections
Abstract: This paper proposes a strategy for building a crawler that collects non-replicated Web pages by combining same-source page deduplication with content deduplication. Same-source pages are removed by hashing their URLs, for which a novel hash function is proposed; pages with identical or near-identical content are detected with a judgment based on subject concepts. The URL-based filter allows the crawling process to be parallelized while effectively minimizing overlap, and the content-oriented filter identifies near-duplicate collections. Experimental results show that the approach is feasible and that it deduplicates well. Based on these algorithms, an educational news search engine for an educational resource repository was implemented.
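Source: Journal of Shanghai Jiaotong University (《上海交通大学学报》), indexed in EI, CAS, CSCD, and the Peking University Core list; 2006, No. 5, pp. 775-777, 782 (4 pages)
Funding: National High Technology Research and Development Program of China (863 Program), project 2002AA119050
Keywords: information retrieval; search engine; hash function; subject concept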
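
The two filters described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the record does not give the proposed hash function or the subject-concept comparison, so MD5 over a normalized URL and Jaccard similarity over keyword sets are stand-ins, and every name here (`normalize_url`, `url_hash`, `UrlFilter`, `is_near_duplicate`, the `0.8` threshold) is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings hash alike."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def url_hash(url):
    """64-bit hash of the normalized URL (MD5 is an assumed stand-in)."""
    digest = hashlib.md5(normalize_url(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class UrlFilter:
    """Same-source filter: remembers URL hashes and partitions URLs."""

    def __init__(self, num_crawlers=1):
        self.num_crawlers = num_crawlers
        self.seen = set()

    def owner(self, url):
        # Each URL maps to exactly one crawler, so parallel crawlers
        # do not fetch the same page twice.
        return url_hash(url) % self.num_crawlers

    def add_if_new(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        h = url_hash(url)
        if h in self.seen:
            return False
        self.seen.add(h)
        return True


def is_near_duplicate(terms_a, terms_b, threshold=0.8):
    """Content filter: Jaccard overlap of two pages' keyword sets
    (an assumed proxy for the subject-concept comparison)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return True
    return len(a & b) / len(a | b) >= threshold
```

In this sketch a crawler would call `add_if_new` before fetching a URL and `is_near_duplicate` after extracting keywords from the fetched page. Storing only the hash keeps memory use small, at the cost of a very small chance that two distinct URLs collide and one is skipped.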
