
Incremental Decision Tree of Web Key Resource Pages (增量式关键资源页面判定树)
Abstract: To cope with the ever-growing volume of information on the Web, this paper reviews earlier algorithms and proposes an incremental updating algorithm for the key resource page decision tree. The new algorithm uses Web link analysis to select suitable Web page attributes, constructs the test attribute values of the decision tree from statistics on negative instances, and applies the ID5R algorithm to the machine learning task of key resource page identification, in which training instances keep arriving. A pruning strategy suited to the algorithm is also designed: it introduces a negative-instance ratio, updates it continually, and stops splitting once the ratio falls below a suppression factor (threshold), which avoids excessive tree growth along with the poor noise tolerance and poor generalization that come with it. Experiments show that the incremental updating algorithm generates the key resource page decision tree more efficiently, and that incremental training makes it possible to select training instances more carefully, which can result in smaller decision trees. The paper closes with a discussion of the algorithm's application areas.
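Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), CSSCI, Peking University Core Journal, 2009, No. 3, pp. 469-474 (6 pages)
Funding: Scientific Research Project of the Hunan Provincial Department of Education (2007C525); Hunan Provincial Education Science Planning Project (XJK06BJGl03); Hunan Provincial Undergraduate Research-Based Learning and Innovative Experiment Program.
Keywords: key resource pages, decision tree, incremental updating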
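
The pruning rule described in the abstract (keep a negative-instance ratio up to date and stop splitting when it drops below a suppression factor) can be illustrated with a minimal sketch. This is not the paper's implementation and it does not reproduce ID5R itself. The names `NodeStats` and `should_split`, the definition of the ratio as negatives divided by all instances at a node, and the threshold values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Running class counts for the training instances that reach one tree node.

    Hypothetical helper: the paper's abstract does not specify a data structure.
    """
    positives: int = 0
    negatives: int = 0

    def update(self, is_key_resource: bool) -> None:
        # Incremental update: one newly labelled page arrives at this node.
        if is_key_resource:
            self.positives += 1
        else:
            self.negatives += 1

    @property
    def negative_ratio(self) -> float:
        # Assumed definition: share of negative instances among all seen so far.
        total = self.positives + self.negatives
        return self.negatives / total if total else 0.0


def should_split(stats: NodeStats, suppression_factor: float) -> bool:
    """Return False (stop splitting) once the negative ratio falls below the factor."""
    return stats.negative_ratio >= suppression_factor


# Nine key resource pages and one negative instance arrive one at a time.
stats = NodeStats()
for label in [True, True, True, False, True, True, True, True, True, True]:
    stats.update(label)

print(stats.negative_ratio)
print(should_split(stats, 0.2))   # ratio is below 0.2, so the node stops splitting
print(should_split(stats, 0.05))  # ratio is above 0.05, so the node may still split
```

Because the counts are updated per instance, the stopping test can be re-evaluated every time a new training page arrives, without rescanning earlier samples.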

