
Improved Web page clustering algorithm based on partial tag tree matching (cited by: 14)
Abstract: Web information extraction requires clustering the pages of a target website in order to detect and generate the templates used for extraction. Traditional page clustering algorithms based on DOM tree edit distance are poorly suited to dynamic, template-generated pages whose Document Object Model (DOM) trees are structurally complex. This paper proposes an improved Web page clustering algorithm based on partial tag tree matching. The algorithm exploits the difference in tree level between template nodes and non-template nodes, assigns each node a matching weight according to how strongly it affects page layout, and uses partial tree matching to compute the structural similarity between pages. Experimental results show that, compared with the traditional clustering algorithm based on DOM tree edit distance, the improved algorithm achieves higher accuracy when clustering template-generated dynamic pages and has lower time complexity.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal; 2010, No. 3, pp. 818-820 (3 pages)
Funding: Natural Science Foundation of Hunan Province (09JJ3123)
Keywords: Web information extraction; Web page clustering; tree edit distance; partial tag tree matching
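
The abstract gives the idea of weighted tag tree matching but not its formulas, so the following is only a minimal sketch under stated assumptions, not the paper's algorithm. It uses a weighted variant of Simple Tree Matching (a dynamic program over ordered children) and assumes a hypothetical weight of `decay ** depth`, so that nodes near the root, which tend to shape the layout, count more than deep content nodes. The names `Node`, `weight`, `weighted_match`, `similarity` and the parameter `decay` are all illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A tag-tree node: an HTML tag name plus its ordered children."""
    tag: str
    children: list[Node] = field(default_factory=list)


def weight(depth: int, decay: float = 0.5) -> float:
    """Hypothetical weighting (an assumption): shallower nodes weigh more."""
    return decay ** depth


def tree_weight(node: Node, depth: int = 0, decay: float = 0.5) -> float:
    """Total weight of every node in the tree rooted at `node`."""
    return weight(depth, decay) + sum(
        tree_weight(child, depth + 1, decay) for child in node.children
    )


def weighted_match(a: Node, b: Node, depth: int = 0, decay: float = 0.5) -> float:
    """Maximum total weight of matched node pairs between two trees,
    preserving parent-child relations and sibling order."""
    if a.tag != b.tag:
        return 0.0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score using the first i children of a and first j of b
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = weighted_match(
                a.children[i - 1], b.children[j - 1], depth + 1, decay
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return weight(depth, decay) + dp[m][n]


def similarity(a: Node, b: Node, decay: float = 0.5) -> float:
    """Normalized structural similarity in [0, 1]."""
    total = tree_weight(a, 0, decay) + tree_weight(b, 0, decay)
    return 2 * weighted_match(a, b, 0, decay) / total


# Two pages from the same (made-up) template, and one unrelated page.
def list_page(items: int) -> Node:
    ul = Node("ul", [Node("li") for _ in range(items)])
    return Node("html", [Node("body", [Node("div", [ul]), Node("div", [Node("p")])])])


page_a = list_page(2)
page_b = list_page(3)
page_c = Node("html", [Node("body", [Node("table", [Node("tr", [Node("td")])])])])

print(similarity(page_a, page_b), similarity(page_a, page_c))
```

In this sketch the two template pages differ only in a deep `li` node and therefore score close to 1, while the unrelated page shares only the `html` and `body` nodes and scores lower. A clustering step could then group pages whose pairwise similarity exceeds a chosen threshold; the threshold and the clustering method are not stated in the abstract.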


