
An improved algorithm for weighting keywords in web documents

Abstract: This paper presents an improved algorithm, the web-based keyword weight algorithm (WKWA), for weighting keywords in web documents. WKWA takes into account the representation features of web documents and combines the advantages of the TF*IDF, TFC and ITC algorithms to make it better suited to web documents. The algorithm is also applied to an improved vector space model (IVSM). A real system has been implemented to calculate the semantic similarity of web documents. Four experiments were carried out: keyword weight calculation, feature item selection, semantic similarity calculation, and WKWA time performance. The results show that the accuracy of both keyword weighting and semantic similarity is improved.
Source: Journal of Shanghai University (English Edition) (CAS), 2008, Issue 3, pp. 235-239 (5 pages)
Funding: Project supported by the Science Foundation of Shanghai Municipal Commission of Science and Technology (Grant No. 055115001)
Keywords: improved vector space model (IVSM), representation feature, feature item, keyword weight, semantic similarity
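
The abstract names TF*IDF, TFC and ITC as the baselines that WKWA builds on, but does not give the WKWA formula itself. As a rough illustration only, the sketch below implements the commonly cited textbook forms of those three baselines (TFC as cosine-normalized TF*IDF, ITC as the same with a logarithmic term frequency) together with cosine similarity in a plain vector space model. The function names, the toy documents, and the exact formula variants are assumptions made for this example; it is not the WKWA or IVSM implementation described in the paper.

```python
import math
from collections import Counter


def _doc_freq(docs):
    """Count, for each term, how many documents contain it."""
    return Counter(term for doc in docs for term in set(doc))


def _normalize(vec):
    """Scale a {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tf_idf(docs):
    """Plain TF*IDF: weight = tf * log(N / df). (Textbook form, assumed.)"""
    n_docs, df = len(docs), _doc_freq(docs)
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def tfc(docs):
    """TFC: TF*IDF with each document vector normalized to unit length."""
    return [_normalize(vec) for vec in tf_idf(docs)]


def itc(docs):
    """ITC: like TFC, but with log(tf + 1) in place of the raw tf."""
    n_docs, df = len(docs), _doc_freq(docs)
    return [
        _normalize({
            t: math.log(tf + 1.0) * math.log(n_docs / df[t])
            for t, tf in Counter(doc).items()
        })
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical toy corpus: each document is a list of already-extracted keywords.
docs = [
    ["web", "keyword", "weight"],
    ["web", "document", "similarity"],
    ["keyword", "weight", "algorithm"],
]
vectors = tfc(docs)
# Documents 0 and 2 share two keywords, documents 0 and 1 share only one.
print(cosine(vectors[0], vectors[2]) > cosine(vectors[0], vectors[1]))
```

A web-specific scheme such as WKWA would presumably adjust these weights using where a keyword appears in the page (its representation features), which this sketch does not attempt to model.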


