
Tag-TextRank: A Webpage Keyword Extraction Method Based on Tags
Abstract Keyword extraction, the task of extracting representative keywords from a text, is widely used in text processing applications. This paper uses social tags, an information source that has attracted wide attention in recent years, to improve the quality of webpage keyword extraction. A statistical analysis of tag data shows that users often apply the same tags to multiple topically related webpages: over 90% of labeled webpages can find relevant webpages through their tag information. Building on this finding, the paper proposes a framework for tag-based keyword extraction and a concrete implementation, Tag-TextRank. As an extension of the classic keyword extraction method TextRank, Tag-TextRank computes term importance on a weighted term graph, where the edge weight for a term pair is estimated from statistics of the relevant documents introduced by a given tag of the target webpage. The final importance score of a term combines these tag-dependent scores. Because it draws on more documents to measure term relations, Tag-TextRank can estimate term importance more reliably. Experiments on a publicly available corpus show that Tag-TextRank outperforms TextRank on all evaluation metrics and generalizes well.
Source Journal of Computer Research and Development (计算机研究与发展; indexed in EI, CSCD, PKU Core), 2012, No. 11, pp. 2344-2351 (8 pages)
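Funding National Natural Science Foundation of China (60776797, 60873166); National Basic Research Program of China ("973" Program) (2007CB311103); National High Technology Research and Development Program of China ("863" Program) (2006AA010105)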
Keywords social annotation; tag; keyword extraction; webpage keyword extraction; TextRank
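The pipeline outlined in the abstract (per-tag edge weights, a TextRank pass per tag, then score fusion) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so several details here are assumptions: edge weights are plain co-occurrence counts within a sliding window, the target document is pooled with each tag's related documents, the damping factor is the conventional 0.85, and per-tag scores are fused by an unweighted average. All function names and the `docs_by_tag` input format are hypothetical.

```python
from collections import defaultdict


def cooccurrence_weights(docs, window=2):
    """Count co-occurrences of term pairs within `window` following tokens."""
    weights = defaultdict(float)
    for tokens in docs:
        for i, term in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if other != term:
                    weights[frozenset((term, other))] += 1.0
    return weights


def weighted_textrank(terms, weights, damping=0.85, iterations=50):
    """Weighted PageRank over an undirected graph restricted to `terms`."""
    neighbors = defaultdict(dict)
    for pair, w in weights.items():
        a, b = tuple(pair)
        if a in terms and b in terms:
            neighbors[a][b] = w
            neighbors[b][a] = w
    scores = {t: 1.0 for t in terms}
    for _ in range(iterations):
        new_scores = {}
        for t in terms:
            rank = 0.0
            for n, w in neighbors[t].items():
                # Each neighbor passes on its score in proportion to edge weight.
                rank += w / sum(neighbors[n].values()) * scores[n]
            new_scores[t] = (1 - damping) + damping * rank
        scores = new_scores
    return scores


def tag_textrank(target_tokens, docs_by_tag, window=2):
    """Rank the target document's terms using documents related through each tag.

    docs_by_tag maps each tag of the target page to a list of tokenized
    related documents (assumed input format).
    """
    terms = set(target_tokens)
    fused = defaultdict(float)
    for related_docs in docs_by_tag.values():
        # Assumption: the target document is pooled with the related documents.
        weights = cooccurrence_weights([target_tokens] + related_docs, window)
        for term, score in weighted_textrank(terms, weights).items():
            # Assumption: unweighted average across tags.
            fused[term] += score / len(docs_by_tag)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    target = ["keyword", "extraction", "tag", "graph"]
    related = {
        "nlp": [["keyword", "extraction", "method"], ["keyword", "extraction"]],
        "web": [["tag", "graph", "keyword"]],
    }
    for term, score in tag_textrank(target, related):
        print(f"{term}\t{score:.3f}")
```

Only terms of the target document are scored; the related documents affect the result solely through the edge weights, which is the role the abstract assigns to them. A tag-weighted fusion could replace the plain average if tag importance were known.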


