
Keyword Extraction Method Based on Density Clustering for Chinese News Web Pages (cited by: 2)
Abstract  Keywords play an important role in text clustering and classification, automatic summarization, and information retrieval. However, many news web pages on the Internet do not provide keywords, manual keyword annotation is costly, and most existing automatic keyword extraction algorithms depend on manually labeled training sets, which makes them hard to use in practice. Because keywords can be regarded as a set of words that are relatively important in an article and cohesively related to its topic, this paper proposes a keyword extraction method for Chinese news web pages based on density clustering. After word segmentation of a page, the words are clustered according to their co-occurrence information, and keywords that reflect the topic of the news are extracted on the basis of an analysis of the association between words. Experiments on a large number of randomly selected news web pages show that, compared with the plain TF/IDF method (the product of term frequency and inverse document frequency), the algorithm improves recall by 7.15% and precision by 7.075% on average.
Source  Journal of Guangxi Normal University: Natural Science Edition (《广西师范大学学报(自然科学版)》), indexed in CAS and the PKU Core list (北大核心), 2009, No. 1, pp. 201-204 (4 pages)
Funding  National Fund "Haiqing" (海青) Project (60828005)
Keywords  keyword extraction; word co-occurrence; clustering; natural language processing
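The abstract describes clustering segmented words by their co-occurrence, but it does not give the exact procedure, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes the page has already been segmented into sentences of word tokens, treats two words as neighbours when they share enough sentences, and grows clusters DBSCAN-style from densely connected words. The function names, the sentence-level co-occurrence window, and the thresholds `min_cooc` and `min_pts` are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how many sentences each unordered word pair shares."""
    counts = defaultdict(int)
    for words in sentences:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


def density_cluster(sentences, min_cooc=2, min_pts=2):
    """DBSCAN-style clustering over a word co-occurrence graph.

    Two words are neighbours when they share at least `min_cooc` sentences.
    A word is a core word when it has at least `min_pts` neighbours.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    neighbours = defaultdict(set)
    for (a, b), n in cooccurrence_counts(sentences).items():
        if n >= min_cooc:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, assigned = [], set()
    for seed in sorted(neighbours):
        if seed in assigned or len(neighbours[seed]) < min_pts:
            continue  # already clustered, or not dense enough to seed a cluster
        cluster, queue = set(), deque([seed])
        while queue:
            word = queue.popleft()
            if word in assigned:
                continue
            assigned.add(word)
            cluster.add(word)
            # Only core words expand the cluster; border words just join it.
            if len(neighbours[word]) >= min_pts:
                queue.extend(neighbours[word] - assigned)
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    sample = [
        ["flood", "rescue", "river"],
        ["flood", "rescue", "river", "team"],
        ["flood", "river", "weather"],
        ["stock", "market"],
        ["rescue", "team", "flood"],
    ]
    print([sorted(c) for c in density_cluster(sample)])
    # prints [['flood', 'rescue', 'river', 'team']]
```

In this toy input, "weather", "stock", and "market" never co-occur often enough with other words, so they are left out as noise, while the densely connected words form a single candidate keyword cluster. A real system would still need a Chinese word segmenter, stop-word filtering, and a ranking step within each cluster, none of which are specified in the abstract.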

