Automatic text classification based on sentence correlation (cited by: 4)

Text classification based on sentence correlation
Abstract This paper presents TCSC, an automatic text classification model based on sentence correlation. The model uses training documents to update its category corpus incrementally and automatically. It computes the category correlation of each sentence from the sentence's position weight and its corpus item weights, assembles these values into a sentence correlation matrix, and classifies documents with that matrix. In the classification phase the model avoids word segmentation of the text being classified, which matters especially for Chinese text, and it reduces the effect of words with multiple meanings. In text classification experiments the model reached recall and precision above 86%, and performance can improve further as the corpus continues to be trained and adjusted. The model is also simple to implement.
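Source Journal of University of Science and Technology of China (JUSTC), 2006, No. 5, pp. 540-545 (6 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding Supported by the National Natural Science Foundation of China (69835010)
Keywords text classification; corpus; sentence correlation matrix; sentence weight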
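
The abstract names the pipeline (incremental category corpus, position weight, corpus item weight, sentence correlation matrix) but gives no formulas. The following is a minimal sketch of that pipeline under assumptions that are not taken from the paper: corpus items are approximated by character bigrams, the position weight is 1.5 for the first and last sentence and 1.0 otherwise, and the item weight is an in-category relative frequency divided by the number of categories containing the item. All names (`CategoryCorpus`, `correlation_matrix`, `classify`) are hypothetical.

```python
import re
from collections import Counter, defaultdict


def split_sentences(text):
    # Split on common Chinese and Western sentence terminators.
    parts = re.split(r"[。!?!?.;;\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def char_bigrams(sentence):
    # Assumption: character bigrams stand in for corpus items,
    # so no word segmentation is needed.
    chars = [c for c in sentence if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class CategoryCorpus:
    def __init__(self):
        self.counts = defaultdict(Counter)  # category -> item counts

    def update(self, category, text):
        # Incremental update: each training document only adds counts.
        for sentence in split_sentences(text):
            self.counts[category].update(char_bigrams(sentence))

    def item_weight(self, category, item):
        # Assumed weight: relative frequency within the category,
        # divided by the number of categories that contain the item.
        items = self.counts.get(category, Counter())
        total = sum(items.values())
        spread = sum(1 for c in self.counts.values() if item in c)
        if total == 0 or spread == 0:
            return 0.0
        return items[item] / total / spread


def position_weight(index, n_sentences):
    # Assumed weight: first and last sentences count more.
    return 1.5 if index in (0, n_sentences - 1) else 1.0


def correlation_matrix(text, corpus, categories):
    # Rows are sentences, columns are categories.
    sentences = split_sentences(text)
    matrix = []
    for i, sentence in enumerate(sentences):
        pw = position_weight(i, len(sentences))
        items = char_bigrams(sentence)
        matrix.append([
            pw * sum(corpus.item_weight(cat, it) for it in items)
            for cat in categories
        ])
    return matrix


def classify(text, corpus):
    # Pick the category with the largest column sum; None if no sentences.
    categories = list(corpus.counts)
    matrix = correlation_matrix(text, corpus, categories)
    if not matrix or not categories:
        return None
    totals = [sum(row[j] for row in matrix) for j in range(len(categories))]
    return categories[totals.index(max(totals))]


corpus = CategoryCorpus()
corpus.update("sports", "球队赢得了比赛。球员进球了。")
corpus.update("finance", "股票市场上涨。银行降低利率。")
print(classify("球队在比赛中进球", corpus))
```

Calling `corpus.update` again with newly labeled documents changes later classifications without retraining from scratch, which is the incremental behavior the abstract describes; the specific weights here are placeholders that the paper's own definitions would replace.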

