基于信息熵的TFIDF文本分类特征选择算法研究 (cited by: 5)

Feature Selection Model of TFIDF Text Categorization Based on Information Entropy
Abstract: Selecting feature weights is a foundational step in text categorization. The paper first reviews the traditional TFIDF feature selection algorithm and analyzes the characteristics of text categorization techniques in detail. It then builds an improved TFIDF model based on information entropy theory and verifies the model's effectiveness on real engineering data. Theoretical analysis and the worked examples indicate that the improved algorithm compensates for a shortcoming of traditional TFIDF, which ignores how a term is distributed across text classes. The improved weights better reflect the importance of feature terms and can therefore improve the accuracy of text categorization.
Source: Journal of Hubei Minzu University (Natural Science Edition) (《湖北民族学院学报(自然科学版)》), CAS, 2008, No. 4, pp. 401-404, 409 (5 pages)
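Funding: Natural Science Foundation of Chongqing (CSTC 2006BB2422); Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ060414); Doctoral Start-up Fund of Chongqing Jiaotong University (07-01-12)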
Keywords: text categorization; feature lemma (feature term); information entropy; TFIDF; feature selection
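
The abstract does not state the improved formula, so the following is only an illustrative sketch of the general idea it describes: scaling a TF-IDF weight by how unevenly a term is spread across classes. It is not the authors' model. The function name `entropy_weighted_tfidf`, the factor `1 - H/H_max`, and the `+ 1.0` IDF smoothing are all assumptions chosen for the example.

```python
import math
from collections import Counter


def entropy_weighted_tfidf(docs, labels):
    """Hypothetical sketch: TF-IDF scaled by a class-distribution entropy factor.

    docs   -- list of token lists, one per document
    labels -- class label of each document (same length as docs)
    Returns one dict per document mapping term -> weight.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    df = Counter()   # number of documents containing each term
    class_tf = {}    # term -> Counter(class label -> occurrences)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
        for term in tokens:
            class_tf.setdefault(term, Counter())[label] += 1

    # Entropy of each term's distribution over classes, normalised to [0, 1].
    # Assumption: a term spread evenly over all classes (maximum entropy)
    # carries no class information, so its factor is 0.
    max_entropy = math.log(n_classes) if n_classes > 1 else 1.0
    factor = {}
    for term, counts in class_tf.items():
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        factor[term] = 1.0 - h / max_entropy

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            # relative term frequency * smoothed IDF * entropy factor
            term: (count / len(tokens))
            * (math.log(n_docs / df[term]) + 1.0)
            * factor[term]
            for term, count in tf.items()
        })
    return weights


docs = [
    ["goal", "match", "the"],
    ["match", "team", "the"],
    ["stock", "market", "the"],
    ["market", "price", "the"],
]
labels = ["sports", "sports", "finance", "finance"]
weights = entropy_weighted_tfidf(docs, labels)
```

In this toy corpus, "the" occurs equally often in both classes, so its entropy factor drives its weight to zero. "goal" and "match" appear only in the sports class and keep their full TF-IDF weight. Plain TF-IDF would not make this distinction on the basis of class distribution, which is the gap the abstract attributes to the traditional algorithm.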