Abstract
Choosing feature weights is a basic step in text categorization, and TFIDF is one of the most common methods for weighting document features. However, its simple term frequency and inverse document frequency formula tends to overlook features that occur frequently within a single class, so the predictive power of individual features is weakened by the others. This paper proposes an improved feature selection algorithm (I-TFIDF) that better reflects the weight of feature terms and thereby improves classification accuracy. Experimental results show that I-TFIDF performs better than the traditional TFIDF algorithm.
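For orientation, the weakness described in the abstract can be seen in a minimal sketch of classic TFIDF, assuming the common textbook form tf(t, d) * log(N / df(t)). This is not the I-TFIDF formula, which the abstract does not give. The function name `tfidf` and the toy corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Classic TFIDF weights: tf(t, d) * log(N / df(t)).

    Assumes the common textbook form; this is not the I-TFIDF variant.
    docs is a list of non-empty token lists.
    """
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: documents 0-1 are "sports", documents 2-3 are "finance".
docs = [
    ["match", "goal", "team", "goal"],
    ["match", "team", "coach"],
    ["stock", "market", "price"],
    ["stock", "bank", "price"],
]
weights = tfidf(docs)
```

In this toy corpus, "match" occurs in every sports document and in no finance document, yet in document 1 it receives a lower weight than "coach", which occurs only once in the whole corpus. The inverse document frequency factor penalizes a term for being spread across documents even when all of those documents belong to one class.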
Source
Journal of Guizhou Educational College (Social Science Edition) (《贵州教育学院学报》)
2009, No. 6, pp. 54-56 (3 pages)
Keywords
text categorization
feature terms
feature selection
TFIDF