Abstract
Selecting feature weights is a fundamental step in text categorization. After a detailed analysis of the characteristics of text categorization and of the traditional TFIDF feature selection algorithm, an improved TF_IDF algorithm model is built on information entropy theory, and its effectiveness is verified on real engineering data. Theoretical analysis and the worked examples show that the algorithm compensates for the traditional TFIDF algorithm's failure to account for how terms are distributed across text classes, reflects the weight of feature terms better, and can therefore effectively improve the precision of text categorization.
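The abstract does not give the improved formula, so the following is only a minimal sketch of the general idea: scaling a standard TF-IDF weight by a factor derived from the entropy of a term's distribution across classes. The factor `1 - H / log2(|C|)`, the function names, and the toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf(term, tokens, docs):
    """Plain TF-IDF: relative term frequency times log inverse document frequency."""
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf


def class_entropy_factor(term, docs, labels):
    """Assumed factor: 1 minus the normalized entropy of the term's counts over classes.

    Returns 1.0 when the term occurs in a single class only and 0.0 when it is
    spread evenly over all classes.
    """
    per_class = Counter()
    for tokens, label in zip(docs, labels):
        per_class[label] += tokens.count(term)
    total = sum(per_class.values())
    n_classes = len(set(labels))
    if total == 0 or n_classes < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in per_class.values() if c > 0)
    return 1.0 - entropy / math.log2(n_classes)


def entropy_weighted_tfidf(term, tokens, docs, labels):
    """Hypothetical combined weight: TF-IDF scaled by the class-entropy factor."""
    return tfidf(term, tokens, docs) * class_entropy_factor(term, docs, labels)


# Toy corpus (invented for illustration): "goal" occurs only in sports documents,
# while "report" occurs once in each class.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "report"],
    ["stock", "market", "report"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

w_goal = entropy_weighted_tfidf("goal", docs[1], docs, labels)
w_report = entropy_weighted_tfidf("report", docs[1], docs, labels)
```

In this toy corpus "goal" and "report" have the same term frequency and document frequency, so plain TF-IDF cannot tell them apart. The assumed entropy factor separates them because only "goal" is concentrated in one class.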
Source
《湖北民族学院学报(自然科学版)》
CAS
2008, No. 4, pp. 401-404, 409 (5 pages in total)
Journal of Hubei Minzu University(Natural Science Edition)
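Funding
Natural Science Foundation of Chongqing (CSTC 2006BB2422)
Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ060414)
Chongqing Jiaotong University Doctoral Start-up Fund (07-01-12)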
Keywords
text categorization
feature term
information entropy
TFIDF
feature selection