Abstract
The TF-IDF feature weighting algorithm, widely used in text categorization, accounts for neither the purity of a feature term nor the known category labels of the documents. Combining the Gini index principle with TF-IDF, this paper proposes an improved feature weighting algorithm based on the Gini index, which incorporates both the purity of each feature term and the known category information when computing feature weights. Comparative experiments between the two weighting algorithms were then run repeatedly under SVM and kNN classifiers with different numbers of selected feature terms. The results show that the improved Gini index feature weighting algorithm performs better than TF-IDF.
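To make the idea of a purity-aware weight concrete, here is a minimal sketch. The abstract does not state the paper's formula, so the code assumes one common formulation of the Gini index for text, Gini(t) = Σ_c P(c|t)², and uses it in place of the IDF factor. The function names `gini_purity` and `term_weights`, the length-normalized term frequency, and the estimate of P(c|t) from document counts are all illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter, defaultdict


def gini_purity(class_counts):
    """Assumed purity measure: sum over classes of P(c|t)^2.

    Equals 1.0 when a term occurs in only one class and 1/k when it is
    spread evenly over k classes.
    """
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) ** 2 for n in class_counts.values())


def term_weights(docs, labels):
    """Return (tfidf, tfgini): one {term: weight} dict per document.

    docs   -- list of token lists
    labels -- known category of each document
    """
    n_docs = len(docs)
    doc_freq = Counter()              # term -> number of docs containing it
    class_freq = defaultdict(Counter)  # term -> class -> number of docs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_freq[term][label] += 1

    tfidf, tfgini = [], []
    for tokens in docs:
        counts = Counter(tokens)
        length = len(tokens)
        # Classic TF-IDF: ignores which classes the term appears in.
        tfidf.append({t: (c / length) * math.log(n_docs / doc_freq[t])
                      for t, c in counts.items()})
        # Hypothetical variant: TF scaled by the term's class purity.
        tfgini.append({t: (c / length) * gini_purity(class_freq[t])
                       for t, c in counts.items()})
    return tfidf, tfgini
```

Under these assumptions, a term concentrated in a single category receives the maximum purity factor of 1.0, while a term spread evenly across categories is down-weighted. Estimating P(c|t) from raw document counts is biased when the categories are of unequal size, so a practical version would likely normalize by class size.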
Source
Computer & Digital Engineering (《计算机与数字工程》)
2010, No. 12, pp. 8-13 (6 pages)