Abstract
Feature selection is the first problem to be solved in text categorization. This paper briefly analyzes several classic feature selection methods and summarizes their shortcomings, then presents a category correlation method. Cross entropy is introduced into rough set theory to obtain a new attribute reduction algorithm, and the two techniques are combined into a comprehensive feature selection method. The comprehensive method first applies the category correlation method for preliminary feature selection, filtering out terms to reduce the sparsity of the feature space, and then applies the attribute reduction algorithm to eliminate redundancy, yielding a more representative feature subset. Experimental results show that the proposed feature selection method performs well.
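The record does not give the paper's exact formulas, but cross-entropy feature scoring for text categorization is commonly defined as CE(t) = P(t) · Σ_c P(c|t) · log(P(c|t)/P(c)), which rewards terms concentrated in few classes. The sketch below is an illustration of that standard scoring under this assumption, not the authors' implementation; the corpus, function name, and estimation from document frequencies are all illustrative choices.

```python
import math
from collections import Counter

def cross_entropy_scores(docs, labels):
    """Score each term t by CE(t) = P(t) * sum_c P(c|t) * log(P(c|t)/P(c)).
    docs: list of token lists; labels: parallel list of class labels.
    Probabilities are estimated from document frequencies."""
    n = len(docs)
    class_count = Counter(labels)
    p_class = {c: k / n for c, k in class_count.items()}
    # Document frequency of each term, overall and per class.
    df = Counter()
    df_class = Counter()
    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            df_class[(t, c)] += 1
    scores = {}
    for t, d in df.items():
        p_t = d / n
        s = 0.0
        for c in class_count:
            p_c_given_t = df_class[(t, c)] / d
            if p_c_given_t > 0:  # skip zero terms in the sum
                s += p_c_given_t * math.log(p_c_given_t / p_class[c])
        scores[t] = p_t * s
    return scores

# Toy bilingual-free corpus: two classes, four short documents.
docs = [["ball", "goal"], ["goal", "match"], ["stock", "price"], ["price", "market"]]
labels = ["sport", "sport", "finance", "finance"]
scores = cross_entropy_scores(docs, labels)
```

Terms that occur often and only in one class (here "goal", "price") receive higher scores than rarer class-pure terms ("ball", "match"), so ranking by this score and keeping the top-k terms performs the kind of preliminary filtering the abstract describes before attribute reduction.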
Source
《郑州大学学报(理学版)》 (Journal of Zhengzhou University: Natural Science Edition)
Indexed by CAS; Peking University Core Journal list (北大核心)
2010, No. 2, pp. 61-65 (5 pages)
Funding
Sichuan Province Science and Technology Plan Project (No. 2008GZ0003)
Sichuan Provincial Department of Science and Technology Key Research Project (No. 07GG006-014)
Keywords
text categorization
feature selection
category correlation
cross entropy
attribute reduction