Abstract
To reduce the effect of irrelevant information on text classification accuracy, a preprocessing algorithm based on the minimal class difference is proposed. The algorithm analyzes how each text feature is distributed across classes and divides the features into three types. Based on the difference in each feature's distribution among classes, it keeps the single-class and multi-class features that contribute to classification and filters out the general features, whose class distribution differences are small. Experimental results show that preprocessing with the new algorithm yields clearly higher classification accuracy than preprocessing with algorithms such as information gain, mutual information, and cross entropy.
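The filtering idea in the abstract can be illustrated with a minimal sketch. The abstract does not state how the class difference is measured or where the cut-off lies, so everything concrete here is an assumption: the difference is taken as the spread (maximum minus minimum) of a feature's per-class document frequency, `min_diff` is a hypothetical threshold, and the function names are invented for illustration. This is not the paper's actual formula.

```python
from collections import Counter, defaultdict


def class_frequencies(docs):
    """Map each feature to {label: fraction of that label's documents containing it}."""
    docs_per_class = Counter(label for _, label in docs)
    counts = defaultdict(Counter)
    for tokens, label in docs:
        for tok in set(tokens):  # count each feature once per document
            counts[tok][label] += 1
    return {
        tok: {c: counts[tok][c] / n for c, n in docs_per_class.items()}
        for tok in counts
    }


def filter_features(docs, min_diff=0.3):
    """Keep single-class and multi-class features; drop general ones.

    ASSUMPTION: "class difference" is approximated by max - min of the
    per-class frequencies, and min_diff is a hypothetical threshold.
    """
    kept = {}
    for tok, freqs in class_frequencies(docs).items():
        present = [c for c, f in freqs.items() if f > 0]
        diff = max(freqs.values()) - min(freqs.values())
        if len(present) == 1:
            kept[tok] = "single-class"
        elif diff >= min_diff:
            kept[tok] = "multi-class"
        # Otherwise the feature is "general" (small class difference) and is filtered.
    return kept


# Toy corpus, invented for illustration only.
docs = [
    (["goal", "match", "the"], "sports"),
    (["goal", "team", "the"], "sports"),
    (["stock", "market", "the"], "finance"),
    (["stock", "market", "the"], "finance"),
    (["chip", "market", "the"], "tech"),
    (["chip", "team", "the"], "tech"),
]
kept = filter_features(docs)
```

In this toy corpus, "the" occurs in every document of every class, so its spread is zero and it is dropped as a general feature, while "goal" (one class only) and "market" (two of three classes) are kept.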
Source
《电子学报》
EI
CAS
CSCD
PKU Core Journal
2003, No. 11, pp. 1750-1753 (4 pages)
Acta Electronica Sinica
Funding
National Natural Science Foundation of China (No. 60272051)
Natural Science Foundation of Hunan Province (No. 01jjy1007)
Keywords
information gain
mutual information
naive Bayesian