Abstract
The weighting factor applied to conditional probabilities in Naive Bayes classification is improved by introducing category discrimination, a measure of how much a word contributes to classification. The new weight is the product of a word's frequency and its category discrimination. It emphasizes highly discriminative words, reduces the influence of common words, and preserves the positive contribution of the frequency of highly discriminative words. On a test set of about 30,000 documents (15 top-level categories, 244 subcategories), the improved weighting outperformed the original weighting method: the micro-averaged score (MicroF1) rose by about 18.9% for top-level categories and by about 7.6% for subcategories.
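The weighting described in the abstract (frequency multiplied by category discrimination) can be illustrated with a minimal Python sketch. The abstract does not give the formula for category discrimination, so the `discrimination` function below is an assumed stand-in (the relative spread of the smoothed P(w|c) across categories), not the authors' definition. The function names, the Laplace smoothing, and the toy data are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (tokens, label). Returns priors, smoothed P(w|c), vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    cond = {}
    for c in class_docs:
        # Laplace (add-one) smoothing, an assumption of this sketch
        total = sum(word_counts[c].values()) + len(vocab)
        cond[c] = {w: (word_counts[c][w] + 1) / total for w in vocab}
    return priors, cond, vocab


def discrimination(word, cond):
    """ASSUMED definition: relative spread of P(w|c) across categories.

    A word that is equally likely in every category scores 0; a word
    concentrated in one category scores close to 1.
    """
    probs = [cond[c][word] for c in cond]
    return (max(probs) - min(probs)) / max(probs)


def classify(tokens, priors, cond, vocab):
    """Pick the category with the highest weighted log-probability score."""
    freqs = Counter(t for t in tokens if t in vocab)
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w, f in freqs.items():
            weight = f * discrimination(w, cond)  # frequency x discrimination
            score += weight * math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)


# Toy corpus: "the" occurs equally in both categories, so its weight is 0.
DOCS = [
    (["the", "ball", "goal"], "sports"),
    (["the", "match", "goal"], "sports"),
    (["the", "stock", "bank"], "finance"),
    (["the", "bank", "loan"], "finance"),
]
PRIORS, COND, VOCAB = train(DOCS)
```

Under this assumed definition, a common word such as "the" contributes nothing to the score however often it occurs, while repeated occurrences of a discriminative word such as "goal" add weight in proportion to their count.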
Source
《暨南大学学报(自然科学与医学版)》
CAS
CSCD
PKU Core Journal (北大核心)
2007, No. 1, pp. 48-51 (4 pages)
Journal of Jinan University (Natural Science & Medicine Edition)
Funding
Ministry of Education "National Language Resources Monitoring" project (L2004-01-01-04)