Abstract
This paper analyzes the factors that limit the classification accuracy of the traditional CHI statistical method and removes the cases in which a feature term is negatively correlated with a category. The improved method is also applied to adjusting feature-term weights, which noticeably improves classification performance. In addition, dispersion, concentration, and frequency information are introduced into the improved method, which raises its classification precision on corpora with unevenly distributed categories. Experimental results confirm the effectiveness and feasibility of the improved CHI approach.
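The abstract does not reproduce the paper's formulas, so the following is only a minimal sketch of the first idea it describes: computing the standard chi-square score for a term and a category from a 2x2 document-count table, and discarding the score when the term is negatively correlated with the category. The function names and the `a`, `b`, `c`, `d` count convention are assumptions made for illustration, and the dispersion, concentration, and frequency factors are deliberately left out because the abstract does not define them.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term t and a category k.

    a: documents in k that contain t
    b: documents outside k that contain t
    c: documents in k that do not contain t
    d: documents outside k that do not contain t
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def positive_chi_square(a, b, c, d):
    """Chi-square score that keeps only positive term-category correlation.

    Illustrative assumption: the sign of (a*d - b*c) is used to detect
    negative correlation, in which case the score is set to zero.
    """
    if a * d - b * c <= 0:
        return 0.0
    return chi_square(a, b, c, d)


if __name__ == "__main__":
    # Term concentrated inside the category: score is kept.
    print(positive_chi_square(40, 10, 10, 40))
    # Term concentrated outside the category: plain CHI scores it just as
    # highly, but the filtered variant drops it.
    print(chi_square(10, 40, 40, 10), positive_chi_square(10, 40, 40, 10))
```

Because the plain statistic squares `a*d - b*c`, it cannot distinguish a term that signals membership in a category from one that signals absence; checking the sign before squaring is one simple way to make that distinction.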
Source
《计算机工程与应用》
CSCD
Peking University Core Journals (北大核心)
2011, No. 4, pp. 128-130, 194 (4 pages)
Computer Engineering and Applications
Funding
Aeronautical Science Foundation project (No. 2006ZC31001)
Keywords
text classification
feature selection
CHI statistic
weight adjustment
dispersion
concentration
frequency