Abstract
A word can have high mutual information with several categories at once, which blurs the category information it carries. To mitigate this problem, an improved feature selection method is proposed. The method uses the differences in how a feature word is represented across categories to build a set of domain feature words (the words that best express the information of a domain), and then uses them to reselect features from the feature set built with mutual information. This both reduces the feature dimensionality and makes the feature representation more effective. In addition, a text similarity calculation system is designed, in which the traditional tf-idf weighting is improved. Experimental results show that the improved feature selection method outperforms traditional mutual information and that the system performs well.
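The reselection idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption: the score is the pointwise form of mutual information commonly used in text classification, MI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from document counts, and the reselection rule (keep a word only if its best category beats the runner-up by at least `min_gap`) is a hypothetical stand-in for the paper's domain feature word criterion. The function names, the `min_gap` parameter, and the toy corpus are likewise illustrative.

```python
import math
from collections import Counter, defaultdict


def mutual_information(docs):
    """Return {term: {category: MI}} for (category, tokens) pairs.

    Assumed estimate: MI(t, c) = log(P(t, c) / (P(t) * P(c))),
    with probabilities taken from document frequencies.
    """
    n_docs = len(docs)
    cat_count = Counter(cat for cat, _ in docs)
    term_count = Counter()
    joint_count = defaultdict(Counter)
    for cat, tokens in docs:
        for term in set(tokens):  # count each term once per document
            term_count[term] += 1
            joint_count[term][cat] += 1
    return {
        term: {
            cat: math.log(n * n_docs / (term_count[term] * cat_count[cat]))
            for cat, n in per_cat.items()
        }
        for term, per_cat in joint_count.items()
    }


def domain_feature_words(scores, min_gap=0.5):
    """Hypothetical reselection rule, not the paper's criterion.

    Keep a term only if its highest MI exceeds its second-highest MI
    by at least min_gap; terms seen in a single category are always kept.
    Returns {term: category it is assigned to}.
    """
    selected = {}
    for term, per_cat in scores.items():
        ranked = sorted(per_cat.items(), key=lambda kv: kv[1], reverse=True)
        best_cat, best = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
        if best - runner_up >= min_gap:
            selected[term] = best_cat
    return selected


# Toy corpus (made up for illustration only).
docs = [
    ("sports", ["match", "team", "win"]),
    ("sports", ["team", "score", "news"]),
    ("finance", ["stock", "market", "news"]),
    ("finance", ["market", "bank", "win"]),
]
scores = mutual_information(docs)
selected = domain_feature_words(scores)
```

In this toy corpus, words that occur equally often in both categories score the same in each and are dropped by the reselection step, while words confined to one category are kept and assigned to it.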
Source
《计算机工程与设计》
CSCD
PKU Core Journal (北大核心)
2012, No. 11, pp. 4338-4342 (5 pages)
Computer Engineering and Design
Funding
Guangxi Natural Science Foundation (2011GXNSFA018158)
Guangxi Scientific Research and Technology Development Program (桂科攻11107006-45, 桂科攻0996028)
Keywords
mutual information
text classification
feature selection
domain feature word
text similarity