Abstract
The sheer volume of text data lowers both the accuracy and the real-time performance of text sentiment classification. To address this problem, a vector space model (VSM) clustering algorithm based on mixed feature selection is proposed. First, information gain (IG) and mutual information (MI) are combined with different part-of-speech features of the documents to generate mixed feature vectors. Next, the degree of difference between document VSMs is computed and the VSMs are clustered accordingly, yielding cluster-center vectors that are used to reconstruct the VSM of the document set. Finally, a support vector machine (SVM) determines the sentiment of each document. Simulation results show that the algorithm effectively reduces the dimensionality and quantity of the document sample features and speeds up SVM training. The results also show that the combination of part-of-speech features and extraction algorithm has a considerable effect on classification accuracy.
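The feature-selection step named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of information gain and pointwise mutual information over binary term presence, which may differ from the exact formulation in the paper. The toy corpus, the function names, and the representation of documents as token sets are assumptions made for illustration. Part-of-speech filtering, clustering, and SVM training are not shown.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]."""
    classes = sorted(set(labels))
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy([labels.count(c) for c in classes])
    h_cond = (n_with / n) * entropy([with_term[c] for c in classes]) \
        + (n_without / n) * entropy([without_term[c] for c in classes])
    return h_class - h_cond


def mutual_information(docs, labels, term, cls):
    """Pointwise MI(t, c) = log2( P(t, c) / (P(t) P(c)) )."""
    n = len(docs)
    n_t = sum(1 for d in docs if term in d)
    n_c = labels.count(cls)
    n_tc = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log2(n_tc * n / (n_t * n_c))


# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"good", "plot"}, {"good", "acting"}, {"bad", "plot"}, {"bad", "acting"}]
labels = ["pos", "pos", "neg", "neg"]

vocabulary = sorted(set().union(*docs))
# Rank terms by IG; the highest-scoring terms would be kept as features.
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy corpus the sentiment words "good" and "bad" separate the classes perfectly and rank above the topic words "plot" and "acting", which carry no class information.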
Source
《中北大学学报(自然科学版)》
CAS
PKU Core Journal (北大核心)
2014, No. 1, pp. 41-45 (5 pages)
Journal of North University of China (Natural Science Edition)
Funding
Gansu Provincial Department of Education Fund Project (1113-01)
Gansu Lianhe University High-Level Research Achievement Project (2011GSP01)
Keywords
text sentiment classification
vector space model
K-means clustering
support vector machine
information gain
mutual information