Abstract
Sentiment classification is a classification technique of considerable practical value: it can reduce, to some extent, the clutter of online review information and help users locate the reviews they need. Most existing work on sentiment classification targets English reviews, and comparatively little has been done for Chinese. In particular, the effectiveness of different supervised learning methods, and the influence of text feature representation and feature selection on classification performance, still need to be studied. This paper uses n-grams and nouns, verbs, adjectives and adverbs as alternative text representation features; mutual information, information gain, the CHI statistic and document frequency as alternative feature selection methods; and the centroid classifier, KNN, Winnow, Naive Bayes and SVM as alternative classification methods. Chinese sentiment classification experiments were run with different numbers of features and different training-set sizes, and the results were compared. The comparison shows that sentiment classification performs well when bigram features, information gain feature selection and an SVM classifier are used with a sufficiently large training set and a suitably chosen number of features.
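As an illustration of two components named in the abstract, the following is a minimal sketch of bigram feature extraction followed by information gain scoring. It is not the authors' implementation. It assumes character-level bigrams (the abstract does not say whether bigrams are over characters or words), binary presence features, and a tiny made-up set of labelled reviews. The function names are hypothetical, and the SVM training step is omitted.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the set of adjacent character pairs in text (presence features)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Score each bigram t by IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]."""
    classes = sorted(set(labels))
    n = len(docs)
    class_totals = Counter(labels)
    present = {}  # bigram -> Counter of labels of the documents containing it
    for doc, label in zip(docs, labels):
        for gram in char_bigrams(doc):
            present.setdefault(gram, Counter())[label] += 1
    base = entropy([class_totals[c] for c in classes])
    scores = {}
    for gram, with_t in present.items():
        n_t = sum(with_t.values())
        without_t = [class_totals[c] - with_t[c] for c in classes]
        conditional = (n_t / n) * entropy([with_t[c] for c in classes]) \
            + ((n - n_t) / n) * entropy(without_t)
        scores[gram] = base - conditional
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring bigrams (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


# Made-up toy reviews for illustration only (1 = positive, 0 = negative).
docs = ["很好用", "非常好", "很差劲", "非常差"]
labels = [1, 1, 0, 0]
scores = information_gain(docs, labels)
top = select_top_k(scores, 3)
```

In this toy data the bigram "非常" occurs once in each class, so it carries no class information and is not selected, while bigrams that occur in only one class score higher. The selected bigrams would then serve as the feature vocabulary for a classifier such as an SVM.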
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2007, No. 6, pp. 88-94, 108 (8 pages)
Funding
Supported by the National Basic Research Program of China ("973" Program), Grant No. 2004CB318109
Keywords
computer application
Chinese information processing
sentiment classification
text categorization
language model