Abstract
Cross-domain text sentiment classification has become a research focus in natural language processing. Traditional active learning cannot exploit the information shared between domains, and the bag-of-words model cannot filter out words unrelated to sentiment classification. To address both problems, this paper proposes a cross-domain text sentiment classification method based on progressively optimizing the classification model. First, the sentiment words common to the source and target domains are selected as features, a classification model is trained on the labeled source domain, and the model assigns initial category labels to the target domain; the high-confidence texts are selected as the initial seed samples of the classification model. Second, to speed up optimization of the target-domain classification model, low-confidence texts are selected at each iteration for expert annotation, and the annotated results are added to the training set together with the high-confidence texts. Finally, the feature set is dynamically extracted from the training set according to a sentiment dictionary, extraction rules for evaluation-word collocations, and auxiliary feature words. Experimental results show that the method not only improves cross-domain sentiment classification effectively but also reduces the cost of manual annotation to some extent.
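The iterative procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the naive Bayes classifier, the names `train_nb`, `predict`, `adapt` and `oracle`, and the parameters `hi`, `k` and `rounds` are all assumptions, and feature re-extraction is reduced to a sentiment-lexicon lookup (the collocation rules and auxiliary feature words are omitted).

```python
import math
from collections import Counter

def train_nb(docs, labels, features):
    """Multinomial naive Bayes with add-one smoothing, restricted to `features`."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    loglik = {}
    for c in classes:
        counts = Counter(w for d, y in zip(docs, labels) if y == c
                         for w in d if w in features)
        total = sum(counts.values()) + len(features)
        loglik[c] = {w: math.log((counts[w] + 1) / total) for w in features}
    return prior, loglik

def predict(model, doc):
    """Return (label, confidence); confidence is the largest posterior probability."""
    prior, loglik = model
    scores = {c: prior[c] + sum(loglik[c][w] for w in doc if w in loglik[c])
              for c in prior}
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return max(scores, key=scores.get), 1.0 / z

def adapt(source_docs, source_labels, target_docs, lexicon, oracle,
          hi=0.8, k=1, rounds=5):
    """Hypothetical driver: `oracle(i)` stands in for the human annotator."""
    # Step 1: train on the source domain using sentiment words shared by both domains.
    src_vocab = {w for d in source_docs for w in d}
    tgt_vocab = {w for d in target_docs for w in d}
    model = train_nb(source_docs, source_labels, lexicon & src_vocab & tgt_vocab)

    # Step 2: high-confidence target predictions become the initial seed samples.
    labeled = {}
    for i, doc in enumerate(target_docs):
        label, conf = predict(model, doc)
        if conf >= hi:
            labeled[i] = label

    def refit():
        # Simplified dynamic feature extraction: lexicon words in the training set.
        docs = [target_docs[i] for i in labeled]
        feats = lexicon & {w for d in docs for w in d}
        return train_nb(docs, list(labeled.values()), feats)

    # Step 3: each round, query the k least confident texts and
    # pseudo-label the remaining texts that reach the confidence threshold.
    queries = 0
    for _ in range(rounds):
        pool = [i for i in range(len(target_docs)) if i not in labeled]
        if not pool:
            break
        if labeled:
            model = refit()
        preds = {i: predict(model, target_docs[i]) for i in pool}
        ranked = sorted(pool, key=lambda i: preds[i][1])
        for i in ranked[:k]:
            labeled[i] = oracle(i)
            queries += 1
        for i in ranked[k:]:
            if preds[i][1] >= hi:
                labeled[i] = preds[i][0]
    if labeled:
        model = refit()
    return model, labeled, queries
```

Documents are token lists and `lexicon` is a set of sentiment words. The function returns the final model, the labels collected for the target domain, and the number of annotation queries, which serves as a simple proxy for the annotation cost the abstract refers to.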
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2016, Issue 7, pp. 234-239 (6 pages)
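Funding
National Natural Science Foundation of China (61175067, 61272095, 60875040)
National High Technology Research and Development Program of China (863 Program) (2015AA015407)
Shanxi Province Science and Technology Key Research Project (20110321027-02)
Shanxi Province Research Project for Returned Overseas Scholars (2013-014)
Shanxi Province Science and Technology Infrastructure Platform Construction Project (2015091001-0102)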
Keywords
Sentiment classification
Cross-domain
Classification model
Feature extraction
Confidence