Abstract
Short text classification often suffers from high feature dimensionality, sparse features, and poor classification accuracy. Feature extension is an effective remedy, but it makes classification considerably less efficient. To improve both the accuracy and the efficiency of short text classification, this paper proposes an association-rule-based feature extension and classification method for short text on the Spark platform. First, given a background corpus, the method supplements the features of the original short texts by mining association rules and their confidences. Second, for the classification step, it introduces a cascade support vector machine (SVM) algorithm based on distance selection. Finally, the feature extension and classification algorithms are designed as distributed algorithms on Spark to speed up short text processing. Experiments show that, on large data sets, feature extension and classification on a Spark cluster run about 4 times as efficiently as on a traditional single machine, and that classification accuracy improves by about 15% on average over conventional classification, of which feature extension contributes 10% and classification optimization 5%.
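The feature extension step can be illustrated with a minimal single-machine sketch. It assumes one-to-one rules `a -> b` mined from a tokenized background corpus, with confidence computed as count(a, b) / count(a). The function names `mine_rules` and `extend_features`, the thresholds, and the confidence-as-weight scheme are illustrative assumptions; the abstract does not specify the paper's mining algorithm, its parameters, or its Spark implementation.

```python
from collections import Counter
from itertools import permutations


def mine_rules(background_docs, min_support=2, min_confidence=0.6):
    """Mine one-to-one rules a -> b from a tokenized background corpus.

    Returns {a: [(b, confidence), ...]} with confidence = count(a, b) / count(a).
    Thresholds are illustrative defaults, not values from the paper.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for doc in background_docs:
        terms = set(doc)  # count each term at most once per document
        term_counts.update(terms)
        # ordered pairs (a, b) with a != b that co-occur in this document
        pair_counts.update(permutations(sorted(terms), 2))

    rules = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_support:
            continue
        confidence = n_ab / term_counts[a]
        if confidence >= min_confidence:
            rules.setdefault(a, []).append((b, confidence))
    return rules


def extend_features(short_text, rules):
    """Add rule consequents to a short text, weighted by rule confidence."""
    features = {term: 1.0 for term in short_text}  # original terms get weight 1.0
    for term in short_text:
        for consequent, confidence in rules.get(term, []):
            if consequent not in short_text:
                # keep the strongest rule if several antecedents imply the same term
                features[consequent] = max(features.get(consequent, 0.0), confidence)
    return features


background = [
    ["spark", "cluster", "rdd"],
    ["spark", "cluster"],
    ["spark", "rdd"],
    ["svm", "kernel"],
    ["svm", "kernel", "margin"],
]
rules = mine_rules(background)
extended = extend_features(["cluster", "job"], rules)
```

Here the sparse text `["cluster", "job"]` gains the term `spark`, because every background document containing `cluster` also contains `spark`. On a Spark cluster, the per-document counting would presumably be expressed as a distributed map and reduce over the corpus, but that design is not described in the abstract.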
Authors
王雯
赵衎衎
李翠平
陈红
孙辉
WANG Wen; ZHAO Kankan; LI Cuiping; CHEN Hong; SUN Hui (Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing 100872, China; School of Information, Renmin University of China, Beijing 100872, China)
Source
《计算机科学与探索》
CSCD
Peking University Core Journal
2017, No. 5, pp. 732-741 (10 pages)
Journal of Frontiers of Computer Science and Technology
Funding
National Social Science Foundation of China, No. 12&ZD220
Keywords
short text classification
feature extension
association rule
Spark platform