
Feature Extension and Category Research for Short Text Based on Spark Platform (Cited by: 9)

Original title: Spark平台下的短文本特征扩展与分类研究
Abstract: Short text classification often suffers from high feature dimensionality, sparse features, and poor classification accuracy. Feature extension is an effective remedy for these problems, but it makes the efficiency bottleneck of short text classification more severe. To improve both the accuracy and the efficiency of short text classification, this paper proposes a feature extension and classification method for short text based on association rule mining on the Spark platform. First, the method uses a background corpus to supplement the features of the original short texts by mining association rules and their confidences. Second, for the classification step, it proposes a cascade support vector machine (SVM) algorithm based on distance selection. Finally, it designs distributed feature extension and classification algorithms on the Spark platform to make short text processing more efficient. Experiments show that, on large data sets, feature extension and classification on a Spark cluster run about 4 times as efficiently as on a conventional single machine. Compared with conventional classification, the method improves classification accuracy by about 15% on average, of which feature extension and classification optimization contribute about 10% and 5% respectively.
Authors: WANG Wen (王雯); ZHAO Kankan (赵衎衎); LI Cuiping (李翠平); CHEN Hong (陈红); SUN Hui (孙辉). Affiliations: Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing 100872, China; School of Information, Renmin University of China, Beijing 100872, China
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 5, pp. 732-741 (10 pages)
Funding: National Social Science Foundation of China, No. 12&ZD220
Keywords: short text classification; feature extension; association rule; Spark platform
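
The first step named in the abstract, supplementing a short text with features mined as association rules from a background corpus, can be illustrated with a minimal single-machine sketch. This is not the authors' algorithm, which the abstract does not spell out: the function names `mine_rules` and `extend_features`, the restriction to one-to-one rules, and the threshold values are all assumptions made for illustration, and the distributed Spark design and the cascade SVM are not shown.

```python
from collections import Counter
from itertools import combinations


def mine_rules(background_docs, min_support=2, min_confidence=0.6):
    """Mine one-to-one rules a -> b from tokenized background documents.

    Hypothetical helper: support is the number of documents containing
    both terms, confidence is support(a, b) / support(a).
    Returns {antecedent: [(consequent, confidence), ...]}.
    """
    term_count = Counter()
    pair_count = Counter()
    for doc in background_docs:
        terms = sorted(set(doc))          # count each term once per document
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))

    rules = {}
    for (a, b), n_ab in pair_count.items():
        if n_ab < min_support:
            continue
        # A frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = n_ab / term_count[x]
            if confidence >= min_confidence:
                rules.setdefault(x, []).append((y, confidence))
    return rules


def extend_features(tokens, rules):
    """Append rule consequents that the short text does not already contain.

    Only the original tokens trigger rules (no transitive expansion).
    """
    extended = list(tokens)
    seen = set(tokens)
    for term in tokens:
        candidates = sorted(rules.get(term, []), key=lambda r: (-r[1], r[0]))
        for consequent, _ in candidates:
            if consequent not in seen:
                seen.add(consequent)
                extended.append(consequent)
    return extended


if __name__ == "__main__":
    # Toy background corpus, invented for this example.
    background = [
        ["spark", "cluster", "rdd"],
        ["spark", "cluster"],
        ["spark", "rdd"],
        ["svm", "kernel"],
        ["svm", "kernel", "margin"],
    ]
    rules = mine_rules(background)
    print(extend_features(["svm", "fast"], rules))
```

In this toy corpus, "svm" and "kernel" always co-occur, so a short text that contains only "svm" gains "kernel" as an extra feature, while "fast" has no rule and is left alone. A real system would likely mine multi-term rules and weight added features by confidence; those choices are outside what the abstract states.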
