
Research on text feature selection based on semantic and classification contribution (cited by: 2)
Abstract Traditional text feature selection algorithms consider neither the semantics of features nor the relationship between features and categories. To address this, the paper proposes a feature selection algorithm that combines semantics and classification contribution. An LDA topic model is used to obtain representations of documents and words, and the importance of each word to a document is obtained by computing the semantic similarity between the word and the document. A Word2vec word-vector model is then used to obtain text category features, and the importance of each word to a category is obtained by computing the semantic similarity between the words in a document and the category features. Finally, the importance of words to documents and the importance of words to categories are combined, and the words with a high classification contribution are selected as the final classification features. Experiments show that the algorithm effectively reduces the number of text features, lowers the computational cost of classification, reduces the impact of noise on classification, and improves classification performance.
Authors JING Yong-xia (景永霞); GOU He-ping (苟和平); WANG Zhi-he (王治和) (College of Information Science and Technology, Qiongtai Normal University, Haikou 571100, Hainan, China; College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China)
Source Journal of Northwest Normal University (Natural Science), indexed in CAS and the Peking University Core Journals list, 2020, No. 1, pp. 51-55, 62 (6 pages)
Funding Natural Science Foundation of Hainan Province (617160, 618MS086); Hainan Province Higher Education Teaching Reform Research Project (Hnjg2017-68)
Keywords LDA; feature selection; text classification; semantic analysis
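
The scoring idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which similarity measure is used or how the two importance values are combined, so cosine similarity, the weighted sum, the weight `alpha`, the cutoff `k`, and all toy vectors below are assumptions made for illustration. In practice the topic-space vectors would come from a trained LDA model and the embeddings from a trained Word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_features(doc_topics, word_topics, word_vecs, class_vec, alpha=0.5, k=2):
    """Score every word of one document and keep the k highest-scoring words.

    doc_topics  : topic-space vector of the document (e.g. from LDA)
    word_topics : dict, word -> topic-space vector of the word (e.g. from LDA)
    word_vecs   : dict, word -> embedding of the word (e.g. from Word2vec)
    class_vec   : embedding-space feature vector of the document's category
    alpha, k    : hypothetical parameters, not taken from the paper
    """
    scores = {}
    for word in word_topics:
        # Importance of the word to the document (topic space).
        doc_importance = cosine(word_topics[word], doc_topics)
        # Importance of the word to the category (embedding space).
        class_importance = cosine(word_vecs[word], class_vec)
        # Assumed combination rule: a simple weighted sum.
        scores[word] = alpha * doc_importance + (1 - alpha) * class_importance
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data standing in for model outputs (all values invented for illustration).
doc_topics = [0.8, 0.2]
word_topics = {
    "goal": [0.9, 0.1],
    "match": [0.7, 0.3],
    "the": [0.5, 0.5],
    "stock": [0.1, 0.9],
}
word_vecs = {
    "goal": [1.0, 0.0],
    "match": [0.8, 0.3],
    "the": [0.0, 1.0],
    "stock": [-0.5, 1.0],
}
class_vec = [1.0, 0.1]  # assumed feature vector for a "sports" category

selected, scores = select_features(doc_topics, word_topics, word_vecs, class_vec)
print(selected)
```

With these toy vectors, words that are close to both the document's topics and the category vector rank first, while a function word and an off-topic word fall below the cutoff, which is the noise-reduction effect the abstract describes.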
