Abstract
To address the problem that traditional text feature selection algorithms consider neither the semantics of features nor the relationship between features and categories, this paper proposes a feature selection algorithm that combines semantics and classification contribution. An LDA topic model is used to obtain representations of documents and words, and the importance of each word to a document is obtained by computing the semantic similarity between the word and the document. A Word2vec word-vector model is then used to obtain text category features, and the importance of each word to a category is obtained by computing the semantic similarity between the words in a document and the category features. Finally, the importance of words to documents and the importance of words to categories are combined, and the words with a high classification contribution are selected as the final classification features. Experiments show that the algorithm effectively reduces the number of text features, lowers the computational cost of classification, reduces the impact of noise on classification, and improves classification performance.
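The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the topic vectors and word vectors are hand-written toy values standing in for LDA and Word2vec output, the category feature is assumed to be the mean of the category's word vectors, and the two importance scores are assumed to be combined by multiplication. The abstract does not specify the combination rule or how the category feature is built, and all function and variable names here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_features(doc_words, doc_topics, word_topics, word_vecs, category_vec, k):
    """Score each distinct word in a document and keep the k highest-scoring words.

    ASSUMPTION: the combined score is the product of the two similarities.
    The source abstract only says the two importances are "combined".
    """
    scores = {}
    for w in set(doc_words):
        # Importance of the word to the document (topic space, LDA-style vectors).
        doc_importance = cosine(word_topics[w], doc_topics)
        # Importance of the word to the category (embedding space, Word2vec-style vectors).
        cat_importance = cosine(word_vecs[w], category_vec)
        scores[w] = doc_importance * cat_importance
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k], scores

# Toy inputs (hypothetical values, not taken from the paper).
doc_words = ["goal", "match", "the"]
doc_topics = [0.9, 0.1]                      # topic distribution of the document
word_topics = {"goal": [0.8, 0.2], "match": [0.7, 0.3], "the": [0.5, 0.5]}
word_vecs = {"goal": [1.0, 0.0], "match": [0.8, 0.2], "the": [0.0, 1.0]}

# ASSUMPTION: the category feature is the mean vector of representative category words.
category_vec = mean_vector([word_vecs["goal"], word_vecs["match"]])

selected, scores = select_features(
    doc_words, doc_topics, word_topics, word_vecs, category_vec, k=2
)
print(selected)
```

In this toy setup the function word "the" is close to neither the document's topic distribution nor the category vector, so it receives the lowest combined score and is dropped. In practice the vectors would come from trained LDA and Word2vec models rather than hand-written lists.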
Authors
JING Yong-xia (景永霞)
GOU He-ping (苟和平)
WANG Zhi-he (王治和)
College of Information Science and Technology, Qiongtai Normal University, Haikou 571100, Hainan, China; College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China
Source
Journal of Northwest Normal University (Natural Science)
CAS
PKU Core (北大核心)
2020, No. 1, pp. 51-55, 62 (6 pages)
Funding
Natural Science Foundation of Hainan Province (617160, 618MS086)
Hainan Province Higher Education Teaching Reform Research Project (Hnjg2017-68)
Keywords
LDA
feature selection
text classification
semantic analysis