
Text Representation Model Based on Huffman-LDA and Weight-Word2vec

Cited by: 5
Abstract: LDA models the global topic-to-document structure, but its features omit the relationships among local words within a document, so it yields only sparse features. Word2vec is a word-embedding model that predicts a target word from its context; document features built on it capture only local information and lack global information. Existing text representation models that combine LDA and Word2vec compute a new feature representation from topic vectors and document vectors, but they directly compute the distance between sparse topic features and word-vector-based document features, so the two kinds of features are not consistent. This paper proposes a text representation model based on Huffman-LDA and Weight-Word2vec. First, after the topic vectors are obtained with the LDA model, a topic Huffman tree is built and the topic vectors are updated by gradient ascent; the new topic vectors capture the relationships between different topic words and are no longer sparse. Next, word weights are computed from the LDA topic vectors and the topic properties of the words in the topic matrix, and these weights are used to update the Word2vec word vectors, so that the word vectors capture the relationships between topic words and can then represent the document vector. Finally, a text representation with strong classification features is obtained from the Euclidean distance between the topic vectors and the document vector. Experimental results show that the method yields stronger text representation features and effectively improves document classification accuracy.
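The last two steps of the abstract (weighting word vectors, then measuring Euclidean distance to topic vectors) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weight formula or the Huffman-tree update, so the sketch assumes that per-word weights and dense topic vectors are already available and that topic vectors share the dimension of the word vectors. The function names `weighted_doc_vector` and `distance_features` and all toy values are hypothetical.

```python
import math


def weighted_doc_vector(tokens, word_vecs, word_weights):
    """Weighted average of word vectors for one document.

    Assumption: word_weights holds one topic-derived weight per word;
    the paper's actual weight formula is not given in the abstract.
    Tokens missing from either table are skipped.
    """
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vecs or tok not in word_weights:
            continue
        w = word_weights[tok]
        for i, x in enumerate(word_vecs[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [x / weight_sum for x in total]


def distance_features(doc_vec, topic_vecs):
    """One Euclidean distance per topic vector (assumed same dimension)."""
    return [math.dist(doc_vec, t) for t in topic_vecs]


# Toy data, invented purely for illustration.
word_vecs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
word_weights = {"cat": 3.0, "dog": 1.0}
topic_vecs = [[0.75, 0.25], [0.75, 3.25]]

doc_vec = weighted_doc_vector(["cat", "dog"], word_vecs, word_weights)
features = distance_features(doc_vec, topic_vecs)
print(doc_vec, features)
```

In this sketch a document becomes a vector with one entry per topic, which could then be passed to any classifier; whether the paper uses the distances in exactly this way is not stated in the abstract.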
Authors: HUANG Chun-yu; HU Di; QIU Ning-jia; SUN Shuang-zi (School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022)
Source: Journal of Changchun University of Science and Technology (Natural Science Edition), 2020, No. 1, pp. 89-96, 132 (9 pages)
Funding: Jilin Province Major Science and Technology Bidding Project (20170203004GX).
Keywords: topic model; word embedding; text representation; Huffman-LDA; Weight-Word2vec