Abstract
Combining probabilistic topic models with word vector models has become a hot topic in topic classification research. Building on this idea, this paper proposes Skip-PTM, a model for webpage topic classification. Skip-PTM draws on the strengths of the LDA topic model and extends the Skip-gram model of Word2Vec: instead of using a word vector to predict context words, it uses a context vector. To study how webpage topics change over time, the paper discretizes the webpage text collection into time windows at a chosen time granularity, then fits Skip-PTM independently within each window to mine topic variation. The experiments use corpus data from Sogou Lab and datasets collected from various web portals. The results show that the proposed method can classify webpage topics by their latent semantics and can uncover trends in topic variation.
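The time-window step described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the abstract does not state the granularity used or the data layout, so the `(date, text)` pair format, the function names `window_key`, `split_into_windows`, and `model_each_window`, and the `fit` callable standing in for Skip-PTM training are all assumptions.

```python
from collections import defaultdict
from datetime import date


def window_key(published: date, granularity: str = "month") -> str:
    """Map a publication date to the label of its time window (hypothetical helper)."""
    if granularity == "year":
        return f"{published.year:04d}"
    if granularity == "month":
        return f"{published.year:04d}-{published.month:02d}"
    if granularity == "day":
        return published.isoformat()
    raise ValueError(f"unsupported granularity: {granularity}")


def split_into_windows(docs, granularity="month"):
    """Group (date, text) pairs into time windows, ordered chronologically."""
    windows = defaultdict(list)
    for published, text in docs:
        windows[window_key(published, granularity)].append(text)
    # The zero-padded keys sort lexicographically in chronological order.
    return dict(sorted(windows.items()))


def model_each_window(windows, fit):
    """Fit one model per window, independently of the other windows.

    `fit` is a placeholder for whatever trains a topic model on a list of texts.
    """
    return {key: fit(texts) for key, texts in windows.items()}
```

Comparing the per-window results in chronological order is then one possible way to look for topic variation; how the paper compares windows is not described in the abstract.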
Authors
耿宜鹏
鞠时光
蔡文鹏
章恒
GENG Yi-peng; JU Shi-guang; CAI Wen-peng; ZHANG Heng (School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China)
Source
《小型微型计算机系统》
CSCD
PKU Core Journal
2020, No. 7, pp. 1395-1399 (5 pages)
Journal of Chinese Computer Systems
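Funding
Supported by the National Key Research and Development Program of China (2016YFD0702001)
Supported by the Postgraduate Research and Practice Innovation Program of Jiangsu Province (5561170021).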