Abstract
To mine the relationships among authors, topics, and time in large-scale scientific literature, this paper proposes the Author-Topic over Time (AToT) model, which takes both the internal and external features of scientific documents into account. In AToT, a document is represented as a mixture of topics in certain probability proportions. Each topic corresponds to a multinomial distribution over words and a beta distribution over time, so the topic-word distribution is shaped not only by word co-occurrence within documents but also by document timestamps. Each author likewise corresponds to a multinomial distribution over topics. The topic-word and author-topic distributions describe, respectively, how topics evolve over time and how authors' research interests change. Model parameters are learned from the document collection by Gibbs sampling. Experiments on a collection of 1,700 NIPS conference papers show that AToT can characterize the latent topic evolution in the corpus, dynamically track changes in authors' research interests, and predict the authors related to a given topic, while achieving lower perplexity than the author-topic model.
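The abstract gives no formulas, so the following is only a minimal sketch of a generative process consistent with its description: one multinomial over topics per author, one multinomial over words per topic, and one beta distribution over normalised time per topic. The symmetric Dirichlet priors, the per-token timestamp draw (in the style of Topics over Time models), the uniform choice of an author per token, and all names and hyperparameter values are assumptions made for illustration, not details taken from the paper. The perplexity helper scores words only, using the usual author-topic mixture averaged over a document's authors.

```python
import math
import random


def sample_dirichlet(rng, alpha, size):
    """Draw from a symmetric Dirichlet via normalised Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs=5, n_authors=3, n_topics=2, vocab_size=8,
                    doc_len=10, alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus; all sizes and priors are illustrative assumptions."""
    rng = random.Random(seed)
    # theta[x]: multinomial over topics for author x.
    theta = [sample_dirichlet(rng, alpha, n_topics) for _ in range(n_authors)]
    # phi[z]: multinomial over words for topic z.
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
    # psi[z]: Beta(a, b) over time in [0, 1] for topic z (arbitrary shapes).
    psi = [(rng.uniform(1.0, 5.0), rng.uniform(1.0, 5.0))
           for _ in range(n_topics)]

    corpus = []
    for _ in range(n_docs):
        authors = rng.sample(range(n_authors), k=rng.randint(1, n_authors))
        tokens = []
        for _ in range(doc_len):
            x = rng.choice(authors)                              # author
            z = rng.choices(range(n_topics), weights=theta[x])[0]  # topic
            w = rng.choices(range(vocab_size), weights=phi[z])[0]  # word
            t = rng.betavariate(*psi[z])                         # timestamp
            tokens.append((x, z, w, t))
        corpus.append({"authors": authors, "tokens": tokens})
    return theta, phi, psi, corpus


def word_perplexity(theta, phi, corpus):
    """exp(-mean log p(w | authors)), ignoring timestamps."""
    log_lik, n_tokens = 0.0, 0
    for doc in corpus:
        for _, _, w, _ in doc["tokens"]:
            p = sum(theta[x][z] * phi[z][w]
                    for x in doc["authors"] for z in range(len(phi)))
            log_lik += math.log(p / len(doc["authors"]))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


if __name__ == "__main__":
    theta, phi, psi, corpus = generate_corpus()
    print("word perplexity under the generating parameters:",
          word_perplexity(theta, phi, corpus))
```

In a real setting the topic assignments and parameters are unknown and would be inferred from the observed words, authors, and timestamps, for example with the Gibbs sampler the abstract mentions; this sketch covers only the forward sampling direction.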
Source
《计算机应用》 (Journal of Computer Applications)
CSCD
PKU Core (北大核心)
2013, Issue 11, pp. 3080-3083 (4 pages)
Keywords
topic model
temporal analysis
unsupervised learning
text model
perplexity