
Microblog bursty topic detection based on topic tree (基于主题树的微博突发话题检测), cited 6 times
Abstract: Traditional topic detection methods handle microblog text poorly because of its nonstandard wording, strong randomness, ambiguous references, and heavy use of Internet slang. To address this, a topic-tree detection method based on the Latent Dirichlet Allocation (LDA) model is proposed. First, related microblogs are organized into a topic tree using the information-entropy-increasing technique from Natural Language Processing (NLP). Combined with a design in which the Dirichlet prior α and the empirical value β vary dynamically with the number of topics, and with the model's dual probability-statistics scheme, the "contribution" of every word in the text is computed; interfering information is thereby removed in advance, eliminating the influence of garbage data on topic detection. Then, these contribution scores are used as improved weights in the Vector Space Model (VSM), and document similarity is computed with them to extract bursty topics and raise detection precision. The proposed method is evaluated from two angles: comparison of F values and manual inspection. Experimental results show that the algorithm not only detects bursty topics but also improves precision by about 3% and 7% over the HowNet model and the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm respectively, and its results agree better with human judgment.
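The pipeline the abstract describes, per-word "contribution" weights feeding a modified VSM in which document similarity is computed by cosine distance, can be sketched roughly as follows. This is a minimal illustration, not the authors' method: `contribution_weights` is a hypothetical frequency-based stand-in for the LDA-derived statistic, since the paper's exact dual-probability computation is not given in the abstract.

```python
import math
from collections import Counter

def contribution_weights(doc_tokens, corpus_tokens_list):
    """Hypothetical 'contribution' score for each word in a document:
    term frequency scaled by an IDF-like factor over the corpus.
    A rough stand-in for the LDA-based weight described in the abstract."""
    tf = Counter(doc_tokens)
    df = Counter()
    for doc in corpus_tokens_list:
        df.update(set(doc))          # document frequency per word
    n_docs = len(corpus_tokens_list)
    return {
        w: (f / len(doc_tokens)) * (math.log((n_docs + 1) / (df[w] + 1)) + 1.0)
        for w, f in tf.items()
    }

def cosine_similarity(w1, w2):
    """Cosine similarity between two sparse weight vectors (the VSM step)."""
    dot = sum(w1[w] * w2[w] for w in set(w1) & set(w2))
    n1 = math.sqrt(sum(v * v for v in w1.values()))
    n2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy usage: two microblogs about the same event score as more similar
# than two about different events, so thresholding pairwise similarity
# groups candidate bursty topics.
docs = [
    ["earthquake", "sichuan", "rescue"],
    ["earthquake", "rescue", "donate"],
    ["movie", "premiere", "star"],
]
weights = [contribution_weights(d, docs) for d in docs]
assert cosine_similarity(weights[0], weights[1]) > cosine_similarity(weights[0], weights[2])
```

In the full method the weights would come from the LDA model's word-topic statistics rather than raw frequencies, but the similarity-and-threshold step operates the same way.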
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 8, pp. 2332-2335 (4 pages).
Funding: National Natural Science Foundation of China (70971059); Liaoning Province Innovation Team Project (2009T045); Liaoning Province Growth Plan for Distinguished Young Scholars in Universities (LJQ2012027).
Keywords: Latent Dirichlet Allocation (LDA); topic tree; semantic similarity; Vector Space Model (VSM); topic detection
