Abstract
Objective: To identify public opinion hotspots by automatically crawling Sina Weibo posts that contain specified keywords and analyzing the collected posts. Methods: A multithreaded crawler first collects posts containing the specified keywords and stores them in a database. The posts are then segmented with the reverse maximum matching method, a string-matching approach to word segmentation, and the TF-IDF weight of each term is computed as the input to text clustering. Finally, k-means clustering is applied to derive the public opinion hotspots. Results and Conclusion: The method automatically collects posts containing the specified keywords from Sina Weibo. After clustering, the posts within each cluster are highly consistent in content and share a common theme, so hot topics can be found quickly. This is of positive value for understanding and guiding public opinion in a timely manner.
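The analysis steps named in the abstract (reverse maximum matching, TF-IDF weighting, k-means) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the tiny dictionary, the sample posts, the particular TF-IDF formula (term frequency times log(N/df)), and the deterministic centroid seeding are all assumptions made for the example. The crawler and database steps are omitted.

```python
import math
from collections import Counter


def rmm_segment(text, dictionary, max_len=4):
    """Reverse maximum matching: scan from the end of the text and, at each
    step, take the longest dictionary word that ends at the current position."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate first, shrinking it from the left.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected right to left
    return tokens


def tfidf_vectors(docs_tokens):
    """Return (vocabulary, vectors) with weight = (count / doc length) * log(N / df).
    Other TF-IDF variants exist; this is one common form (an assumption here)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty posts
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def kmeans(vectors, k, iterations=20):
    """Plain k-means with squared Euclidean distance. Centroids are seeded
    from the first k vectors so that this sketch is deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical dictionary and posts, for illustration only.
dictionary = {"地震", "灾区", "救援", "股市", "下跌", "反弹"}
posts = ["地震灾区救援", "股市下跌", "灾区救援", "股市下跌反弹"]

segmented = [rmm_segment(p, dictionary) for p in posts]
vocab, vectors = tfidf_vectors(segmented)
labels = kmeans(vectors, k=2)
```

With these sample posts, the earthquake-related posts and the stock-market posts fall into separate clusters. A real system would need a full segmentation dictionary, stop-word filtering, and a principled choice of k and of the initial centroids.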
Source
Journal of Baoji University of Arts and Sciences (Natural Science Edition)
CAS
2014, No. 1, pp. 51-54 (4 pages)
Funding
Nanjing Forest Police College Research Project (RWZD201352)
Jiangsu Province Higher Education Teaching Reform Research Project (2013JSJG199)
Keywords
Weibo
crawler
clustering
public opinion