Abstract
Because microblog text is sparse, traditional topic tracking techniques capture only part of a topic's microblogs, and with low accuracy. In addition, topics drift during tracking. To address these two problems, this paper proposes a hot microblog topic tracking method based on cascaded conditional random fields (CCRFs). The method first uses an identification model to mark microblogs that may be related to the topic; the source hot-topic microblogs and the marked microblogs then serve as the observation sequence and the state sequence, respectively, of a classification model that classifies them by relevance. Next, an adaptive model is constructed to update the identification model and mitigate data sparsity; a new observation sequence is selected from the related microblogs, and the remaining ones form the new state sequence for iterative classification. Experiments show that the method improves the comprehensive F-measure by an average of 4.13% over traditional methods.
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2016, No. 4, pp. 56-59, 102 (5 pages)
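Funding
National Natural Science Foundation of China (81360230)
Technology Innovation Fund for Small and Medium-sized Technology-based Enterprises, Ministry of Science and Technology of China (13C26215305404)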
Keywords
Topic tracking
Topic drifting
Cascaded conditional random fields (CCRFs)
Topic dictionary