Abstract
To address the shortcomings of existing topic mining methods, this paper proposes a microblog topic mining method that works at sentence granularity. First, microblog texts are split into sentences at punctuation marks, and nouns and verbs are selected as feature words to represent each sentence. Second, a word similarity matrix is built from the co-occurrence frequencies of high-frequency feature words in the microblog corpus; this matrix supports the computation of sentence similarity, from which a sentence similarity matrix is constructed. Third, cluster analysis is performed on the sentence similarity matrix, and topics are discovered by analyzing the clustering results. Finally, an improved LexRank algorithm computes the importance of the sentences in each topic, and the highest-scoring sentences are combined into a topic summary that describes the topic. Experiments demonstrate the feasibility of the method.
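The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the Jaccard-style co-occurrence measure, the best-match sentence similarity, and the plain (not improved) LexRank power iteration are stand-ins, and all function names are hypothetical. Part-of-speech filtering and the clustering step are omitted; sentences are assumed to arrive as lists of feature words.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(post):
    """Split a post into sentences at common Chinese/English punctuation."""
    parts = re.split(r"[。！？!?；;.]+", post)
    return [p.strip() for p in parts if p.strip()]


def word_similarity(posts_tokens):
    """Return a similarity function built from post-level co-occurrence.

    Assumption: Jaccard-style ratio co(a, b) / (df(a) + df(b) - co(a, b));
    the paper's actual measure is not stated in the abstract.
    """
    df = Counter()  # number of posts containing each word
    co = Counter()  # number of posts containing each word pair
    for tokens in posts_tokens:
        vocab = sorted(set(tokens))
        df.update(vocab)
        co.update(combinations(vocab, 2))

    def sim(a, b):
        if a == b:
            return 1.0
        c = co[(a, b) if a < b else (b, a)]
        denom = df[a] + df[b] - c
        return c / denom if denom else 0.0

    return sim


def sentence_similarity(s1, s2, sim):
    """Average of each word's best match in the other sentence (assumed)."""
    if not s1 or not s2:
        return 0.0
    left = sum(max(sim(a, b) for b in s2) for a in s1)
    right = sum(max(sim(a, b) for a in s1) for b in s2)
    return (left + right) / (len(s1) + len(s2))


def lexrank(matrix, damping=0.85, iterations=50):
    """Plain continuous LexRank by power iteration over a similarity matrix."""
    n = len(matrix)
    scores = [1.0 / n] * n
    row_sums = [sum(row) or 1.0 for row in matrix]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(matrix[j][i] / row_sums[j] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores


# Toy input: each sentence is already reduced to its feature words.
sentences = [
    ["earthquake", "rescue"],
    ["earthquake", "rescue", "donation"],
    ["film", "release"],
]
sim = word_similarity(sentences)
matrix = [[sentence_similarity(a, b, sim) for b in sentences] for a in sentences]
scores = lexrank(matrix)
```

In a fuller version, the sentence similarity matrix would first be clustered into topics, and the ranking would then be run separately inside each cluster to pick the summary sentences.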
Source
《情报学报》
CSSCI
Peking University Core Journal
2014, Issue 6, pp. 623-632 (10 pages)
Journal of the China Society for Scientific and Technical Information
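Funding
One of the research outputs of the National Natural Science Foundation of China project "Research on Integrated Retrieval and Semantic Analysis Methods for Social Media" (Grant No. 71273194)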
Keywords
sentence granularity, word similarity matrix, topic mining