Abstract
Text clustering, an unsupervised machine learning method, has become an important means of organizing, summarizing, and navigating text information, and it is attracting attention from a growing number of researchers. Set against the background of topic discovery and tracking in online forums, this paper obtains topics by performing cluster analysis on forum posts. Starting from the hierarchical clustering algorithm, it introduces the concept of a high weight words collection (高权重词集) and, on that basis, designs and implements an incremental clustering algorithm. Experiments verify that the algorithm adapts to dynamic data and has advantages in time and space complexity, and they support the rationality and necessity of the system architecture adopted in the design.
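The abstract does not give the algorithm's details, so the following is only a minimal Python sketch of the general idea of incremental clustering over sets of high-weight words, not the paper's method. Everything in it is an assumption made for illustration: raw term frequency stands in for the word weight, Jaccard overlap stands in for the similarity measure, and the names `IncrementalClusterer`, `top_words`, `k`, and `threshold` are hypothetical. It is a single-pass scheme and omits the hierarchical clustering step the paper builds on.

```python
from collections import Counter


def top_words(counts, k):
    """Return the k highest-count words as a set (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:k]}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class IncrementalClusterer:
    """Hypothetical single-pass clusterer; each cluster is summarized by its top-k words."""

    def __init__(self, k=5, threshold=0.3):
        self.k = k                  # assumed size of the high-weight word set
        self.threshold = threshold  # assumed minimum overlap needed to join a cluster
        self.clusters = []          # one Counter of word counts per cluster

    def add(self, text):
        """Assign one post to a cluster and return that cluster's index."""
        counts = Counter(text.lower().split())
        words = top_words(counts, self.k)
        best_index, best_score = None, 0.0
        for i, cluster_counts in enumerate(self.clusters):
            score = jaccard(words, top_words(cluster_counts, self.k))
            if score > best_score:
                best_index, best_score = i, score
        if best_index is not None and best_score >= self.threshold:
            # Update the existing cluster in place; earlier posts are never revisited.
            self.clusters[best_index].update(counts)
            return best_index
        # No cluster is similar enough, so the post starts a new topic.
        self.clusters.append(counts)
        return len(self.clusters) - 1


clusterer = IncrementalClusterer(k=3, threshold=0.3)
first = clusterer.add("gpu driver crash")
second = clusterer.add("gpu driver update")
third = clusterer.add("pasta recipe sauce")
```

In this sketch each new post is compared only against compact cluster summaries rather than against every earlier post, which is one plausible reason an incremental scheme would suit the continuously arriving posts of a forum.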
Source
《微计算机信息》
2011, No. 2, pp. 170-172 (3 pages)
Control & Automation
Keywords
text clustering (文本聚类)
high weight words collection (高权重词集)
hierarchical clustering algorithm (层次聚类)
incremental clustering (增量聚类)