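Abstract
Most traditional text clustering methods use word-based text representation models, which consider only the importance of individual words and ignore the semantic relationships between words. Traditional text representation models also suffer from high dimensionality. To address these problems, a text clustering method based on frequent itemsets (frequent itemsets based document clustering method, FIC) is proposed. The method uses the FP-Growth algorithm to mine frequent itemsets from the document collection and represents each document by these frequent itemsets, which greatly reduces the dimensionality of the text representation. It then builds a document network from the similarity between documents and partitions the network with a community detection algorithm, thereby clustering the documents. FIC not only reduces the dimensionality of the text representation but also captures the relationships among the documents in the collection, so that documents are no longer related only through independent pairwise comparisons. Experiments use two English corpora, Reuters-21578 and 20NewsGroup, and one Chinese corpus, the Sogou news dataset, to test the accuracy of the algorithm. The results show that, compared with traditional clustering methods based on the vector space model, the method effectively reduces the dimensionality of the text representation, and that it achieves better clustering results than common clustering methods based on frequent itemsets.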
Traditional document clustering methods use the vector space model (VSM) of words to represent documents. This VSM representation measures only the importance of single words, ignores the semantic relationships between words, and has high dimensionality. In this study, we propose a new document clustering method: FIC (frequent itemsets based document clustering method). In this method, we represent each document by frequent itemsets (where a frequent itemset is a set of frequently co-occurring words) mined from the documents by the FP-Growth algorithm. We then construct a document-document relationship network based on the similarity between pairs of documents under this new representation. Finally, we divide the network into communities using a given community detection method to complete the document clustering. FIC can thereby not only overcome the high dimensionality of VSM, but also make full use of the topological relationships among documents. The experimental results on two English corpora (Reuters-21578 and 20Newsgroup) and one Chinese corpus (Sogou-News) demonstrate that the proposed method FIC is superior to frequent itemsets based methods and other classical state-of-the-art document clustering methods, and that the top K words characterizing each topic of documents identified by FIC are more meaningful than those identified by the classical topic model LDA (latent Dirichlet allocation).
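The pipeline described in the abstract (mine frequent itemsets, represent each document by them, link similar documents, partition the network) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: a brute-force subset count stands in for FP-Growth, Jaccard similarity stands in for the similarity measure (the abstract does not name one), and connected components stand in for the community detection step. All function names, thresholds, and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Brute-force stand-in for FP-Growth: count word subsets up to max_size."""
    counts = Counter()
    for words in docs:
        vocab = sorted(set(words))
        for size in range(1, max_size + 1):
            counts.update(combinations(vocab, size))
    return [frozenset(s) for s, c in counts.items() if c >= min_support]


def represent(docs, itemsets):
    """Represent each document by the indices of the frequent itemsets it contains."""
    return [{i for i, s in enumerate(itemsets) if s <= set(words)} for words in docs]


def jaccard(a, b):
    """Assumed similarity measure between two itemset-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_network(reps, threshold):
    """Link two documents when their similarity reaches the threshold."""
    graph = {i: set() for i in range(len(reps))}
    for i, j in combinations(range(len(reps)), 2):
        if jaccard(reps[i], reps[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def components(graph):
    """Connected components, a simple stand-in for community detection."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters


# Toy corpus, made up for illustration only.
docs = [
    ["stock", "market", "price"],
    ["stock", "market", "trade"],
    ["team", "match", "goal"],
    ["team", "match", "coach"],
]
itemsets = frequent_itemsets(docs, min_support=2)
reps = represent(docs, itemsets)
clusters = components(build_network(reps, threshold=0.5))
print(clusters)
```

In this sketch each document is described by a handful of itemset indices rather than by the whole vocabulary, which is the dimensionality reduction the abstract refers to. A real community detection method would replace `components` when the network is densely connected.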
Authors
张雪松
贾彩燕
Zhang Xuesong; Jia Caiyan (Beijing Key Lab of Traffic Data Analysis and Mining (Beijing Jiaotong University), Beijing 100044; School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044)
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journals (北大核心)
2018, No. 1, pp. 102-112 (11 pages)
Journal of Computer Research and Development
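Funding
National Natural Science Foundation of China, General Program (61473030)
Special project of the State Key Laboratory of Digital Publishing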
Keywords
文本聚类
频繁词集
复杂网络
社区划分
文本表示模型
document clustering
frequent itemsets
complex network
community division
text representation model