A New Documents Clustering Method Based on Frequent Itemsets (cited by 15)
Abstract: Most traditional document clustering methods represent documents with a word-based vector space model (VSM). This representation measures only the importance of individual words, ignores the semantic relationships between words, and suffers from high dimensionality. To address these problems, we propose FIC (frequent itemsets based document clustering method). FIC first mines frequent itemsets (sets of words that frequently co-occur) from the document collection with the FP-Growth algorithm and uses them to represent each document, which greatly reduces the dimensionality of the representation. It then constructs a document-document network from the similarity between pairs of documents under this new representation, and finally divides the network into communities with a given community detection method to complete the clustering. FIC thus not only overcomes the high dimensionality of VSM but also makes use of the topological relationships among documents, so that documents are no longer treated only as independent pairs. Experiments on two English corpora (Reuters-21578 and 20Newsgroup) and one Chinese corpus (the Sogou news dataset) show that FIC effectively reduces the dimensionality of the representation compared with traditional VSM-based clustering, and that it outperforms common frequent itemset based methods and other classical state-of-the-art document clustering methods. The top K words that FIC identifies to characterize each topic are also more meaningful than those produced by the classical topic model LDA (latent Dirichlet allocation).
Authors: Zhang Xuesong; Jia Caiyan (Beijing Key Lab of Traffic Data Analysis and Mining (Beijing Jiaotong University), Beijing 100044; School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044)
Source: Journal of Computer Research and Development (计算机研究与发展; indexed in EI, CSCD, and the Peking University Core Journals list), 2018, No. 1, pp. 102-112 (11 pages)
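Funding: General Program of the National Natural Science Foundation of China (61473030); special project of the State Key Laboratory of Digital Publishing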
Keywords: document clustering; frequent itemsets; complex network; community division; text representation model
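
The pipeline outlined in the abstract (mine frequent itemsets, represent each document by the itemsets it contains, link similar documents into a network, split the network into communities) can be illustrated with a small sketch. This is not the authors' implementation, and several pieces are assumptions made only for illustration: a brute-force counter stands in for FP-Growth, Jaccard similarity with a fixed threshold stands in for the paper's similarity measure, and connected components stand in for the community detection method. All function names, the toy documents, and the parameter values are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Brute-force stand-in for FP-Growth: word sets of size 1..max_size
    that occur in at least min_support documents."""
    counts = {}
    for words in docs:
        vocab = sorted(set(words))
        for size in range(1, max_size + 1):
            for combo in combinations(vocab, size):
                counts[combo] = counts.get(combo, 0) + 1
    return sorted(s for s, c in counts.items() if c >= min_support)


def represent(docs, itemsets):
    """Represent each document by the indices of the frequent itemsets it contains."""
    reps = []
    for words in docs:
        present = set(words)
        reps.append({i for i, s in enumerate(itemsets) if present.issuperset(s)})
    return reps


def jaccard(a, b):
    """Assumed similarity measure: shared itemsets divided by all itemsets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_network(reps, threshold):
    """Undirected document network: an edge joins two documents whose
    similarity is at least the threshold."""
    adj = {i: set() for i in range(len(reps))}
    for i, j in combinations(range(len(reps)), 2):
        if jaccard(reps[i], reps[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def communities(adj):
    """Simplest possible stand-in for community detection:
    connected components found by depth-first search."""
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


# Toy corpus (hypothetical): two finance documents and two sports documents.
docs = [
    ["stock", "market", "trade"],
    ["stock", "market", "price"],
    ["football", "match", "goal"],
    ["football", "goal", "team"],
]
itemsets = frequent_itemsets(docs, min_support=2)
reps = represent(docs, itemsets)
clusters = communities(build_network(reps, threshold=0.5))
print(clusters)  # [[0, 1], [2, 3]]
```

On this toy corpus the representation has 6 dimensions (the frequent itemsets) against a vocabulary of 8 words, which mirrors on a small scale the dimensionality reduction the abstract describes. A real experiment would swap in an actual FP-Growth miner and a proper community detection algorithm.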
