
基于信息融合的网页文本聚类距离选择方法

Metric selection for web text clustering based on information ensembles
Abstract In the current information age, text data is growing rapidly, and finding useful information in it is like looking for a needle in a haystack. Text clustering, a basic technique in text mining, plays an important role. Because there are no predefined classes or class-labelled training examples, choosing a suitable similarity measure for the data is a key problem in text clustering. This paper therefore proposes a new method for measuring text similarity, Adaptive Metric Selection (AMS). The pipeline is: (1) crawl web page content to provide the data source for clustering; (2) segment the content into words and convert it to vector form; (3) extract features; (4) reduce dimensionality with Isomap; and (5) use the adaptive selection method AMS to measure data similarity before clustering. K-means is used as the base clustering algorithm, and two popular clustering quality measures are used to evaluate the final results: (1) Adjusted Rand Index (ARI) and (2) Normalized Mutual Information (NMI). Experimental results show that AMS clearly outperforms the averaged results of clustering independently with each of several similarity measures.
Source 《广州大学学报(自然科学版)》 (Journal of Guangzhou University: Natural Science Edition), CAS, 2016, Issue 1, pp. 80-89, 10 pages
Keywords data mining; feature extraction; clustering ensembles
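
The abstract names ARI and NMI as its two evaluation measures but does not give their formulas. The following is a minimal sketch of both, assuming the standard pair-counting (Hubert and Arabie) form of ARI and the geometric-mean normalisation of NMI; other NMI normalisations exist, and the record does not say which one the paper uses. The function names are illustrative and do not come from the paper.

```python
from collections import Counter
from math import comb, log, sqrt


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two flat labelings of the same items (pair-counting form)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement by convention
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_information(labels_true, labels_pred):
    """NMI = I(U;V) / sqrt(H(U) * H(V)); the normalisation is an assumption."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * log(c * n / (rows[a] * cols[b]))
        for (a, b), c in cells.items()
    )
    h_true, h_pred = entropy(rows), entropy(cols)
    if h_true == 0 or h_pred == 0:
        # At least one labeling is a single cluster; agree only if both are.
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / sqrt(h_true * h_pred)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))            # same partition, renamed labels
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))            # partition that splits every true cluster
    print(normalized_mutual_information(truth, [0, 1, 0, 1]))
```

Both measures ignore the label names and compare only the partitions, which is why they suit clustering, where cluster IDs are arbitrary. ARI is corrected for chance and can be negative; NMI as defined here lies between 0 and 1.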

