
An Improved Text Categorisation Algorithm Based on Centroid (一种改进的基于质心的文本分类算法)

Cited by: 3
Abstract: Text categorisation is an active research topic in data mining and information retrieval and has developed rapidly in recent years. The centroid-based approach builds a model quickly and classifies well, and many researchers have studied it in depth and proposed improvements that steadily raise its performance. This paper proposes a new algorithm that dynamically adjusts the centroid positions according to each sample text in the training set. To address the bottleneck of processing massive data, it also presents parallel strategies for the algorithm on two current parallel computing frameworks, MapReduce and BSP. Comparative experiments against other algorithms on five different datasets show that the method classifies fairly accurately.
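Source: Computer Applications and Software (《计算机应用与软件》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 1, pp. 43-47, 54 (6 pages)
Funding: National Natural Science Foundation of China (61074128, 60905025)
Keywords: text categorisation; centroid vector; dynamic adjustment; parallel computing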
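
The abstract describes a centroid classifier whose centroids are adjusted by each training sample, but it does not give the update rule. The following is a minimal sketch of that general idea, not the paper's algorithm: it assumes dense term-weight vectors, cosine similarity, and a hypothetical error-driven rule that moves the true class centroid toward a misclassified sample and the wrongly predicted centroid away from it. All function names, the `rate` and `epochs` parameters, and the toy data are illustrative assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def initial_centroids(samples, labels):
    """Class centroid = mean of the class's unit-length training vectors."""
    sums, counts = {}, {}
    for vec, label in zip(samples, labels):
        unit = normalize(vec)
        if label not in sums:
            sums[label] = [0.0] * len(unit)
            counts[label] = 0
        sums[label] = [s + u for s, u in zip(sums[label], unit)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, vec):
    """Assign the class whose centroid is most cosine-similar to vec."""
    return max(centroids, key=lambda c: cosine(centroids[c], vec))

def adjust_centroids(centroids, samples, labels, rate=0.1, epochs=10):
    """Hypothetical update rule (an assumption, NOT taken from the paper):
    for each misclassified training sample, move the true class centroid
    toward the sample and the wrongly predicted centroid away from it."""
    centroids = {c: list(v) for c, v in centroids.items()}
    for _ in range(epochs):
        errors = 0
        for vec, label in zip(samples, labels):
            unit = normalize(vec)
            guess = predict(centroids, unit)
            if guess != label:
                errors += 1
                centroids[label] = [m + rate * u
                                    for m, u in zip(centroids[label], unit)]
                centroids[guess] = [m - rate * u
                                    for m, u in zip(centroids[guess], unit)]
        if errors == 0:  # stop early once the training set is separated
            break
    return centroids

# Toy term-weight vectors over a three-word vocabulary (made-up data).
train = [[3, 0, 1], [2, 1, 0], [0, 3, 1], [1, 2, 0]]
labels = ["sports", "sports", "finance", "finance"]
model = adjust_centroids(initial_centroids(train, labels), train, labels)
print(predict(model, [4, 1, 0]))  # prints "sports"
```

The initial centroid computation is a per-class sum followed by a division, so under a MapReduce-style framework it could plausibly be expressed as a map step that emits (class, vector) pairs and a reduce step that averages them. Whether the paper parallelises it this way is not stated in the abstract.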


