Abstract
Text categorisation is an active research topic in data mining and information retrieval and has developed rapidly in recent years. The centroid-based approach builds its model quickly and classifies well; many researchers have studied it in depth and proposed improvements that have steadily raised its performance. This paper proposes a new algorithm that dynamically adjusts the centroid positions according to each sample text in the training set. To address the bottleneck of processing massive data, parallel strategies for the algorithm are also proposed, using two current parallel computing frameworks, MapReduce and BSP. Comparative experiments against other algorithms on 5 different datasets show that the method does achieve fairly accurate classification results.
Source
《计算机应用与软件》
CSCD
PKU Core Journal (北大核心)
2013, Issue 1, pp. 43-47, 54 (6 pages in total)
Computer Applications and Software
Funding
National Natural Science Foundation of China (61074128, 60905025)
Keywords
Text categorisation
Centroid vector
Dynamic adjustment
Parallel computing