Abstract
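Training and testing a classifier on today's large-scale text document collections takes a long time on a single machine. To address this, this paper designs and implements a parallel Bayesian text classification algorithm based on the MapReduce architecture. Experiments on a Hadoop cluster built from ordinary PCs show that, when automatically classifying large document collections, the MapReduce-based Bayesian text classification algorithm achieves near-linear speedup while preserving classification quality.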
To reduce the computation time of training and testing on large-scale document collections, an implementation of a parallel Bayes classification algorithm based on MapReduce is proposed. We studied the performance of our parallel algorithm on a large Hadoop cluster. We report both timing and accuracy results, which indicate that the proposed parallel algorithm based on MapReduce is capable of handling large document collections.
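The abstract does not spell out how the Bayes classifier is split into map and reduce steps. As a rough illustration only, the sketch below assumes a multinomial naive Bayes model with Laplace smoothing, and simulates the map and reduce phases in plain Python on one machine rather than on Hadoop. The function names (`map_doc`, `reduce_counts`, `train`, `classify`) and the key layout are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def map_doc(label, text):
    """Map step (assumed): emit one count per document and one per token."""
    yield (("doc", label, None), 1)
    for word in text.lower().split():
        yield (("word", label, word), 1)


def reduce_counts(pairs):
    """Reduce step (assumed): sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Run map then reduce over (label, text) pairs and build the count tables."""
    pairs = (pair for label, text in docs for pair in map_doc(label, text))
    totals = reduce_counts(pairs)
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for (kind, label, word), n in totals.items():
        if kind == "doc":
            doc_counts[label] = n
        else:
            word_counts[label][word] = n
    return doc_counts, word_counts


def classify(text, doc_counts, word_counts):
    """Pick the label with the highest log posterior (Laplace smoothing assumed)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because the reduce step only sums counts, the map outputs for different documents can be produced independently on different nodes. This is one plausible reason a counting-based classifier of this kind parallelizes well.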
Source
《微计算机信息》
2010, No. 9, pp. 190-191, 176 (3 pages in total)
Control & Automation