Abstract
To address the larger storage capacity and greater computing power that mining and filtering massive volumes of text email demands, this paper proposes a spam filtering method based on the Hadoop cloud computing platform. The approach has three parts. First, relatively isolated data sets are merged into collections of large files that the cloud platform can process easily. Second, a text vector is built for each email according to an evaluation function, converting the email into a structured description. Third, the SVM algorithm is improved on the basis of the MapReduce distributed programming model, so that the computing power of the whole cluster is used to solve for the optimal hyperplane. Experiments show that the method can use a cluster of inexpensive computers in place of costly high-performance machines to mine and filter massive email data, and that classification efficiency rises fairly quickly as the cluster grows.
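The step of selecting feature words with an evaluation function and turning each email into a vector can be illustrated with a small single-process sketch of the map and reduce pattern. The abstract does not say which evaluation function the paper uses, so document frequency stands in for it here as an assumption, and the function names (`map_doc`, `reduce_counts`, `select_features`, `vectorize`) and the sample emails are hypothetical. This is not the paper's implementation and it does not use the Hadoop API.

```python
from collections import Counter

def map_doc(tokens):
    # Map step: emit (term, 1) once for each distinct term in one email.
    return [(term, 1) for term in set(tokens)]

def reduce_counts(pairs):
    # Reduce step: sum the counts per term, giving document frequency.
    df = Counter()
    for term, n in pairs:
        df[term] += n
    return df

def select_features(df, k):
    # Assumed evaluation function: document frequency.
    # Keep the k highest-scoring terms; ties are broken alphabetically.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def vectorize(tokens, features):
    # Structured description: term frequency of each selected feature.
    tf = Counter(tokens)
    return [tf[t] for t in features]

# Hypothetical toy corpus, already tokenized.
emails = [
    "win cash now win".split(),
    "meeting agenda for monday".split(),
    "win free cash prize".split(),
]

pairs = [p for e in emails for p in map_doc(e)]
df = reduce_counts(pairs)
features = select_features(df, 2)
vectors = [vectorize(e, features) for e in emails]
```

On a real cluster the map and reduce functions would run on separate nodes over the merged input files, and the resulting vectors would be the training input for the classifier.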
Source
Wireless Communication Technology (《无线通信技术》)
2013, No. 2, pp. 52-56, 62 (6 pages in total)
Funding
National Natural Science Foundation of China (Grant No. 61202110)