Abstract
The Nutch search engine currently lacks an implementation of PageRank computation. To address this gap, the classical PageRank algorithm is analyzed and then improved by introducing a weighting factor that controls the relative weight of off-site (external) and on-site (internal) links. Exploiting the strength of MapReduce in processing large data sets, a MapReduce-based distributed parallel PageRank algorithm is designed and implemented on a Nutch cluster. Experimental results show that the larger the data volume and the more nodes in the cluster, the more efficiently PageRank is computed; the distributed parallel algorithm also scales well.
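The abstract does not give the modified formula, so the following is only a minimal single-process sketch of how a link-weighted PageRank iteration might be split into map and reduce steps. The names `site_of`, `map_phase`, `reduce_phase`, and `pagerank` are hypothetical, as are the parameter `beta` (the assumed share of weight given to off-site links, with `1 - beta` for on-site links) and the way that factor is applied. None of this is taken from the paper or from the Nutch API.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Return the host part of a URL, used here to tell on-site from off-site links."""
    return urlparse(url).netloc


def map_phase(url, rank, outlinks, beta):
    """Map step: split a page's rank among its outlinks.

    Assumption (not from the paper): an off-site link gets weight beta and an
    on-site link gets weight 1 - beta; weights are normalised per page.
    """
    weights = [beta if site_of(t) != site_of(url) else 1.0 - beta for t in outlinks]
    total = sum(weights)
    if total == 0:
        return  # page passes on no rank (no outlinks, or all weights are zero)
    for target, weight in zip(outlinks, weights):
        yield target, rank * weight / total


def reduce_phase(contributions, damping, n_pages):
    """Reduce step: combine the contributions received by one page."""
    return (1.0 - damping) / n_pages + damping * sum(contributions)


def pagerank(graph, beta=0.7, damping=0.85, iterations=20):
    """Run the map and reduce steps in-process for a fixed number of iterations.

    graph maps each URL to its list of outlinks; every target is assumed to be
    a key of graph as well. Rank held by pages without outlinks is not
    redistributed in this sketch.
    """
    n_pages = len(graph)
    ranks = {url: 1.0 / n_pages for url in graph}
    for _ in range(iterations):
        grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
        for url, outlinks in graph.items():
            for target, share in map_phase(url, ranks[url], outlinks, beta):
                grouped[target].append(share)
        ranks = {url: reduce_phase(grouped[url], damping, n_pages) for url in graph}
    return ranks


example_graph = {
    "http://a.example/1": ["http://a.example/2", "http://b.example/1"],
    "http://a.example/2": ["http://a.example/1"],
    "http://b.example/1": ["http://a.example/1"],
}
example_ranks = pagerank(example_graph)
```

In a real MapReduce job the `grouped` dictionary would be replaced by the framework's shuffle, and each iteration would be one job whose output feeds the next.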
Source
《计算机工程与设计》
CSCD
Peking University Core Journals (北大核心)
2010, No. 20, pp. 4354-4356, 4409 (4 pages)
Computer Engineering and Design
Funding
Guangxi Science Foundation Project (桂科自0832059)