This paper proposes a sampling-based mapping partitioning optimization method to address data skew, a common problem when the MapReduce framework processes large-scale data. The method uses reservoir sampling to obtain information on the data distribution, then combines an overall data distribution estimation algorithm with a mapping partitioning algorithm to partition the data evenly. Experimental results show that the method performs well under different degrees of skew, significantly reducing job execution time, improving partition balance, and raising cluster resource utilization.
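The pipeline described above (sample, estimate the overall distribution, then map keys to partitions) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reservoir sampler is the classic Algorithm R, while `estimate_key_counts` (linear scaling of sample counts) and `assign_keys_to_partitions` (a greedy largest-first heuristic) are hypothetical stand-ins, since the abstract does not specify the estimation or mapping algorithms.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Draw a uniform sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_key_counts(sample, total_records):
    """Assumed estimator: scale each key's sample count up to the full data size."""
    if not sample:
        return {}
    scale = total_records / len(sample)
    return {key: count * scale for key, count in Counter(sample).items()}


def assign_keys_to_partitions(estimated_counts, num_partitions):
    """Assumed heuristic: give the heaviest remaining key to the lightest partition."""
    loads = [0.0] * num_partitions
    assignment = {}
    for key, count in sorted(estimated_counts.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += count
    return assignment, loads
```

In a real MapReduce job, the resulting `assignment` would be consulted by a custom partitioner in place of the default hash partitioner. Keys that never appear in the sample would still need a fallback rule, which is omitted here.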
Deep learning (DL) techniques, and Convolutional Neural Networks (CNNs) in particular, have become increasingly popular in data science and have succeeded in a wide range of applications, including computer vision, speech, and natural language processing. However, training CNNs is computationally expensive, especially on very large datasets. To overcome this obstacle, this paper uses distributed frameworks and cloud computing to develop a parallel CNN algorithm. MapReduce is a scalable, fault-tolerant data processing tool developed to improve large-scale data-intensive applications on clusters. This work develops a MapReduce-based CNN (MCNN) for image classification. The proposed MCNN also adds dropout layers to the network to counter overfitting. The implementation of MCNN and the way the proposed algorithm accelerates learning are examined closely and demonstrated through experiments. Results show high classification accuracy and significant improvements in speedup, scaleup, and sizeup compared to the standard algorithms.
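As a brief illustration of the dropout idea mentioned above, the following sketch applies "inverted" dropout to a flat list of activations. It is an assumption-laden toy, not the MCNN code: the function name `dropout`, the `drop_prob` parameter, and the choice of the inverted variant are all illustrative, and the abstract does not say which variant or rate the authors use.

```python
import random


def dropout(activations, drop_prob, training=True, rng=None):
    """Zero each activation with probability drop_prob during training.

    Survivors are scaled by 1 / (1 - drop_prob) so the expected value of each
    activation is unchanged, and inference needs no rescaling.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At inference time the layer is the identity.
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]
```

Because each unit is dropped independently, the network cannot rely on any single activation, which is the usual explanation for why dropout reduces overfitting.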
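To improve the parallel image processing capability of deep convolutional neural networks (DCNNs), and to raise their image recognition accuracy and runtime efficiency, this study uses the MapReduce parallel computing framework and the Image to Column (Im2col) algorithm to carry out parallel extraction and filtering of raw image features, parallel model training, and parallel parameter updates, yielding a parallel DCNN optimization algorithm. In the performance evaluation, a fully connected neural network and a DCNN algorithm based on feature maps and parallel computing entropy serve as baselines. The comparison covers four metrics: Top-1 accuracy, floating-point operation count, loss function oscillation, and running time. The results show that the proposed parallel DCNN optimization algorithm performs best.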
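The Im2col step named above can be illustrated with a small dependency-free sketch. It assumes a single-channel image stored as a list of rows, no padding, and a single kernel. The helper names `im2col` and `conv2d_via_im2col` are hypothetical and are not taken from the paper. The point is only that unrolling each patch into a row turns convolution into a row-by-vector product, which is easy to split across parallel workers.

```python
def im2col(image, kernel_h, kernel_w, stride=1):
    """Unroll every kernel-sized patch of a 2-D image into one flat row."""
    height, width = len(image), len(image[0])
    out_h = (height - kernel_h) // stride + 1
    out_w = (width - kernel_w) // stride + 1
    rows = []
    for top in range(0, out_h * stride, stride):
        for left in range(0, out_w * stride, stride):
            patch = [image[top + dy][left + dx]
                     for dy in range(kernel_h)
                     for dx in range(kernel_w)]
            rows.append(patch)
    return rows, out_h, out_w


def conv2d_via_im2col(image, kernel, stride=1):
    """Cross-correlate image with kernel as (patch matrix) x (flat kernel)."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    rows, out_h, out_w = im2col(image, kernel_h, kernel_w, stride)
    # Each output pixel is the dot product of one patch row with the kernel.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in rows]
    # Fold the flat result back into an out_h x out_w feature map.
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]
```

Each row of the patch matrix is independent of the others, so in a MapReduce setting the rows could in principle be distributed across mappers. How the paper actually partitions the work is not stated in the abstract.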