Funding: Supported by the National Natural Science Foundation of China (Grant No. 62402525) and the Fundamental Research Funds for the Central Universities (Grant No. 2462023YJRC023).
Abstract: With the support of more precision formats in emerging hardware architectures, mixed precision has become a popular approach to accelerating deep learning (DL) training. Applying low-precision formats such as FP16 and BF16 to neural operators can save GPU memory while improving bandwidth utilization. However, DL frameworks use black and white lists as the default mixed-precision selection and cannot flexibly adapt to the variety of neural networks. In addition, existing work on automatic precision adjustment does not consider model convergence, and the decision cost of precision selection is high. To address these problems, this paper proposes CoMP, a non-intrusive framework for Convergence-aware operator-wise Mixed-Precision training. CoMP uses a two-stage precision adjustment based on epochs and batches to ensure convergence and performance, respectively. CoMP then performs subsequent training according to the searched optimal operator-wise mixed-precision plan. Experimental results on an A100 GPU show that CoMP achieves a maximum speedup of 1.15× over the PyTorch AMP implementation, while also saving up to 29.81% of GPU memory.
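The operator-wise precision search described in the abstract can be illustrated with a toy sketch: a greedy pass that casts the most profitable operators to FP16 while a loss-impact budget stands in for the convergence check. All names, numbers, and the budget heuristic here are illustrative assumptions, not the actual CoMP implementation.

```python
def select_fp16_operators(operators, loss_delta, speedup, max_loss_delta=0.01):
    """Greedily choose which operators to run in FP16.

    operators:  list of operator names
    loss_delta: op -> estimated increase in validation loss if cast to FP16
    speedup:    op -> estimated speedup from casting that op to FP16

    Operators are tried in order of decreasing speedup; one is cast to FP16
    only while the accumulated loss impact stays under max_loss_delta
    (a hypothetical stand-in for a convergence-awareness criterion).
    """
    plan = {}
    total_delta = 0.0
    for op in sorted(operators, key=lambda o: speedup[o], reverse=True):
        if total_delta + loss_delta[op] <= max_loss_delta:
            plan[op] = "fp16"
            total_delta += loss_delta[op]
        else:
            plan[op] = "fp32"
    return plan

# Illustrative per-operator estimates (made up for this sketch):
ops = ["matmul", "softmax", "layernorm"]
deltas = {"matmul": 0.002, "softmax": 0.02, "layernorm": 0.005}
gains = {"matmul": 1.8, "softmax": 1.1, "layernorm": 1.2}
plan = select_fp16_operators(ops, deltas, gains)
# matmul and layernorm fit the budget; softmax would exceed it and stays FP32.
```

In a real system the loss deltas would come from short probing runs (e.g., CoMP's epoch- and batch-level stages) rather than from a static table.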
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62402525) and the Fundamental Research Funds for the Central Universities (Grant No. 2462023YJRC023).
Abstract: Graph neural networks (GNNs) are well suited to GPUs with high computing capability due to their massive arithmetic operations. Compared with mini-batch training, full-graph training does not require sampling of the input graph and halo regions, avoiding potential accuracy losses. Current deep learning frameworks evenly partition large graphs to scale GNN training to distributed multi-GPU platforms. At the same time, the rapid evolution of hardware requires technology companies and research institutions to frequently update their equipment to cope with the latest tasks. This results in large-scale clusters with a mixture of GPUs of various computational capabilities and hardware specifications. However, existing works fail to adapt sub-graphs to different GPU generations, leading to inefficient resource utilization and degraded training efficiency. Therefore, we propose νGNN, a Non-Uniformly partitioned full-graph GNN training framework for heterogeneous distributed platforms. νGNN first models the GNN processing ability of the hardware based on various theoretical parameters. Then, νGNN automatically derives a reasonable task partitioning scheme by combining hardware, model, and graph dataset information. Finally, νGNN implements an irregular graph partitioning mechanism that allows GNN training tasks to execute efficiently on distributed heterogeneous systems. Experimental results show that in real-world scenarios with a mixture of GPU generations, νGNN outperforms other static partitioning schemes based on hardware specifications.
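The non-uniform partitioning idea, assigning sub-graph sizes in proportion to a modeled per-GPU processing ability, can be sketched as follows. The capability scores and the largest-remainder rounding are illustrative assumptions, not νGNN's actual hardware model or partitioner.

```python
def partition_sizes(num_nodes, capability):
    """Split num_nodes graph vertices across GPUs in proportion to a
    per-device capability score (e.g., some blend of peak FLOPS and
    memory bandwidth). Largest-remainder rounding keeps the total exact.
    """
    total = sum(capability)
    raw = [num_nodes * c / total for c in capability]
    sizes = [int(r) for r in raw]          # floor of each proportional share
    leftover = num_nodes - sum(sizes)      # nodes lost to flooring
    # hand leftover nodes to the devices with the largest fractional parts
    order = sorted(range(len(raw)), key=lambda i: raw[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

# e.g., one new GPU scored 4.0 alongside two older ones scored 2.0 and 1.0:
print(partition_sizes(1000, [4.0, 2.0, 1.0]))  # → [571, 286, 143]
```

A uniform partitioner would assign each device 1000/3 ≈ 333 vertices, leaving the fastest GPU idle while the slowest one straggles; the proportional split balances per-device work instead. A real system would additionally minimize edge cuts between the resulting sub-graphs, which simple size-proportional splitting does not address.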